Not sure how relevant this is, since my application is autonomous robots on
actual hardware, but maybe it's useful…
I used OpenCV with a carrier board (the "StereoPi") for the Raspberry Pi
Compute Module that breaks out both camera ports on the Pi. I automated face
recognition with OpenCV-based code from Adrian Rosebrock (pyimagesearch.com)
that employs Haar cascades to determine whether a face is present. Once a face
is detected, it sends the center-of-face data to another Pi (the robots have
three Pis in them – "cores" – a vision acquisition core, a language core and a
vision processing core). The vision processing core (depending on the state the
robot is in) takes this face-position data, chews on it, and sends the
corresponding servo signals to the motor core that controls the head and eyes,
and the robot follows you with its gaze and head movements. So in theory, face
*detection* and tracking are always functionally available, but may be
overridden/ignored by other behavioral commands/statuses.
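In skeletal form, the acquisition-core loop looks something like this (a
minimal sketch using OpenCV's bundled frontal-face Haar cascade and a ZeroMQ
PUB socket – the address, message layout and function names here are
illustrative, not my exact code):

```python
def face_center(rect):
    """Center (x, y) of an (x, y, w, h) detection rectangle."""
    x, y, w, h = rect
    return (x + w // 2, y + h // 2)

def run_detector(publish_addr="tcp://*:5555"):
    # Imports deferred so the helper above stays usable without the
    # camera stack installed.
    import cv2
    import zmq

    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    pub.bind(publish_addr)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces):
            # Publish the center of the largest face for the processing core.
            cx, cy = face_center(max(faces, key=lambda r: r[2] * r[3]))
            pub.send_json({"topic": "face", "cx": int(cx), "cy": int(cy)})
```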
The language processing side of things is always listening (I use Python
SpeechRecognition with PocketSphinx as the recognizer, which works surprisingly
well) and now has several hundred routines it can engage depending on what it
hears, plus some conflict-resolution and buffering code in case responses to
one phrase would interfere with ongoing responses playing out.
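The always-listening loop is basically the stock SpeechRecognition pattern
with PocketSphinx as the offline recognizer. A stripped-down sketch (the
dispatch table here stands in for my several hundred routines):

```python
def dispatch(heard, routines):
    """Return the first routine whose trigger phrase appears in the text."""
    heard = heard.lower()
    for trigger, action in routines.items():
        if trigger in heard:
            return action
    return None

def listen_loop(routines):
    # Deferred import so the dispatch helper runs without the audio stack.
    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.Microphone() as source:
        r.adjust_for_ambient_noise(source)
        while True:
            audio = r.listen(source)
            try:
                heard = r.recognize_sphinx(audio)  # offline PocketSphinx
            except sr.UnknownValueError:
                continue  # couldn't make out any words; keep listening
            action = dispatch(heard, routines)
            if action:
                action()
```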
The system is set up so that if I use a phrase like "my name is" or "I'd like
to introduce you to" (or one of several similar phrases recognized by a
fuzzy-logic kind of similarity finder I wrote), *AND* it can tell a face is
present, it can filter out the name given, if any. Then a few things happen –
first, the language processor confirms the name by speaking "Hello <name> –
did I get that right?" and listens for a variety of words that are either
affirming or denying.
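My similarity finder is fancier than this, but a stdlib difflib version
conveys the idea (the phrase list, threshold and name-extraction rule are
illustrative stand-ins, not my actual code):

```python
from difflib import SequenceMatcher

INTRO_PHRASES = ["my name is", "i'd like to introduce you to"]

def match_intro(heard, threshold=0.8):
    """Best fuzzy match of an intro phrase against the utterance's start."""
    heard = heard.lower()
    best, best_score = None, 0.0
    for phrase in INTRO_PHRASES:
        score = SequenceMatcher(None, phrase, heard[:len(phrase)]).ratio()
        if score > best_score:
            best, best_score = phrase, score
    return best if best_score >= threshold else None

def extract_name(heard):
    """If an intro phrase matched, treat the next word as the name."""
    phrase = match_intro(heard)
    if phrase is None:
        return None
    # Slice past the matched phrase's length so near-miss recognitions
    # ("my nane is dave") still yield the right word.
    tail = heard.lower()[len(phrase):].split()
    return tail[0].capitalize() if tail else None
```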
On affirmation, the system immediately begins taking snapshots every 10 frames
and stores them in a folder (the new-faces dataset) named with the person's
name plus the date and time as a numeric string (Dave-202202251623, for
example). Once the person either exits the view for more than 100 frames
(which would be 10 snapshots) or the system has gained 100 actual face
snapshots, it hands those images off to another of Adrian Rosebrock's scripts
(encode_faces.py), which encodes the faces and turns the whole batch into a
pickle that is then appended to the bigger pickle that all the other known
faces are in… The name and data are also written to the database of "people
known", where additional data is written over time as interactions with that
person accrue.
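The folder naming and the pickle append are straightforward bookkeeping. A
sketch, assuming the same {"encodings": [...], "names": [...]} layout the
pyimagesearch scripts produce (paths and function names illustrative):

```python
import pickle
from datetime import datetime
from pathlib import Path

def session_folder(name, when=None):
    """Folder name in the <Name>-<YYYYMMDDHHMM> convention."""
    when = when or datetime.now()
    return f"{name}-{when.strftime('%Y%m%d%H%M')}"

def append_encodings(session_pickle, master_pickle):
    """Merge one person's freshly encoded session into the master pickle."""
    with open(session_pickle, "rb") as f:
        session = pickle.load(f)
    master_path = Path(master_pickle)
    if master_path.exists():
        with master_path.open("rb") as f:
            master = pickle.load(f)
    else:
        master = {"encodings": [], "names": []}
    master["encodings"].extend(session["encodings"])
    master["names"].extend(session["names"])
    with master_path.open("wb") as f:
        pickle.dump(master, f)
```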
So I'm not sure if this answers your question about integrating it into the
speech subsystem – I basically have the audio input and processing, the audio
output, and the visual input and processing all running in parallel on
separate physical SBCs, which all talk to each other via ZeroMQ (PyZMQ,
specifically). It works very well, is reasonably fast (especially given it
only runs on 8 GB Pi 4 SBCs), and gives people interacting with it the
unmistakable feeling that the robot sees them, responds to their movements and
speech, etc., and remembers them.
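The inter-core plumbing is plain PUB/SUB. In sketch form (addresses, topic
names and the message framing are illustrative, not my actual wiring):

```python
import json

VISION_PUB = "tcp://vision-core:5555"  # illustrative address

def make_msg(topic, **payload):
    """Frame messages as 'topic json' so SUB sockets can filter by topic."""
    return f"{topic} {json.dumps(payload)}"

def parse_msg(raw):
    topic, _, body = raw.partition(" ")
    return topic, json.loads(body)

def vision_processing_core():
    # Deferred import so the framing helpers need only the stdlib.
    import zmq

    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.connect(VISION_PUB)
    sub.setsockopt_string(zmq.SUBSCRIBE, "face")
    while True:
        topic, data = parse_msg(sub.recv_string())
        # data["cx"], data["cy"] would be mapped to pan/tilt servo targets
        # here and forwarded on to the motor core.
```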
The drawback that I haven't done anything about in the past year or so – but
which has a relatively easy fix – is that the pickle data for a given person
ages (my grandkids are no longer reliably recognized, since they were 3 and 5
when I first implemented that build and they are 6 and 8 now). So I need to
add a routine that occasionally and silently updates the images in the
recognition pickle in the background, to keep up with changes… but I've not
had the time I wanted to do these things…
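The fix I have in mind is pure bookkeeping: each time a known person is
confidently recognized, fold a few fresh encodings in and drop their oldest
ones. A data-only sketch (the cap of 100 mirrors my snapshot count; the
layout is the same names/encodings pickle; the function is hypothetical):

```python
def refresh_person(master, name, fresh, keep=100):
    """Fold fresh encodings for `name` into the master data, dropping that
    person's oldest entries so the per-person total stays at `keep`."""
    pairs = list(zip(master["names"], master["encodings"]))
    theirs = [enc for n, enc in pairs if n == name]
    others = [(n, enc) for n, enc in pairs if n != name]
    theirs = (theirs + list(fresh))[-keep:]  # newest `keep` survive
    return {
        "names": [n for n, _ in others] + [name] * len(theirs),
        "encodings": [enc for _, enc in others] + theirs,
    }
```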
If any of this gives you anything useful to pick from, I can get you code,
original source and my custom stuff. It's all Python, so I'm guessing you
should be good with that.
Dave
From: [email protected] <[email protected]> On Behalf Of Linas
Vepstas
Sent: Friday, February 25, 2022 3:33 PM
To: opencog <[email protected]>
Subject: Re: [opencog-dev] Vision for pi_vision and AGI/atomspace
Hi Mark,
Preface for anyone else reading this: Mark is dusting off the old Hanson
Robotics code for Eva. One of the subsystems was face-tracking. When your
webcam was calibrated correctly, Eva had this uncanny ability to look at you
from out of the screen: her eyes would track your position. It was really
pretty cool, as you got the sense she was looking at you.
Anyway, it seems that Mark has this code working again, or almost working? A
related gotcha is that some of the camera transforms in Blender needed to be
adjusted to accurately reflect that you sit about an arm's length away from
your computer screen, which is small on laptops but big on desktops, etc., so
eye tracking didn't work right if all these dimensions weren't accounted for.
It was kind of tricky to get it all right. But when it worked, it was really
cool and even spine-tingling.
What about face recognition? This too worked, in a limited setting: she could
recognize a handful of faces, and pull out the names of those people from a
database. There are then three questions: how did this work back then, how can
it be made to work in the short term, and what is the correct long-term
architecture?
First part: "how did it work back then"? See
https://github.com/opencog/ros-behavior-scripting The code might be bit-rotted,
but it worked. (There was some radical meatball surgery towards the end; this
might need to be revisited.) The general philosophy, back then, was that:
* The 3D locations of objects (such as faces) would be stored in the opencog
"spacetime server".
* The only reason to do this was so that there could be an API for verbal
propositions: near, far, next to, behind, in front of, to the left of, etc.
that the language subsystem could use. That API was never built.
* The AtomSpace would hold all information about everything, e.g. face #135 is
actually Ben, who is NN years old, lives in YY, loves robots, and is standing
"next to" David (as reported by the space-server)
* Why the AtomSpace? Because it's the obvious place where current sensory info
(sight and sound) can be integrated with long-term knowledge and memories, as
well as with the dialog/language subsystem and with controlling movement and
behaviour (turn left, right, blink and smile...)
* Unfortunately, integrating the senses together with the background knowledge
is hard. It was done in an ad hoc manner; it was under-documented, hard to
use, and hard to understand. An adequate framework was never developed. This
is not something one college student can knock out in a few weeks. The
foundation for that framework is in the ros-behavior-scripting git repo.
Fragments are in other places; I'd have to dig them up.
So ... back to the question: face recognition: Sure. Whatever. If you have a
module that can recognize faces, then sure, whatever, have it forward that info
to the AtomSpace. That's the easy part. The hard part is to integrate it into
the speech subsystem. So, when a new person appears in front of the camera,
and says "Hi, my name is Mark", something has to extract the word "Mark",
realize that "Mark" is someone's name, understand that there is probably a
real-time correlation between that name and what the camera is seeing, take a
snapshot of what the camera is seeing, and permanently tag that image with the
name "Mark". To remember it. So that, minutes later, when Mark leaves the room
and comes back, or months later, after a reboot, Eva still remembers what Mark
looks like, as well as his favorite color, sports-team, childhood hero,
mother's maiden name, last four digits of his soc sec and bank account #.
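In quick-hack form, that correlation step is tiny (the regex, ids and dict are
illustrative – the theoretically-correct version would bind the name to the
face in the AtomSpace, not a Python dict):

```python
import re

NAME_INTRO = re.compile(r"\bmy name is (\w+)", re.IGNORECASE)

def tag_current_face(utterance, current_face_id, memory):
    """If an utterance introduces a name while a face is in view, bind the
    name to that face id in persistent memory."""
    m = NAME_INTRO.search(utterance)
    if m and current_face_id is not None:
        memory[current_face_id] = m.group(1).capitalize()
        return memory[current_face_id]
    return None
```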
I think all that is doable, and there are many different ways of doing the
above, from quick short hacks to complicated theoretically-correct approaches
... but .. this email is too long, so, let me leave it at that.
-- Linas
On Fri, Feb 25, 2022 at 8:16 AM Mark Wigzell
<[email protected]<mailto:[email protected]>> wrote:
Hi folks, my subject stems from having recently done a deep dive into the
pi_vision implementation. The original face detection and tracking had rusted,
so I revamped it. In doing so, I added a hook for eventually augmenting the
"new_face" message with some face recognition. I was informed that rather than
splicing a face recognition algorithm in at the pi_vision level, the "vision"
would be to have the image elements reach the AtomSpace, and thus allow
recognition to occur at a more basic level.
Therefore, pursuant to the above, I'm asking for a high-level description of
how AGI vision could be accomplished. Perhaps we can also address the question
of why face detection and tracking are "ok" but face recognition is not? Maybe
all processing should be done at a lower level?
--
You received this message because you are subscribed to the Google Groups
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/opencog/CA%2Ba9A7AYNxawVTjbn5sQXp7AjToj1xteyCnCibrBO7TZwDDsSQ%40mail.gmail.com.
--
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.