Hi Dave,

Thank you for that nice note!  I want to splice in some comments with my
own real-world experience ....

On Fri, Feb 25, 2022 at 3:42 PM xanatos xanatos.com <[email protected]>
wrote:

> Not sure if this is cogent since my application is autonomous robots in
> actual hardware, but maybe useful…
>
>
>
> I used OpenCV with a carrier board ("StereoPi") for the Raspberry Pi
> Compute Module that breaks out both camera ports on the Pi.  I automated
> face recognition with code that leveraged OpenCV that I came to find from
> one Adrian Rosebrock (pyimagesearch.com) that employed Haar Cascades to
> determine there was a face present.
>

The code we used also rested on a Haar cascade. It "worked great" if you
were in conventional office lighting, and faced the camera squarely.  It
failed if you turned quarter-face, or showed a profile. It failed if your
office had windows and the shade wasn't drawn. It failed in direct
sunlight, outdoors, and in stage-demo and trade-show lighting.  We
considered a medical-training robot application, where the first responder
would be kneeling over the robot-dummy, and so their face would be at
right-angles to the camera.  The Haar cascade can't do that. (We never did
find a better solution, either, at least while I was there.)

The Haar cascade was able to measure the distance between the eyes, and
thus able to estimate the distance to the face, and thus able to get the
parallax right when steering the robot eyes to focus on the right spot.
(The two eyes in Blender move automatically, so in principle, you could
have a cross-eyed animation, or a roll-your-eyes animation, but we never
did that.) The depth was noisy. We used an alpha-beta filter to smooth out
the jitter.
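For concreteness, here's a minimal sketch of the two pieces just described: a pinhole-camera distance estimate from the pixel spacing between the eyes, and an alpha-beta filter to smooth the jitter. The focal-length and inter-pupil constants are illustrative guesses, not the values we actually used:

```python
# Sketch only: estimate face distance from the pixel spacing between the
# eyes, then smooth the noisy estimates with an alpha-beta filter.
# All constants are illustrative, not from the original robot code.

FOCAL_LENGTH_PX = 600.0   # hypothetical camera focal length, in pixels
EYE_SPACING_M = 0.063     # typical human inter-pupil distance, ~63 mm

def distance_from_eyes(eye_spacing_px):
    """Pinhole-camera estimate: distance = f * real_size / pixel_size."""
    return FOCAL_LENGTH_PX * EYE_SPACING_M / eye_spacing_px

class AlphaBetaFilter:
    """Classic alpha-beta tracker: smooths a position and its rate."""
    def __init__(self, alpha=0.5, beta=0.1, dt=1.0 / 30.0):
        self.alpha, self.beta, self.dt = alpha, beta, dt
        self.x = None   # smoothed position (here, distance in meters)
        self.v = 0.0    # smoothed rate of change

    def update(self, measurement):
        if self.x is None:          # initialize on the first sample
            self.x = measurement
            return self.x
        predicted = self.x + self.v * self.dt   # predict one step ahead
        residual = measurement - predicted      # correct by the residual
        self.x = predicted + self.alpha * residual
        self.v = self.v + (self.beta / self.dt) * residual
        return self.x

# Jittery eye-spacing measurements (pixels) around a true value of 42.
filt = AlphaBetaFilter()
for px in [40.0, 44.0, 41.0, 43.0, 42.0, 45.0, 40.0]:
    smoothed = filt.update(distance_from_eyes(px))
```

Tuning alpha and beta trades responsiveness against smoothness; a Kalman filter would do this adaptively, but the fixed-gain version was enough for steering eye gaze.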

I've heard vague intimations that neural nets can do better, but if so, I
suspect all available systems are proprietary and expensive (and wouldn't
run on a Pi, anyway)  I do have some general ideas on how to improve on
this situation, but it would blow up this email.


> Once a face is detected, it sends the center-of-face data to another Pi
> (the robots have three Pis in them – "cores" – a vision acquisition "core",
> a language "core" and vision processing core).  The vision processing core
> (depending on the state the robot is in) takes this face positioning data,
> chews on it and sends the corresponding servo signals to the motor core
> that controls the head and eyes, and the robot follows you with its gaze
> and head movements.  So in theory, face **detection** and tracking are
> always functionally available, but may be overridden/ignored by other
> behavioral commands/statuses.
>
>
>
> The language processing side of things is always listening (I use python
> speech recognition with PocketSphinx as the recognizer which works
> surprisingly well)
>

I never experimented directly with this, but everyone turned up their noses
at this, and opted for a real-time internet connection to google speech.
In retrospect, I'm wondering if this is because all the developers either
had a heavy foreign accent, or had a habit of slurring their speech and/or
mumbling. At any rate, trade-show floors are problematic, what with the
sonic assault of neighboring booths. Questions from the audience via
microphones are also a problem, although there, you could get a direct
audio cable from the mixing board that the stage techs were running.

The point here is that in natural settings, audio quality is an issue. I'm
not aware of the current state-of-the-art with regards to neural nets. I
suspect that, again, the solutions are proprietary, expensive, and don't
run on Pi. But I dunno, i'm pretty much 100% totally unplugged from that
world.


> and now has several hundred routines it can engage depending on what it
> hears, and some conflict resolution and buffering code in case responses to
> one phrase would interfere with ongoing responses playing out).
>
>
>
> The system is set up so that if I use a phrase like "my name is", or "I'd
> like to introduce you to"
>

We had three versions. One was to feed text into AIML. There's an
AIML-to-AtomSpace converter. It worked as well as "native" AIML chatbots,
except that it took several minutes on startup to load the database.  That
was almost fatal.

It's easy, "trivial", to write custom response rules in AIML. If I recall
the syntax, it's something like:

    <category>
      <pattern>MY NAME IS *</pattern>
      <template>Pleased to meet you, <star/></template>
    </category>
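A rough stdlib-Python equivalent of such a wildcard rule, purely as illustration (this is not the AIML engine we used; the helper names here are made up):

```python
import re

# Illustrative only: a crude stand-in for an AIML-style wildcard rule.
# The "*" in the pattern captures a name, which the response template
# reuses; that's the role <star/> plays in real AIML.
def make_rule(pattern, template):
    regex = re.compile(
        "^" + re.escape(pattern).replace(r"\*", "(.+)") + "$",
        re.IGNORECASE,
    )
    def respond(utterance):
        m = regex.match(utterance.strip())
        return template.format(star=m.group(1)) if m else None
    return respond

rule = make_rule("my name is *", "Pleased to meet you, {star}")
reply = rule("My name is Mark")      # matches, captures "Mark"
miss = rule("hello there")           # no match, returns None
```

Real AIML adds input normalization, priorities among competing patterns, and recursive <srai> rewriting on top of this basic match-and-substitute idea.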

The second was ChatScript. That bypassed the atomspace entirely.

The third was a ChatScript-inspired domain-specific language called
"ghost".  The intent was that authors would be able to write rules such as
"RESPONSE: pleased to meet you $star-1 BLINK GAZE-AT $star-1 BLINK SMILE".
I guess it worked. I never saw a working demo.  The actual authors were
drama students with no software experience: they felt it was "difficult
programming". They were used to typewritten scripts for TV shows, and if
it wasn't done in a word processor, it was "programming".  This was tough.
Only one person was good at this, Audrey LeeAnn Brown, and she had a
background in C++.  And I don't think she liked ghost. I think there were
some PhD students who did manage to get something going for LovingAI, but
I think they too side-stepped the complexity.

I later saw a demo from a game company. It was actually fairly impressive:
they had developed a GUI that allowed game designers to drag-n-drop their
way through directed NPC interactions.  Basically, the NPC is trying to
tell the player to go to this-n-such spaceport and meet some sketchy
space-pirate to get gold, weapons, etc.  The dialog tree automated a lot of
the low-level interaction, yet allowed fine-grained control.  In this
sense, the GUIs that have been developed for games are light-years beyond
what you can do with AIML or ChatScript; the main problem is that they're
expensive, proprietary, and have lots of core issues that would need to be
fixed to apply them to robots.

Open source is great for operating systems, compilers and databases. Not so
much for everything else.

> (and several similar phrases that are recognized by a fuzzy-logic kind of
> similarity finder I wrote), **AND** it can tell a face is present, it can
> filter out the name given, if any.  Then a few things happen – first, the
> language processor confirms the name by speaking "Hello <name> - did I get
> that right?" and listens for a variety of words that are either affirming
> or denying.
>

If you're walking that path ... well, this is what AIML is really good at.
Or, I guess, ghost?

>
>
> On affirmation, the system immediately begins taking snapshots every 10
> frames and stores them in a folder (the new faces dataset) of the person's
> name plus the date and time as a numeric string (Dave-202202251623 for
> example).  Once either the person exits the view for more than 100 frames
> (would-be 10 snapshots) or the system gains 100 actual face snapshots, it
> hands off those images to another of the scripts from Adrian Rosebrock
> (encode_faces.py) that encodes the faces and turns the whole bunch into a
> pickle, which is then appended to the bigger pickle that all the other
> known faces are in…  The name and data are also written to the database of
> "people known", where additional data is written over time as interactions
> with that person accrue.
>
>
>
> So I'm not sure if this answers your question about integrating it into
> the speech subsystem – I basically have the audio input and processing,
> audio output and visual input and processing all running in parallel on
> separate physical SBCs, which all talk to each other via ZeroMQ (or PyZMQ
> specifically).
>

The point of using ROS was that it allowed everything to be "modular", at
least in theory. That you could replace one subsystem by another.  Much
easier said than done.

ROS uses its own TCP-based transport (TCPROS, with an optional UDP mode)
to "talk". For ROS2, they considered ZMQ but rejected it in favor of
something else; DDS, I believe.
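Just to make the "talking" concrete: a toy example in plain Python of one subsystem handing a face-position message to another over a local UDP socket. This is illustrative only; it is neither ROS's actual transport nor ZeroMQ, and the message fields are invented:

```python
import json
import socket

# Toy illustration: one "core" sends a face-position message to another
# over a local UDP datagram socket.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))          # let the OS pick a free port
port = receiver.getsockname()[1]

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
message = {"topic": "face_position", "x": 0.31, "y": -0.12}
sender.sendto(json.dumps(message).encode(), ("127.0.0.1", port))

data, _addr = receiver.recvfrom(4096)    # blocks until the datagram lands
received = json.loads(data)

sender.close()
receiver.close()
```

ZeroMQ, ROS topics, and DDS all layer naming, pub/sub fan-out, and delivery guarantees on top of this sort of raw message-passing; that layering is what makes the subsystems swappable in theory.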

>
>
> It works very well, reasonably fast (especially given it only runs on Pi
> 4/8gig SBCs) and provides people interacting with the unmistakable feeling
> that the robot sees them, responds to their movements and speech, etc., and
> remembers them.
>

Moore's law.

So, ahh, one person who should have known better ordered the best,
highest-resolution webcams they could find. 1280x1024 or something. You
could only plug two of them into a USB hub before the hub was
overwhelmed, and the CPU attached to it could barely keep up with the
frame rate. Despite this obvious hardware fail, there was tremendous
resistance to down-scaling to a far more practical 640x480.  Add to that a
power, heat and cooling budget. Ugh.
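Back-of-the-envelope arithmetic shows the squeeze. The frame rate and pixel format below are assumptions for illustration (real webcams often send compressed MJPEG, which is what makes two high-res cameras even partially workable):

```python
# Rough bandwidth estimate: uncompressed video vs. the USB 2.0 bus.
# Assumes 30 fps and 2 bytes per pixel (YUYV); illustrative numbers only.

def video_mbits_per_sec(width, height, fps=30, bytes_per_pixel=2):
    return width * height * bytes_per_pixel * fps * 8 / 1e6

hi_res = video_mbits_per_sec(1280, 1024)   # the "best" webcams
lo_res = video_mbits_per_sec(640, 480)     # the practical choice

USB2_MBITS = 480   # USB 2.0 signalling rate; usable throughput is lower
```

A single uncompressed 1280x1024 stream (about 629 Mbit/s) already exceeds the bus, while a pair of 640x480 streams (about 295 Mbit/s combined) fits.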

Managing engineers is like herding cats.  Or pushing rope. Something like
that.


>
>
> The drawback that I haven't done anything with in the past year or so, but
> has a relatively easy fix – is that the pickle data for a given person ages
> (my grandkids are no longer reliably recognized since they were 3 and 5
> when I first implemented that build, and they are 6 and 8 now) – so I need
> to add a routine that occasionally updates the images silently in the
> background in the recognition pickle to keep up with changes…  but I've not
> had the time I wanted to to do these things…
>

For more-or-less all of the performances, there was a robot operator who
sat in the audience, monitoring the system in case it went haywire,
over-riding any responses that were inappropriate.  Putting together a good
GUI that allowed the robot operator to do this, running on a tablet, is
non-trivial. (It was a website, with assorted javascript attached to
various bits and pieces of the processing pipeline.)

For pretty much anything non-trivial running in the AtomSpace, one needs
some kind of visualization GUI to see what's going on.  We do not have
one.  I personally use printf for everything, because I can. But it's not,
umm, usable by anyone else.

>
>
> If any of this gives you anything useful to pick from, I can get you code,
> original source and my custom stuff.  It's all Python, so I'm guessing you
> should be good with that.
>
>
>
> Dave
>
>
>
>
>
> *From:* [email protected] <[email protected]> *On Behalf Of
> *Linas Vepstas
> *Sent:* Friday, February 25, 2022 3:33 PM
> *To:* opencog <[email protected]>
> *Subject:* Re: [opencog-dev] Vision for pi_vision and AGI/atomspace
>
>
>
> Hi Mark,
>
>
>
> Preface for anyone else reading this: Mark is dusting off the old Hanson
> Robotics code for Eva.  One of the subsystems was face-tracking. When your
> webcam was calibrated correctly, then Eva had this uncanny ability to look
> at you from out of the screen: her eyes would track your position. It was
> really pretty cool, as you really got the sense she was looking at you.
>
>
>
> Anyway, it seems that Mark has this code working again, or almost working?
> A related gotcha is some of the camera-transforms in Blender needed to be
> adjusted, to accurately reflect that you sit about an arms-length away from
> your computer screen, which is small on laptops but big on desktops, etc.
> so eye tracking didn't work right if all these dimensions weren't accounted
> for. It was kind of tricky to get it all right.  But when it worked, it was
> really cool and even spine-tingling.
>
>
>
> What about face recognition? This too worked, in a limited setting: she
> could recognize a handful of faces, and pull out the names of those people
> from a database.  There are then three questions; how did this work, back
> then, how can it be made to work in the short term, and what is the correct
> long-term architecture?
>
>
>
> First part: "how did it work back then"? See
> https://github.com/opencog/ros-behavior-scripting The code might be
> bit-rotted, but it worked. (There was some radical meatball surgery towards
> the end; this might need to be revisited.)  The general philosophy, back
> then, was that:
>
> * The 3D locations of objects (such as faces) would be stored in the
> opencog "spacetime server".
>
> * The only reason to do this was so that there could be an API for verbal
> propositions: near, far, next to, behind, in front of, to the left of, etc.
> that the language subsystem could use. That API was never built.
>
> * The AtomSpace would hold all information about everything, e.g face #135
> is actually Ben who is NN years old, lives in YY, loves robots, and is
> standing "next to" David (as reported by the space-server)
>
> * Why the AtomSpace? Because its the obvious place where current sensory
> info: sight & sound, can be integrated in with long-term knowledge and
> memories, as well as the dialog/language subsystem, as well as controlling
> movement and behaviour (turn left, right, blink and smile..)
>
> * Unfortunately, integrating the senses together with the background
> knowledge is hard. It was done in an ad hoc manner, it was
> under-documented, hard to use, hard to understand.  An adequate framework
> was never developed. This is not something one college student can knock
> out in a few weeks. The foundation for that framework is in the
> ros-behavior-scripting git repo. Fragments are in other places, I'd have to
> dig them up.
>
>
>
> So ... back to the question: face recognition:  Sure. Whatever. If you
> have a module that can recognize faces, then sure, whatever, have it
> forward that info to the AtomSpace.  That's the easy part.  The hard part
> is to integrate it into the speech subsystem.  So, when a new person
> appears in front of the camera, and says "Hi, my name is Mark", something
> has to extract the word "Mark", realize that "Mark" is someone's name,
> understand that there is probably a real-time correlation between that name
> and what the camera is seeing, take a snapshot of what the camera is
> seeing, and permanently tag that image with the name "Mark". To remember
> it. So that, minutes later, when Mark leaves the room and comes back, or
> months later, after a reboot, Eva still remembers what Mark looks like, as
> well as his favorite color, sports-team, childhood hero, mother's maiden
> name, last four digits of his soc sec and bank account #.
>
>
>
> I think all that is doable, and there are many different ways of doing the
> above, from quick short hacks to complicated theoretically-correct
> approaches ... but .. this email is too long, so, let me leave it at that.
>
>
>
> -- Linas
>
>
>
> On Fri, Feb 25, 2022 at 8:16 AM Mark Wigzell <[email protected]>
> wrote:
>
> Hi folks, my subject stems from having recently done a deep-dive into the
> pi_vision implementation. The original face detection and tracking was
> rusted, so I revamped it. In doing so I added in a hook for eventually
> augmenting the "new_face" message with some face recognition. I was
> informed that rather than splicing in some face detection algorithm at the
> pi_vision level, the "vision" would be to have the image elements reach the
> atomspace, and thus allow recognition to occur at a more basic level.
>
>
>
> Therefore, pursuant to the above, I'm asking for a high level description
> of how AGI vision could be accomplished. Perhaps we can also address
> the question of why face detection and tracking are "ok" but face
> recognition is not? Maybe all processing should be done at a lower level?
>
> --
> You received this message because you are subscribed to the Google Groups
> "opencog" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/opencog/CA%2Ba9A7AYNxawVTjbn5sQXp7AjToj1xteyCnCibrBO7TZwDDsSQ%40mail.gmail.com
> <https://groups.google.com/d/msgid/opencog/CA%2Ba9A7AYNxawVTjbn5sQXp7AjToj1xteyCnCibrBO7TZwDDsSQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>
>
>
> --
>
> Patrick: Are they laughing at us?
>
> Sponge Bob: No, Patrick, they are laughing next to us.
>
>
>
>
>
>


-- 
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.
