On Wed, Sep 15, 2021 at 6:59 AM Adrian Borucki <[email protected]> wrote:
>
> Okay, some clarification is needed because there is a sentence
> > Now, if you have some filter collection that is able to detect eyes, mouths 
> > and foreheads
>
> That suggests using some pre-existing (i. e. human-engineered) solution to 
> find things like eyes and mouths.

Ah, my mistake then. The intent was to illustrate a concept of a
"shape grammar": a means of describing the relationships of things in
2D, 3D or N-dimensional spaces. The "things" have labelled edges
connecting them; the "grammar" is what you get if you cut the edges in
half: a "thing" with a collection of labelled half-edges
("connectors").

If the "things" can be connected up in a grammatically-valid way, then
one has some assurance that a "face" was correctly recognized, because
all the parts are where they should be.

I used "face" as an example, because it seemed like it was easy to
explain. Of course, there is a recursion problem: how do you know that
something is an eye or a mouth? Those problems are also solved in a
similar fashion: a networked arrangement of filters -- a graph with
labelled edges -- having a grammar to it.
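To make the half-edge idea concrete, here is a toy sketch in Python. All the part names and connector labels are made up for illustration; none of this is taken from the "learn" project code. A "grammar" is a set of part types carrying labelled half-edges ("connectors"), and an assembly is grammatically valid when every connector pairs off with its mate on some other part:

```python
from collections import Counter

# Each part type lists its connectors; "+" and "-" mark the two halves
# of a cut edge, so "+eye-above-mouth" mates with "-eye-above-mouth".
# (Hypothetical names, for illustration only.)
GRAMMAR = {
    "eye":   ["+eye-above-mouth"],
    "mouth": ["-eye-above-mouth", "+mouth-below-nose"],
    "nose":  ["-mouth-below-nose"],
}

def mate(connector):
    """The complementary half-edge: flip the leading +/- sign."""
    sign, label = connector[0], connector[1:]
    return ("-" if sign == "+" else "+") + label

def is_valid_assembly(parts):
    """True if every connector in the bag of parts pairs off with its mate."""
    connectors = Counter()
    for part in parts:
        connectors.update(GRAMMAR[part])
    # every "+c" must be matched one-for-one by a "-c"
    return all(connectors[c] == connectors[mate(c)] for c in connectors)

print(is_valid_assembly(["eye", "mouth", "nose"]))  # all edges pair up: True
print(is_valid_assembly(["eye", "nose"]))           # dangling connectors: False
```

The assurance mentioned above falls out directly: a "face" is recognized exactly when the bag of detected parts leaves no dangling connectors.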

I'll try to fix the README to make this clearer.

> That said, I’ve looked again into the README and see that you specifically 
> mention segmentation as an image processing step.

Ugh. I mentioned that only as one possible "quick hack" (scaffolding)
to get a proof-of-concept working. It would have to be replaced by a
proper (learned) filter set for the final version.  (Of course, the
learner might learn how to segment, but that is an unrelated result.)

> That makes sense, as segmentation means assigning a label to each pixel of 
> the image. That means everything is going to be connected to something (in 
> the worst case that something being the background).

No, I want to go very much in the opposite direction. I do NOT
want any pixels anywhere in the pipeline, and certainly not any
labelled pixels.

When I say "filter", I am envisioning a detector, for example,
something that says "the upper half of the visual field is blue and
the lower half of the visual field is green", and this gives a one-bit
result: true or false. This is not quite a primitive filter, but
rather is composed of some filters for hue and maybe brightness, and
some other filters that accept or reject upper and lower parts of the
visual field.
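A minimal sketch of such a composed one-bit detector, in plain Python over a nested list of (r, g, b) values. The function names, thresholds and channel choices here are all invented for illustration; they are not the project's actual filter primitives:

```python
def mostly(region, channel, threshold=0.5):
    """One-bit primitive: does `channel` dominate, on average, in `region`?"""
    total = sum(px[channel] for row in region for px in row)
    count = sum(len(row) for row in region)
    return total / count > threshold

def sky_over_grass(image):
    """Composite filter: blue dominates above AND green dominates below."""
    half = len(image) // 2
    upper, lower = image[:half], image[half:]
    return mostly(upper, channel=2) and mostly(lower, channel=1)

# 4x4 toy image: two rows of "sky" over two rows of "grass"
sky   = (0.1, 0.2, 0.9)
grass = (0.1, 0.8, 0.2)
image = [[sky] * 4, [sky] * 4, [grass] * 4, [grass] * 4]
print(sky_over_grass(image))  # True
```

The point is the shape of the composition, not the arithmetic: two primitive hue filters, each restricted to a region of the visual field, combined into a single true/false output.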

Exactly what sequence of image operations it is composed of would have
to be learned (rather than hand built).  It would be learned by
observing many photos of outdoor scenes, and statistically noting that
blue is always above (even in photos of city scenes) and that green is
often below (in nature photos).  Heck -- just being able to detect
"blue is above" and converting that into a one-bit true-false value
becomes an indicator that it is an outdoor scene. The "parsed image"
is the combination of the grammatical elements, the linkage that there
is blue above, and something else below (that is, a vertical change in
hue or brightness or saturation -- a somewhat sharp change -- perhaps
one with many sharp but randomly oriented derivatives).

These filters need to be pixel-independent: I want to avoid the
silliness of having to write 1024 different filters, each having the
horizon in a slightly different pixel position. This means that the
filters really need to be wavelet filters, so that relative sizes and
scales are handled automatically.
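A toy illustration of why a Haar decomposition sidesteps the per-pixel problem: one transform of a brightness column flags a sharp vertical change at whatever row it sits, so no filter is pinned to a particular pixel position. Again, just a sketch, not project code (and it assumes power-of-two column lengths):

```python
def haar_step(signal):
    """One Haar level: (averages, differences) of adjacent pairs."""
    avg = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    dif = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avg, dif

def max_detail(signal):
    """Largest |detail coefficient| across all scales of the transform."""
    best = 0.0
    while len(signal) > 1:
        signal, details = haar_step(signal)
        best = max(best, max(abs(d) for d in details))
    return best

# A bright-above / dark-below brightness column: horizon at row 4 of 8 ...
col_a = [0.9] * 4 + [0.1] * 4
# ... and the same edge shifted down by one row.
col_b = [0.9] * 5 + [0.1] * 3
print(max_detail(col_a), max_detail(col_b))  # both large; the shift is tolerated
```

A single "large detail coefficient at some scale" test replaces the 1024 shifted copies of the same filter.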

... at least, that is the long-run idea. For bring-up and debugging,
almost any and all hacks are allowed, as otherwise rapid development
and testing is impossible.

> All in all, I think using a pre-existing segmentation model would for now 
> simplify the project but that is your call to make of course.

Heh. Well, not "my call" -- this needs to be a collaborative project,
and I have no desire to project an authoritarian personality. So,
more like "your call", but I want you to make the right decision,
based on an understanding of what the project actually is. Given
that this is experimental and exploratory, it is entirely normal that
the process will be filled with bad decisions and failed designs.

For bring-up, to develop and prove that a shape grammar can actually
be learned, I suppose that some pre-existing segmentation model might
be OK, maybe. It makes me nervous, though, because it builds in a
component that might be hard to remove later. Also, I think that
learning the shape grammar is easy. People have talked about shape
grammars for 50 years; it's not a new or novel concept; it should not
be that hard. The hard part is learning filter sequences that produce
useful outputs.

... and my philosophy of development is to focus on the hard parts
first. It's always easy to do the easy parts later.

> I can’t really opine about anything related to doing the “classic” Computer 
> Vision — my only guess is that you can probably hook up OpenCV with AtomSpace.

Yes, and that would be an important part of the project. The tricky
part is how to not waste too much time on this -- how to hook up just
enough to get the basic ideas working.

This means picking half-a-dozen or a dozen basic image operations --
hue, brightness filters, maybe some edge detectors, laplacians,
threshold filters -- and figuring out how to compose them together. In
Atomese, it would look something like this:

(GreaterThanLink 0.5 ; select blue values
    (HueFilterLink (Number 0.0 0.0 1.0)  ; RGB: no red, no green, only blue
        (HaarWavelet (Number 0 1)  ; lowest-order Haar vertically; none horizontally
            (VariableNode "x"))))

The above specifies a filter arrangement in Atomese that would get
bound to a specific OpenCV pipe when (Variable "x") is bound to a
webcam or photo. The above is just an example -- the learning process
would attempt different combinations of such things, and vary the
parameters, looking for "meaningful" pipelines and parameters.
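That search loop could be sketched, very roughly, like this: random search over compositions of a few scalar filter stages, scoring each candidate pipeline by how well its one-bit output matches a label over a batch of examples. Every name here (the stages, the scoring, the toy data) is hypothetical scaffolding, not any real project or OpenCV API:

```python
import random

# Primitive stages: each maps a scalar "feature" to a scalar. (Made up.)
STAGES = {
    "identity": lambda x: x,
    "invert":   lambda x: 1.0 - x,
    "sharpen":  lambda x: min(1.0, 2.0 * x),
    "squash":   lambda x: x * x,
}

def run_pipeline(stages, threshold, feature):
    """Apply the staged filters in order, then binarize: the one-bit output."""
    for name in stages:
        feature = STAGES[name](feature)
    return feature > threshold

def score(stages, threshold, examples):
    """Fraction of (feature, label) pairs the pipeline classifies correctly."""
    return sum(run_pipeline(stages, threshold, f) == lbl
               for f, lbl in examples) / len(examples)

# Toy data: "blueness of upper half" vs. an outdoor-scene label.
examples = [(0.9, True), (0.8, True), (0.7, True), (0.2, False), (0.1, False)]

# Random search: try 200 (stage-pair, threshold) candidates, keep the best.
random.seed(1)
best = max(
    ((random.sample(list(STAGES), k=2), random.random()) for _ in range(200)),
    key=lambda cand: score(cand[0], cand[1], examples),
)
print(best, score(best[0], best[1], examples))
```

Real evolutionary-programming systems of this kind mutate and recombine candidates rather than sampling blindly, but the skeleton -- generate pipeline, bind parameters, score against data -- is the same.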

Note that there are no pixels and no segmentation in the above. I
guess we could have a (ConvexContiguousRegionLink ...) that detects a
convex group of pixels that are mostly the same color... but this does
not seem required right now.

In terms of AI, this is again nothing new: people have been writing
evolutionary algorithms to automatically discover these kinds of
processing pipelines, for many decades -- at least, they were before
deep learning. I think progress in deep learning has halted work
on such ideas, because the neural nets work so fast and so well. So
what I'm proposing above is a big step backwards -- a big step
backwards in time, a big step backwards in computational ability,
compared to neural nets. The hoped-for gain is to have explicit
symbolic control over the elements in the pipeline. The learned
pipelines and parameters may be pretty random-looking, but they will
have an explicit symbolic representation, thus making them open to
reasoning, inference, deduction, and assorted abstract symbolic
manipulations, which neural nets cannot do.


> Also a handcrafted segmentator is still a human-engineered solution — I don’t 
> know if it’s really simpler because it requires more domain-specific 
> knowledge to understand and modify and is not going to be very robust.

Right. So maybe these should be avoided.

> Anyway, for now I don’t see much more to discuss actually, when you have 
> decided what data to use we can just move on to implementing the basic 
> functionality.
>
> By the way there is also research into unsupervised segmentation with models 
> like MONet or GENESIS that could be trained on arbitrary data to try to 
> figure out what things to segment as “anonymous" objects.

I saw that other email thread; I'll respond to it later (a few days,
maybe). We do need a source of test images. This could be (for the
above example) a collection of outdoor photos with blue skies in
them. But maybe also a collection of children's toys on a table,
with photos taken from many different angles -- the different angles
would cause the filters to learn about objects in space.  A second,
more difficult training set would involve the same toys, rearranged in
different locations.

--linas

> It is still in fairly early stages though — those models handle just 64x64 
> pixel images (now with colour, thankfully) and of course are not particularly 
> cheap to train… on an easy dataset perhaps using one of the findable 
> checkpoints from such pre-trained models would work, that would have to be 
> tested.
>
> On Wednesday, 15 September 2021 at 04:53:18 UTC+2 linas wrote:
>>
>> Trimming back the first part of the conversation...
>>
>> On Tue, Sep 14, 2021 at 9:09 AM Adrian Borucki <[email protected]> wrote:
>>
>> >
>> > Here’s the part I have questions about: how do you deal with the fact that 
>> > the regions won’t often be connected?
>>
>> I don't understand the question. What regions? Where did they come
>> from? What do you mean by "region"?
>>
>> > I am familiar with an idea of using Region Connection Calculus mentioned 
>> > in places like “Symbol Grounding via Chaining of Morphisms” and chapter 17 
>> > on spatio-temporal inference from EGI vol. 2.
>>
>> I'm not familiar with this. What is a "region"?
>>
>> > And it seems you have to use fuzzy versions of these relationships because,
>>
>> Sorry, fuzzy version of what relationship?
>>
>> > using the face grammar example, you won’t get a situation where, for 
>> > instance, detected eye regions
>>
>> Detecting eyes will be very hard; that won't be possible until a
>> rather large and complex software stack is working. That's why I
>> suggested starting with something simple -- for vision, detecting
>> tricolor flags in canonical position. Or maybe a video camera aimed at
>> a room or a sidewalk or street where there is low activity. For audio,
>> perhaps shifts in volume and frequency distribution.
>>
>> I dunno -- do I need to try to think of other simple data streams? I
>> guess so ... Some people have recommended that video games be used as
>> input, but I really don't like that ... it seems too artificial. It's
>> problematic, for multiple reasons. What other kind of visual input is
>> simple enough to process, to be debuggable, as a proof-of-concept?
>>
>> > (like bounding boxes from an object detector) are exactly connected 
>> > together — there is going to be some distance in between.
>>
>> What bounding boxes? Why would bounding boxes be needed? What would
>> you do with them?
>>
>> > So how do you deal with this?
>>
>> You splatted a bunch of questions without defining any of the
>> terminology, so I don't know how to respond... you seem to be thinking
>> of something very different from what I'm thinking of ... but I can't
>> tell what that is ...
>>
>> > The STI chapter mentions certain computational difficulties with the fuzzy 
>> > approach and proposes that using some crude assumptions you could have 
>> > something that could then be trained on a dataset to further improve it.
>>
>> What STI chapter? What fuzzy approach? Why do we need fuzzy-anything?
>> I thought I spelled out a rather specific, precise algorithm; the word
>> "fuzzy" did not appear in it ...
>>
>> > Is this part of the “learn” project or is there some other approach to it?
>>
>> The "learn" project has maybe 300+ pages of docs, but the basic ideas
>> are spelled out in a bunch of README's and overviews. It is possible
>> that these fail to communicate the ideas correctly, and .. that's
>> fixable, but will take some time. I'd rather exchange emails and
>> take steps one-at-a-time, rather than send you out to read hundreds of
>> pages of stuff...
>>
>> > Well, we can reuse some of those for our purposes — a generic object 
>> > detection model can be used to spot all sorts of things on an image,
>>
>> Sure, but it will take many years if not a decade to build a "generic
>> object detection model". I don't think this is something easy or
>> quick -- that's the end-point, not the start point.
>>
>> > we just need to find one that was trained with a taxonomy that suits us.
>>
>> Learning the "taxonomy" would be a rather advanced stage of the
>> project. I'm sort-of-ish exploring some basic aspects of something
>> like that at the NLP level, but so far, it's mostly ideas and very
>> little functional code. It will be at least a year and probably a lot
>> more, before we can learn taxonomy of visual or audio inputs. There's
>> a huge amount of preliminaries that have to be gotten out of the way.
>>
>> > Using such models with OpenCog has been done already by Alexei Potapov et 
>> > al. if I remember correctly. It’s mostly a matter of adapting that scheme 
>> > to the specifics of this project.
>>
>> ? Alexei is working on something else entirely. I don't know what it
>> is, but it's pretty much totally unrelated to what I'm working on...
>> unless he's keeping some secrets from me ...
>>
>> > The challange is, as always, to find data that has a model that can detect 
>> > things we want — with faces for example I can’t find detectors for face 
>> > parts but I can find models detecting key points, which includes mouths 
>> > and eyes.
>>
>> OK, this is a misunderstanding. The goal is to NOT use some
>> pre-trained, pre-built, human-engineered face detector trained on a
>> corpus carefully curated by humans. If you are using systems that are
>> hand-crafted, hand-curated by humans, it's not AGI any more. I'm very
>> much trying to go in the exact opposite direction. The goal is to get
>> the human data-engineering out of the loop.
>>
>> Detecting faces will be hard. It *might* be possible, maybe, once
>> everything is wired up, tested, debugged, tuned, tweaked, re-designed
>> and re-written a few times. I doubt face detection will be achievable
>> any sooner than a year from now, and that's only if it's a year of
>> full-time hard work and a whole lot of luck. Otherwise, I think face
>> detection is probably out of reach, for the short term. Lots of much
>> more basic things have to come together, first.
>>
>> I dunno, maybe one could get magically lucky, but I doubt it ...
>>
>> -- linas
>>
>>
>> --
>> Patrick: Are they laughing at us?
>> Sponge Bob: No, Patrick, they are laughing next to us.
>



-- 
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/CAHrUA35ECddjQ-bANoUPtXeGrMTWb%2BCYxvGYxWn7_bxXqji_6w%40mail.gmail.com.
