Hi Ben,

By RCC, I guess you mean the "region connection calculus"? This isn't that. This is more like moses-for-images. Except it's unsupervised. So more like a "pattern miner for images". Except it's not using the pattern-miner infrastructure; it's using the vector+matrix infrastructure.
--linas

On Tue, Sep 21, 2021 at 8:25 PM 'Ben Goertzel' via opencog <[email protected]> wrote:
>
> Hi, the RCC stuff was work done by me and Keyvan Sadeghi quite some years ago, which was paused not because it wasn't working but because Keyvan moved on to other stuff...
>
> Linas never dealt with that stuff, as far as I recall ...
>
> I think to make that sort of approach work scalably, you would need to use a hybrid inference engine that uses a specialized prover for fuzzy-RCC, interoperating with a general-purpose PLN prover for general conceptual relationships among the entities occupying the regions... But we never got there and shifted attention to other things...
>
> ben
>
> On Tue, Sep 14, 2021 at 7:10 AM Adrian Borucki <[email protected]> wrote:
> >
> > On Monday, 13 September 2021 at 19:53:55 UTC+2 linas wrote:
> >>
> >> On Mon, Sep 13, 2021 at 6:49 AM Adrian Borucki <[email protected]> wrote:
> >> >
> >> > On Sunday, 12 September 2021 at 18:55:23 UTC+2 linas wrote:
> >> >>
> >> >> On Sun, Sep 12, 2021 at 8:29 AM Adrian Borucki <[email protected]> wrote:
> >> >> >
> >> >> >> ----
> >> >> >> As to divine intervention vs. bumbling around: I'm still working on unsupervised learning, which I hope will someday be able to learn the rules of (common-sense) inference. I think I know how to apply it to audio and video data, and am looking for anyone who is willing to get neck-deep in both code and theory. In particular, for audio and video, I need someone who knows GPU audio/video processing libraries, and is willing to learn how to wrap them in Atomese. For starters.
> >> >> >
> >> >> > I might have some time to help with this - I've only done a bit of video/audio processing for ML, but I have some familiarity with the AtomSpace, so that part should be easier.
> >> >>
> >> >> Wow! That would be awesome!
> >> >>
> >> >> I thought some more about the initial steps. A large part of this would be setting up video/audio filters to run on GPUs, with the goal of being able to encode the filtering pipeline in Atomese -- so that expressions like "apply this filter, then that filter, then combine this and that" are stored as expressions in the AtomSpace.
> >> >>
> >> >> The research program would then be to look for structural correlations in the data. Generate some "random" filter sequences (building on previously "known good" filter structures) and see if they have "meaningful" correlations in them. Build up a vocabulary of "known good" filter sequences.
> >> >>
> >> >> One tricky part is finding something simple to start with. I imagined the local webcam feed: it should be able to detect when I'm in front of the keyboard, and when not, and rank that as an "interesting" fact.
> >> >
> >> > Sounds like something that would be processed with a library like OpenCV — it's important to distinguish between loading the video data and running GPU-accelerated operations on it. My experience with the latter is limited, as such operations are usually wrapped by some library like PyTorch or RAPIDS. Also, there is a difference between running something online vs. batch-processing a dataset — you mostly gain from GPU acceleration with the latter, unless it's something computationally expensive that has to run in real time.
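To make the "boxes and wires" idea concrete, here is a minimal sketch, in plain Python with OpenCV, of the kind of filter pipeline described above. The filter functions, their parameters, and the webcam usage are illustrative assumptions rather than project code; in the actual proposal, each box and the sequencing between boxes would be represented as Nodes and Links in the AtomSpace, not hard-coded as a Python list.

    import cv2
    import numpy as np

    # Each "box" in the wiring diagram is a named operation with a few control
    # parameters; a pipeline is just an ordered sequence of boxes.
    def to_gray(frame):
        return cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    def blur(frame, ksize=5):
        return cv2.GaussianBlur(frame, (ksize, ksize), 0)

    def edges(frame, lo=50, hi=150):
        return cv2.Canny(frame, lo, hi)

    def run_pipeline(frame, boxes):
        """Apply each filter box, in order, to a single video frame."""
        out = frame
        for box in boxes:
            out = box(out)
        return out

    if __name__ == "__main__":
        cap = cv2.VideoCapture(0)               # local webcam feed
        ok, frame = cap.read()
        if ok:
            result = run_pipeline(frame, [to_gray, blur, edges])
            print("edge pixels:", int(np.count_nonzero(result)))
        cap.release()

Putting the same structure into the AtomSpace is what would let an automated search rearrange the sequence and re-tune the parameters, which is exactly what a hard-coded script like this cannot do.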
> >> >
> >> > First, we need to elucidate what the actual "filters" are supposed to be — when we have a list, I can think about how the operations would be run.
> >> > Second, if you don't have an existing dataset that we can use, then we have to build one; that is probably the most time- and resource-consuming task here… it probably should be done first, actually.
> >> > There are existing video datasets that might be useful; it's worth looking into those.
> >>
> >> Good. Before that, though, I think we need to share a general vision of what the project "actually is", because that will determine datasets, libraries, etc. I tried to write this down in a file, https://github.com/opencog/learn/blob/master/README-Vision.md -- but it is missing important details, so let me try an alternate sketch.
> >>
> >> So here's an anecdote from Sophia the Robot: she had this habit of trying to talk through an audience clapping. Basically, she could not hear, and didn't know to pause when the audience clapped. (Yes, almost all her performances are scripted. Some small fraction are ad-libbed.) A manual operator in the audience would have to hit a pause button to keep her from rambling on. So I thought: "How can I build a clap detector?" Well, it would have to be some kind of audio filter -- some level of white noise (broad-spectrum noise), but with that peculiar clapping sound (so, not pure white noise, but dense shot noise), elevated above a threshold T for some time period S at least one second long. It is useful to think of this as a wiring diagram: some boxes connected with lines; each box might have some control parameters: length, threshold, time, frequency.
> >>
> >> So how do I build a clap detector? Well, download some suitable audio library, get some sound samples, and start trying to wire up some threshold detector *by hand*. Oooof. Yes, you can do it that way: classical engineering. After that, you have a dozen different other situations: booing. Laughing. Tense silence. Chairs scraping. And after that, a few hundred more... it's impossible to hand-design a filter set for every interesting case. So, instead: unleash automated learning. That is, represent the boxes and wires as Nodes and Links in the AtomSpace (the audio stream itself would be an AudioStreamValue) and let some automated algo rearrange the wiring diagram until it finds a good one.
> >>
> >> But what is a "good wiring diagram"? Well, the current very fashionable approach is to develop a curated, labelled training set, and train on that. "Curated" means "organized by humans" (Ooof-dah, humans in the loop again!) and "labelled" means each snippet has a tag: "clapping" - "cheering" - "yelling". (Yuck. What kind of yelling? Happy? Hostile? Asking for help? Are the labels even correct?) This might be the way people train neural nets, but really, it's the wrong approach for AGI. I don't want to do supervised training. (I mean, we could do supervised training in the opencog framework, but I don't see any value in that, right now.) So, let's do unsupervised training.
> >>
> >> But how? Now for a conceptual leap. This leap is hard to explain in terms of audio filters (it's rather abstract), so I want to switch to vision, before getting back to audio.
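Before the jump to vision, the audio example can be made concrete. Here is a minimal numpy sketch of the kind of hand-wired clap/applause detector described above: broadband, noise-like energy above a threshold, sustained for at least a second. The window length, the two thresholds, and the synthetic test signal are all made-up illustrative values; the point of the project is precisely that this wiring should eventually be discovered automatically rather than coded by hand like this.

    import numpy as np

    def clap_like_segments(samples, rate, energy_thresh=0.1,
                           flatness_thresh=0.3, min_dur=1.0, win=0.05):
        """Crude hand-built detector: flag stretches where short-time energy is
        high AND the spectrum is noise-like (flat), sustained for min_dur seconds."""
        n = int(win * rate)
        hits = []
        for i in range(0, len(samples) - n, n):
            frame = samples[i:i + n]
            rms = np.sqrt(np.mean(frame ** 2))
            power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
            flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
            hits.append(rms > energy_thresh and flatness > flatness_thresh)
        need = int(min_dur / win)               # frames the condition must persist
        segments, run, start = [], 0, 0
        for j, h in enumerate(hits + [False]):
            if h:
                if run == 0:
                    start = j
                run += 1
            else:
                if run >= need:
                    segments.append((start * win, (start + run) * win))
                run = 0
        return segments

    if __name__ == "__main__":
        rate = 16000
        quiet = 0.01 * np.random.randn(2 * rate)     # two seconds of near-silence
        applause = 0.5 * np.random.randn(2 * rate)   # two seconds of loud broadband noise
        print(clap_like_segments(np.concatenate([quiet, applause]), rate))

Every constant here (window, thresholds, minimum duration) is exactly the kind of control parameter that would sit on a box in the wiring diagram.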
> >> For vision, I claim there exists something called a "shape grammar". I hinted at this in the last email. A human face has a shape to it - a pair of eyes, symmetrically arranged above a mouth, in good proportion, etc. This shape has a "grammar" that looks like this:
> >>
> >>     left-eye: (connects-to-right-to-right-eye) and (connects-below-to-mouth) and (connects-above-to-forehead);
> >>     forehead: (connects-below-to-left-eye) and (connects-below-to-right-eye) and (connects-above-to-any-background);
> >>
> >> Now, if you have some filter collection that is able to detect eyes, mouths and foreheads, you can verify whether you have detected an actual face by checking against the above grammar. If all of the connectors are satisfied, then you have a "grammatically correct
> >
> > Here's the part I have questions about: how do you deal with the fact that the regions won't often be connected? I am familiar with the idea of using the Region Connection Calculus, mentioned in places like "Symbol Grounding via Chaining of Morphisms" and chapter 17 on spatio-temporal inference from EGI vol. 2. And it seems you have to use fuzzy versions of these relationships because, using the face-grammar example, you won't get a situation where, for instance, detected eye regions (like bounding boxes from an object detector) are exactly connected together — there is going to be some distance in between.
> >
> > So how do you deal with this? The spatio-temporal inference chapter mentions certain computational difficulties with the fuzzy approach and proposes that, with some crude assumptions, you could have something that could then be trained on a dataset to further improve it. Is this part of the "learn" project, or is there some other approach to it?
> >
> >> description of a face". So, although your filter collection was plucking eye-like and mouth-like features out of an image, the fact that they could be arranged into a grammatically-correct arrangement raises your confidence that you are seeing a face.
> >>
> >> Those people familiar with Link Grammar will recognize the above as a peculiar variant of a Link-Grammar dictionary. (And thus I am cc'ing the mailing list.)
> >>
> >> But where did the grammar come from? For that matter, where did the eye and mouth filters come from? It certainly would be a mistake to have an army of grad students writing shape grammars by hand. The grammar has to be learned automatically, in an unsupervised fashion... and that is what the opencog/learn project is all about.
> >>
> >> At this point, things become highly abstract very quickly, and I will cut this email short. Very roughly, though: one looks for pair-wise correlations in the data. Having found good pairs, one then draws maximum spanning trees (or maximum planar graphs) with those pairs, and extracts frequently-occurring vertex-types and their associated connectors. That gives you a raw grammar. Generalization requires clustering specific instances of this into general forms. I'm working on those algos now.
> >>
> >> The above can learn (should be able to learn) both a "shape grammar" and also a "filter grammar" ("meaningful" combinations of processing filters; meaningful, in that they extract correlations in the data).
> >>
> >> So that is the general idea. Now, to get back to your question: what sort of video (or audio) library? What sort of dataset? I dunno. Beats me.
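On Adrian's question above about regions that do not literally touch: nothing in the thread settles the mechanism, but one plausible, purely illustrative treatment is to read each connector as a fuzzy spatial predicate with a tolerance, so that bounding boxes coming out of an off-the-shelf detector can still satisfy the grammar even when there is a gap between them. The class, predicate names, tolerances and example coordinates below are all invented for the sketch.

    from dataclasses import dataclass

    @dataclass
    class Box:
        label: str
        x: float   # center x, pixels
        y: float   # center y, pixels
        w: float   # width
        h: float   # height

    def roughly_right_of(a, b, tol=4.0):
        """Fuzzy 'connects-to-right': b lies to the right of a, at roughly the
        same height, and no farther away than tol widths of a."""
        return b.x > a.x and abs(b.y - a.y) < a.h and (b.x - a.x) < tol * a.w

    def roughly_below(a, b, tol=6.0):
        """Fuzzy 'connects-below': b lies below a, roughly aligned horizontally,
        and no farther away than tol heights of a."""
        return b.y > a.y and abs(b.x - a.x) < b.w and (b.y - a.y) < tol * a.h

    def face_grammar_satisfied(left_eye, right_eye, mouth):
        """All connectors of the toy face grammar are (fuzzily) satisfied."""
        return (roughly_right_of(left_eye, right_eye)
                and roughly_below(left_eye, mouth)
                and roughly_below(right_eye, mouth))

    # Plausible detections from some keypoint or bounding-box model, in pixels.
    le = Box("left-eye", 100, 120, 30, 15)
    re = Box("right-eye", 170, 122, 30, 15)
    mo = Box("mouth", 135, 200, 60, 25)
    print(face_grammar_satisfied(le, re, mo))   # True

Whether such tolerances are hand-set, as here, or themselves learned (closer to the fuzzy-RCC route Ben mentions at the top of the thread) is essentially the question Adrian is raising.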
> >> Best to start small: find some incredibly simple problem, and prove that the general idea works on that. Scale up from there. You get to pick that problem, according to taste.
> >>
> >> One idea was to build a "French flag detector": this should be "easy": it's just three color bars, side by side. The grammar is very simple. The training set might be a bunch of French flags. Now, if the goal is to ONLY learn the shape grammar, then you have to hack up, by hand, some ad-hoc color and hue and contrast filters. If you want to learn the filter grammar, then ... well, that's a lot harder for vision, because almost all images are extremely information-rich. The training corpus would have to be selected to be very simple: only those flags in canonical position (not draped). Then, either one has extremely simple backgrounds, or one has a very large corpus, as otherwise you risk training on something in the background, instead of the flags.
> >>
> >> For automated filter-grammars, perhaps audio is simpler? Because most audio samples are not as information-rich as video/photos?
> >>
> >> I dunno. This is where it becomes hard. Even before all the fancy theory and what-not, finding a suitable toy problem that is solvable without a hopeless amount of CPU processing and practical stumbling blocks... that's hard. Even worse is that state-of-the-art neural-net systems have billions of CPU-hours behind them, computed with well-written, well-debugged, highly optimized software, created by armies of salaried PhDs working at the big tech companies. Any results we get will look pathetic, compared to what those systems can do.
> >
> > Well, we can reuse some of those for our purposes — a generic object-detection model can be used to spot all sorts of things in an image; we just need to find one that was trained with a taxonomy that suits us. Using such models with OpenCog has already been done by Alexei Potapov et al., if I remember correctly. It's mostly a matter of adapting that scheme to the specifics of this project.
> >
> > The challenge is, as always, to find data for which there is a model that can detect the things we want — with faces, for example, I can't find detectors for face parts, but I can find models that detect key points, which include mouths and eyes (like this library: https://github.com/open-mmlab/mmpose with this dataset: https://github.com/jin-s13/COCO-WholeBody).
> >
> >> The reason I find it promising is this: all those neural-net systems do is supervised training. They don't actually "think"; they don't need to. They don't need to find relationships out of thin air. So I think this is something brand new that we're doing, that no one else does. Another key difference is that we are working explicitly at the symbolic level. By having a grammar, we have an explicit part-whole relationship. This is something the neural-net guys cannot do. (Hinton, I believe, has a paper on how, one day in the distant future, neural nets might be able to solve the part-whole relationship problem. By contrast, we've already solved it, more or less from day one.)
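Coming back to the French-flag toy problem above for a moment, here is a minimal sketch, using only numpy, of what the hand-hacked color filters plus the trivial "three bands in the right order" grammar might look like. The color thresholds, the minimum-area fraction and the synthetic test image are all invented for illustration; a draped or shaded flag would defeat this immediately, which is the point of wanting the filters to be learned instead.

    import numpy as np

    def looks_like_tricolour(img_bgr, min_frac=0.05):
        """Toy 'French flag' check: a blue, a white and a red band, left to right."""
        b = img_bgr[..., 0].astype(int)
        g = img_bgr[..., 1].astype(int)
        r = img_bgr[..., 2].astype(int)
        masks = {
            "blue":  (b > 150) & (g < 120) & (r < 100),
            "white": (b > 180) & (g > 180) & (r > 180),
            "red":   (r > 150) & (g < 120) & (b < 100),
        }
        cols = np.arange(img_bgr.shape[1])
        centers = {}
        for name, mask in masks.items():
            if mask.mean() < min_frac:        # each band must cover some area
                return False
            centers[name] = (mask.sum(axis=0) * cols).sum() / mask.sum()
        # the whole "grammar": blue left of white, white left of red
        return centers["blue"] < centers["white"] < centers["red"]

    if __name__ == "__main__":
        flag = np.zeros((60, 90, 3), dtype=np.uint8)   # synthetic flag image, BGR
        flag[:, :30] = (200, 50, 40)                   # blue band
        flag[:, 30:60] = (255, 255, 255)               # white band
        flag[:, 60:] = (40, 50, 200)                   # red band
        print(looks_like_tricolour(flag))              # True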
> >> We've also "solved" the "symbol grounding problem" -- from day one. This is another problem that AI researchers have been wringing their hands about, from the 1960s onwards. Our symbols are grounded from the start: our symbols are the filter sets, the grammatical dictionary entries, and we "know what they mean" because they work with explicit data.
> >>
> >> Another very old AI problem is the "frame problem", and I think that we've got that one licked, too, although this is a far more tenuous claim. The "frame problem" is one of selecting only those things that are relevant to a particular reasoning problem, and ignoring all of the rest. Well, hey: this is exactly what grammars do: they tell you exactly what is relevant, and they ignore the rest. The grammars have learned to ignore the background features that don't affect the current situation. But whatever... this gets abstract and can lead to an endless spill of words. I am much more interested in creating software that actually works.
> >>
> >> So ... that's it. What are the next steps? How can we do this?
> >>
> >> -- Linas
> >>
> >> --
> >> Patrick: Are they laughing at us?
> >> Sponge Bob: No, Patrick, they are laughing next to us.
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "He not busy being born is busy dying" -- Bob Dylan

--
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.
