On Monday, 13 September 2021 at 19:53:55 UTC+2 linas wrote:
> On Mon, Sep 13, 2021 at 6:49 AM Adrian Borucki <[email protected]> wrote:

>> On Sunday, 12 September 2021 at 18:55:23 UTC+2 linas wrote:

>>> On Sun, Sep 12, 2021 at 8:29 AM Adrian Borucki <[email protected]> wrote:

>>>>> ----
>>>>> As to divine intervention vs. bumbling around: I'm still working on unsupervised learning, which I hope will someday be able to learn the rules of (common-sense) inference. I think I know how to apply it to audio and video data, and am looking for anyone who is willing to get neck-deep in both code and theory. In particular, for audio and video, I need someone who knows GPU audio/video processing libraries, and is willing to learn how to wrap them in Atomese. For starters.

>>>> I might have some time to help with this - I only did a bit of video/audio processing for ML, but I have some familiarity with AtomSpace, so that part should be easier.

>>> Wow! That would be awesome!

>>> I thought some more about the initial steps. A large part of this would be setting up video/audio filters to run on GPUs, with the goal of being able to encode the filtering pipeline in Atomese -- so that expressions like "apply this filter then that filter then combine this and that" are stored as expressions in the AtomSpace.

>>> The research program would then be to look for structural correlations in the data. Generate some "random" filter sequences (building on previously "known good" filter structures) and see if they have "meaningful" correlations in them. Build up a vocabulary of "known good" filter sequences.

>>> One tricky part is finding something simple to start with. I imagined the local webcam feed: it should be able to detect when I'm in front of the keyboard, and when not, and rank that as an "interesting" fact.

>> Sounds like something that would be processed with a library like OpenCV — it's important to distinguish between video data loading and using GPU-accelerated operations. My experience with the latter is very small, as this is something usually wrapped with some library like PyTorch or RAPIDS. Also, there is a difference between running something on-line vs. batch processing of a dataset — you mostly gain from GPU acceleration when working with the latter, unless it's something computationally expensive that's supposed to run in real time.

>> First, we need to elucidate what actual "filters" are supposed to be used — when we have a list I can think about how the operations would be run.

>> Second, if you don't have an existing dataset that we can use, then we have to build one; that is probably the most time- and resource-consuming task here… probably should be done first, actually. There are existing video datasets that might be useful; it's worth looking into those.
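To make the webcam idea concrete for myself, here is a rough, hand-wired sketch of what such a "pipeline of filters" could look like in Python with OpenCV. Everything in it (the stage functions, the 0.05 threshold) is made up by me for illustration and is not anything from the learn repo; the point is just that the pipeline itself is ordinary data, a list of stages, that an automated search could later rearrange:

    import cv2
    import numpy as np

    # Each "filter" is a plain function from frame to frame; the pipeline is
    # just data (a list of stages), which is the part that would eventually
    # live in the AtomSpace and get rearranged by a learner.
    def to_gray(img):
        return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    def blur(img):
        return cv2.GaussianBlur(img, (21, 21), 0)

    def diff_against(background):
        return lambda img: cv2.absdiff(img, background)

    def changed_fraction(img, noise=25):
        # Fraction of pixels that differ noticeably from the background.
        return float(np.count_nonzero(img > noise)) / img.size

    cap = cv2.VideoCapture(0)              # local webcam
    ok, first = cap.read()
    if not ok:
        raise RuntimeError("no webcam frame")
    background = blur(to_gray(first))      # assume the first frame shows an empty desk

    # "Apply this filter, then that filter": the pipeline as plain data.
    pipeline = [to_gray, blur, diff_against(background)]

    for _ in range(300):                   # a few seconds of frames
        ok, frame = cap.read()
        if not ok:
            break
        x = frame
        for stage in pipeline:
            x = stage(x)
        present = changed_fraction(x) > 0.05   # arbitrary, hand-picked threshold
        print("someone at the keyboard" if present else "empty")
    cap.release()

A learner would have to come up with the stages and the thresholds itself instead of having me pick them, which I take to be the whole point of what follows.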
> Good. Before that, though, I think we need to share a general vision of what the project "actually is", because that will determine datasets, libraries, etc. I tried to write those down in a file https://github.com/opencog/learn/blob/master/README-Vision.md -- but it is missing important details, so let me try an alternate sketch.

> So here's an anecdote from Sophia the Robot: she had this habit of trying to talk through an audience clapping. Basically, she could not hear, and didn't know to pause when the audience clapped. (Yes, almost all her performances are scripted. Some small fraction are ad-libbed.) A manual operator in the audience would have to hit a pause button, to keep her from rambling on. So I thought: "How can I build a clap detector?" Well, it would have to be some kind of audio filter -- some level of white noise (broad-spectrum noise), but with that peculiar clapping sound (so, not pure white noise, but dense shot noise), elevated above a threshold T for some time period S at least one second long. It is useful to think of this as a wiring diagram: some boxes connected with lines; each box might have some control parameters: length, threshold, time, frequency.

> So how do I build a clap detector? Well, download some suitable audio library, get some sound samples, and start trying to wire up some threshold detector *by hand*. Oooof. Yes, you can do it that way: classical engineering. After that, you have a dozen different other situations: booing. Laughing. Tense silence. Chairs scraping. And after that, a few hundred more... it's impossible to hand-design a filter set for every interesting case. So, instead: unleash automated learning. That is, represent the boxes and wires as Nodes and Links in the AtomSpace (the audio stream itself would be an AudioStreamValue) and let some automated algo rearrange the wiring diagram until it finds a good one.

> But what is a "good wiring diagram"? Well, the current very fashionable approach is to develop a curated, labelled training set, and train on that. "Curated" means "organized by humans" (Ooof-dah. Humans in the loop again!) and "labelled" means each snippet has a tag: "clapping" - "cheering" - "yelling". (Yuck. What kind of yelling? Happy? Hostile? Asking for help? Are the labels even correct?) This might be the way people train neural nets, but really, it's the wrong approach for AGI. I don't want to do supervised training. (I mean, we could do supervised training in the opencog framework, but I don't see any value in that, right now.) So, let's do unsupervised training.

> But how? Now for a conceptual leap. This leap is hard to explain in terms of audio filters (it's rather abstract), so I want to switch to vision, before getting back to audio. For vision, I claim there exists something called a "shape grammar". I hinted at this in the last email. A human face has a shape to it - a pair of eyes, symmetrically arranged above a mouth, in good proportion, etc. This shape has a "grammar" that looks like this:

> left-eye: (connects-to-right-to-right-eye) and (connects-below-to-mouth) and (connects-above-to-forehead);
> forehead: (connects-below-to-left-eye) and (connects-below-to-right-eye) and (connects-above-to-any-background);

> Now, if you have some filter collection that is able to detect eyes, mouths and foreheads, you can verify whether you have detected an actual face by checking against the above grammar. If all of the connectors are satisfied, then you have a "grammatically correct description of a face". So, although your filter collection was plucking eye-like and mouth-like features out of an image, the fact that they could be arranged into a grammatically-correct arrangement raises your confidence that you are seeing a face.
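Just to check that I'm reading the grammar right, this is how I picture it in code. The dicts, the predicates and the pixel numbers below are my own stand-ins (plain Python instead of Atomese Sections and Connectors, and a crude distance tolerance instead of real connection), but the idea is that a face hypothesis is accepted only when every connector of every detected part is satisfied:

    # A detected part: name -> bounding box (x, y, w, h), e.g. from some detector.
    parts = {
        "left-eye":  (120, 80, 40, 20),
        "right-eye": (200, 82, 40, 20),
        "mouth":     (150, 170, 80, 30),
    }

    def center(box):
        x, y, w, h = box
        return (x + w / 2.0, y + h / 2.0)

    # Connector predicates: purely relative position, with some slack,
    # since detected boxes never touch exactly.
    def right_of(a, b, parts):
        return center(parts[b])[0] > center(parts[a])[0]

    def below(a, b, parts, max_gap=120):
        ax, ay = center(parts[a])
        bx, by = center(parts[b])
        return 0 < (by - ay) < max_gap

    # The "dictionary entry" for left-eye, mirroring the grammar above:
    # left-eye: (connects-to-right-to-right-eye) and (connects-below-to-mouth)
    grammar = {
        "left-eye":  [lambda p: right_of("left-eye", "right-eye", p),
                      lambda p: below("left-eye", "mouth", p)],
        "right-eye": [lambda p: below("right-eye", "mouth", p)],
    }

    def grammatical(parts, grammar):
        # "Grammatically correct description of a face": every connector of
        # every detected part must be satisfied.
        return all(conn(parts) for word, conns in grammar.items()
                   for conn in conns if word in parts)

    print(grammatical(parts, grammar))   # True for the toy boxes above

Note the max_gap tolerance I had to sneak in; that is exactly the fuzziness my question below is about.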
Here's the part I have questions about: how do you deal with the fact that the regions often won't be connected? I am familiar with the idea of using Region Connection Calculus, mentioned in places like "Symbol Grounding via Chaining of Morphisms" and the chapter 17 on spatio-temporal inference in EGI vol. 2. It seems you have to use fuzzy versions of these relationships because, using the face-grammar example, you won't get a situation where, for instance, the detected eye regions (like bounding boxes from an object detector) are exactly connected together — there is going to be some distance in between. So how do you deal with this? The spatio-temporal inference chapter mentions certain computational difficulties with the fuzzy approach and proposes that, with some crude assumptions, you could have something that could then be trained on a dataset to improve it further. Is this part of the "learn" project, or is there some other approach to it?

> Those people familiar with Link Grammar will recognize the above as a peculiar variant of a Link Grammar dictionary (and thus I am cc'ing the mailing list).

> But where did the grammar come from? For that matter, where did the eye and mouth filters come from? It certainly would be a mistake to have an army of grad students writing shape grammars by hand. The grammar has to be learned automatically, in an unsupervised fashion... and that is what the opencog/learn project is all about.

> At this point, things become very abstract very quickly, and I will cut this email short. Very roughly, though: one looks for pair-wise correlations in data. Having found good pairs, one then draws maximum spanning trees (or maximum planar graphs) with those pairs, and extracts frequently-occurring vertex types, and their associated connectors. That gives you a raw grammar. Generalization requires clustering specific instances of this into general forms. I'm working on those algos now.
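Just so I'm sure I follow the pair-counting and spanning-tree step, here is a toy version of how I understand it. The "observations", the pointwise-MI formula and the use of networkx are my guesses at the shape of the computation, not the actual counting code in opencog/learn: count which features co-occur, weight each pair by its mutual information, take the maximum spanning tree (a forest, if the data splits into clusters), and read each vertex's incident tree edges as its connectors in the raw grammar.

    from collections import Counter
    from math import log2
    import networkx as nx   # assumed available; any MST routine would do

    # Toy "observations": each is a bag of co-occurring features
    # (e.g. filter outputs that fired on the same image).
    observations = [
        {"left-eye", "right-eye", "mouth"},
        {"left-eye", "right-eye", "mouth", "forehead"},
        {"left-eye", "mouth"},
        {"wheel", "door", "window"},
        {"wheel", "door"},
    ]

    single = Counter()
    pair = Counter()
    for obs in observations:
        for a in obs:
            single[a] += 1
        for a in obs:
            for b in obs:
                if a < b:
                    pair[(a, b)] += 1

    n = len(observations)
    g = nx.Graph()
    for (a, b), nab in pair.items():
        # Pointwise mutual information of the pair, used as the edge weight.
        mi = log2((nab / n) / ((single[a] / n) * (single[b] / n)))
        g.add_edge(a, b, weight=mi)

    mst = nx.maximum_spanning_tree(g, weight="weight")

    # Each vertex's incident tree edges play the role of its connectors.
    for v in mst.nodes:
        print(v, "--", sorted(mst.neighbors(v)))

The generalization step you mention would then, if I understand correctly, merge vertices whose connector sets look alike.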
> The above can learn (should be able to learn) both a "shape grammar" and also a "filter grammar" ("meaningful" combinations of processing filters; meaningful, in that they extract correlations in the data).

> So that is the general idea. Now, to get back to your question: what sort of video (or audio) library? What sort of dataset? I dunno. Beats me. Best to start small: find some incredibly simple problem, and prove that the general idea works on that. Scale up from there. You get to pick that problem, according to taste.

> One idea was to build a "French flag detector": this should be "easy": it's just three color bars, side by side. The grammar is very simple. The training set might be a bunch of French flags. Now, if the goal is to ONLY learn the shape grammar, then you have to hack up, by hand, some ad-hoc color and hue and contrast filters. If you want to learn the filter grammar, then... well, that's a lot harder for vision, because almost all images are extremely information-rich. The training corpus would have to be selected to be very simple: only those flags in canonical position (not draped). Then, either one has extremely simple backgrounds, or one has a very large corpus, as otherwise you risk training on something in the background, instead of the flags.

> For automated filter-grammars, perhaps audio is simpler? Because most audio samples are not as information-rich as video/photos?

> I dunno. This is where it becomes hard. Even before all the fancy theory and what-not, finding a suitable toy problem that is solvable without a hopeless amount of CPU processing and practical stumbling blocks... that's hard. Even worse is that state-of-the-art neural-net systems have billions of CPU-hours behind them, computed with well-written, well-debugged, highly optimized software, created by armies of salaried PhDs working at the big tech companies. Any results we get will look pathetic, compared to what those systems can do.

Well, we can reuse some of those for our purposes — a generic object-detection model can be used to spot all sorts of things in an image; we just need to find one that was trained with a taxonomy that suits us. Using such models with OpenCog has already been done by Alexei Potapov et al., if I remember correctly, so it's mostly a matter of adapting that scheme to the specifics of this project. The challenge, as always, is to find a model that can detect the things we want — with faces, for example, I can't find detectors for face parts, but I can find models that detect key points, which include mouths and eyes (like this library: https://github.com/open-mmlab/mmpose with this dataset: https://github.com/jin-s13/COCO-WholeBody).

> The reason I find it promising is this: all that those neural-net systems do is supervised training. They don't actually "think"; they don't need to. They don't need to find relationships out of thin air. So I think this is something brand new that we're doing, that no one else does. Another key difference is that we are working explicitly at the symbolic level. By having a grammar, we have an explicit part-whole relationship. This is something the neural-net guys cannot do. (Hinton, I believe, has a paper on how, one day in the distant future, neural nets might be able to solve the part-whole relationship problem. By contrast, we've already solved it, more or less from day one.)

> We've also "solved" the "symbol grounding problem" -- from day one. This is another problem that AI researchers have been wringing their hands about, from the 1960s onwards. Our symbols are grounded, from the start: our symbols are the filter sets, the grammatical dictionary entries, and we "know what they mean" because they work with explicit data.

> Another very old AI problem is the "frame problem", and I think that we've got that one licked, too, although this is a far more tenuous claim. The "frame problem" is one of selecting only those things that are relevant to a particular reasoning problem, and ignoring all of the rest. Well, hey: this is exactly what grammars do: they tell you exactly what is relevant, and they ignore the rest. The grammars have learned to ignore the background features that don't affect the current situation. But whatever... This gets abstract and can lead to an endless spill of words. I am much more interested in creating software that actually works.

> So... that's it. What are the next steps? How can we do this?

> -- Linas

> --
> Patrick: Are they laughing at us?
> Sponge Bob: No, Patrick, they are laughing next to us.
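P.S. On the key-point models: the glue between such a detector and the shape grammar would be something like the sketch below. The detect_keypoints stub and its made-up coordinates stand in for whatever the real library returns (for mmpose it would be face landmarks predicted by a model trained on COCO-WholeBody); the only real content here is grouping named landmarks into the bounding-box "parts" that a grammar check consumes:

    import numpy as np

    def detect_keypoints(image):
        # Hypothetical stub: a real implementation would call a pretrained
        # whole-body / face keypoint model and return named landmark
        # coordinates; these numbers are invented for illustration.
        return {
            "left_eye":  [(120, 85), (135, 82), (150, 86)],
            "right_eye": [(200, 84), (215, 81), (230, 85)],
            "mouth":     [(160, 175), (190, 172), (175, 185)],
        }

    def to_region(points):
        # Collapse a group of landmarks into a bounding box (x, y, w, h),
        # i.e. the kind of "part" the toy grammar check above expects.
        pts = np.array(points)
        x0, y0 = pts.min(axis=0)
        x1, y1 = pts.max(axis=0)
        return (int(x0), int(y0), int(x1 - x0), int(y1 - y0))

    keypoints = detect_keypoints(None)   # image omitted in this stub
    parts = {name.replace("_", "-"): to_region(pts)
             for name, pts in keypoints.items()}
    print(parts)   # e.g. {'left-eye': (120, 82, 30, 4), ...}

From there, the toy connector check earlier in this message (or the real Atomese version of it) could decide whether the parts hang together as a face.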
