On Wednesday, 15 September 2021 at 19:55:04 UTC+2 linas wrote:
> On Wed, Sep 15, 2021 at 6:59 AM Adrian Borucki <[email protected]> wrote:
> >
> > Okay, some clarification is needed, because there is a sentence
> > > Now, if you have some filter collection that is able to detect eyes, mouths and foreheads
> >
> > That suggests using some pre-existing (i.e. human-engineered) solution to find things like eyes and mouths.
>
> Ah, my mistake then. The intent was to illustrate the concept of a "shape grammar": a means of describing the relationships of things in 2D, 3D or N-dimensional spaces. The "things" have labelled edges connecting them; the "grammar" is what you get if you cut the edges in half: a "thing" with a collection of labelled half-edges ("connectors").
>
> If the "things" can be connected up in a grammatically valid way, then one has some assurance that a "face" was correctly recognized, because all the parts are where they should be.
>
> I used "face" as an example because it seemed easy to explain. Of course, there is a recursion problem: how do you know that something is an eye or a mouth? Those problems are solved in a similar fashion: a networked arrangement of filters -- a graph with labelled edges -- having a grammar to it.
>
> I'll try to fix the README to make this clearer.
>
> > That said, I've looked again into the README and see that you specifically mention segmentation as an image-processing step.
>
> Ugh. I mentioned that only as one possible "quick hack" (scaffolding) to get a proof-of-concept working. It would have to be replaced by a proper (learned) filter set for the final version. (Of course, the learner might learn how to segment, but that is an unrelated result.)
>
> > That makes sense, as segmentation means assigning a label to each pixel of the image. That means everything is going to be connected to something (in the worst case, that something being the background).
>
> No, I want to go very much in the opposite of that direction.
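[Editorial aside: the "connector" idea above can be made concrete with a small sketch. All names here are invented for illustration -- this is not project code, just the half-edge-matching check under the simplest possible assumptions (a link is valid when every "+" connector mates with a "-" connector of the same label).]

```python
# Toy shape-grammar check: "things" carry labelled half-edges ("connectors");
# a collection links up grammatically when every "+" half-edge pairs off
# with a "-" half-edge of the same label on some other thing.
from collections import Counter

def connectors(thing):
    """Multiset of (label, direction) connectors of a thing."""
    return Counter(thing["connectors"])

def links_up(things):
    """True iff every '+' connector mates with a '-' connector of the same label."""
    total = Counter()
    for t in things:
        total += connectors(t)
    labels = {lbl for lbl, _ in total}
    return all(total[(lbl, "+")] == total[(lbl, "-")] for lbl in labels)

# Two eyes, each wanting to sit above a mouth; a mouth offering two such slots.
eye   = {"name": "eye",   "connectors": [("above-mouth", "+")]}
eye2  = {"name": "eye",   "connectors": [("above-mouth", "+")]}
mouth = {"name": "mouth", "connectors": [("above-mouth", "-"), ("above-mouth", "-")]}

links_up([eye, eye2, mouth])  # all connectors mate -> a plausible "face"
links_up([eye, mouth])        # an unmated connector remains -> not valid
```

A real system would of course also constrain *where* the mated parts sit relative to each other; this only shows the counting-of-connectors skeleton.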
> I do NOT want any pixels anywhere in the pipeline, and certainly not any labelled pixels.
>
> When I say "filter", I am envisioning a detector: for example, something that says "the upper half of the visual field is blue and the lower half of the visual field is green", and this gives a one-bit result: true or false. This is not quite a primitive filter, but rather is composed of some filters for hue and maybe brightness, and some other filters that accept or reject upper and lower parts of the visual field.
>
> Exactly what sequence of image operations it is composed of would have to be learned (rather than hand-built). It would be learned by observing many photos of outdoor scenes, and statistically noting that blue is always above (even in photos of city scenes) and that green is often below (in nature photos). Heck -- just being able to detect "blue is above" and converting that into a one-bit true/false value becomes an indicator that it is an outdoor scene. The "parsed image" is the combination of the grammatical elements: the linkage that there is blue above and something else below (that is, a vertical change in hue or brightness or saturation -- a somewhat sharp change -- perhaps one with many sharp but randomly oriented derivatives).
>
> These filters need to be pixel-independent: I want to avoid the silliness of having to write 1024 different filters, each having the horizon in a slightly different pixel position. This means that the filters really need to be wavelet filters, so that relative sizes and scales are handled automatically.
>
> ... at least, that is the long-run idea. For bring-up and debugging, almost any and all hacks are allowed, as otherwise rapid development and testing is impossible.
>
> > All in all, I think using a pre-existing segmentation model would for now simplify the project, but that is your call to make, of course.
>
> Heh.
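[Editorial aside: the one-bit "blue above" detector described above can be sketched in a few lines. This is an illustrative toy, not project code -- the image is any HxW grid of (r, g, b) floats in [0, 1], and because only the mean of the upper half is examined, the result does not depend on the pixel resolution, which is the scale-independence being asked for.]

```python
# One-bit "blue is above" detector: compare the mean blue value of the
# upper half of the image against a threshold; any grid size works.

def mean_channel(rows, channel):
    """Mean value of one RGB channel over a list of pixel rows."""
    vals = [px[channel] for row in rows for px in row]
    return sum(vals) / len(vals)

def blue_above(image, threshold=0.5):
    """True iff the upper half of the image is dominated by blue (channel 2)."""
    half = len(image) // 2
    return mean_channel(image[:half], 2) > threshold

# Two rows of "sky" over two rows of "grass":
sky_over_grass = [[(0.1, 0.2, 0.9)] * 4] * 2 + [[(0.1, 0.8, 0.2)] * 4] * 2
blue_above(sky_over_grass)  # -> True: an outdoor-scene indicator bit
```

A learned version would have to discover both the choice of channel and the threshold; the point here is only that the output is a single symbolic bit, not a pixel map.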
> Well, not "my call" -- this needs to be a collaborative project, and I have no desire to project an authoritarian personality. So, more like "your call", but I want you to make the right decision, based on an understanding of what the project actually is. Given that this is experimental and exploratory, it is entirely normal that the process will be filled with bad decisions and failed designs.
>
> For bring-up, to develop and prove that a shape grammar can actually be learned, I suppose that some pre-existing segmentation model might be OK, maybe. It makes me nervous, though, because it builds in a component that might be hard to remove later. Also, I think that learning the shape grammar is easy. People have talked about shape grammars for 50 years; it's not a new or novel concept; it should not be that hard. The hard part is learning filter sequences that produce useful outputs.
>
> ... and my philosophy of development is to focus on the hard parts first. It's always easy to do the easy parts later.
>
> > I can't really opine about anything related to doing the "classic" Computer Vision -- my only guess is that you can probably hook up OpenCV with AtomSpace.
>
> Yes, and that would be an important part of the project. The tricky part is how not to waste too much time on this -- how to hook up just enough to get the basic ideas working.
>
> This means picking half-a-dozen or a dozen basic image operations -- hue and brightness filters, maybe some edge detectors, Laplacians, threshold filters -- and figuring out how to compose them together.
> In Atomese, it would look something like this:
>
>     (GreaterThanLink 0.5 ; select blue values
>         (HueFilterLink (Number 0.0 0.0 1.0) ; RGB: no red, no green, only blue
>             (HaarWavelet (Number 0 1) ; lowest-order Haar in vertical direction; none in horizontal
>                 (VariableNode "x"))))
>
> The above specifies a filter arrangement in Atomese that would get bound to a specific OpenCV pipe when (Variable "x") is bound to a webcam or photo. The above is just an example -- the learning process would attempt different combinations of such things, and vary the parameters, looking for "meaningful" pipelines and parameters.
>
> Note that there are no pixels and no segmentation in the above. I guess we could have a (ConvexContiguousRegionLink ...) that detects a convex group of pixels that are mostly the same color... but this does not seem required right now.
>
> In terms of AI, this is again nothing new: people have been writing evolutionary algorithms to automatically discover these kinds of processing pipelines for many decades -- at least, they were before deep learning. I think the progress in deep learning has halted work on such ideas, because the neural nets work so fast and so well. So what I'm proposing above is a big step backwards -- a big step backwards in time, a big step backwards in computational ability, compared to neural nets. The hoped-for gain is to have explicit symbolic control over the elements in the pipeline. The learned pipelines and parameters may be pretty random-looking, but they will have an explicit symbolic representation, thus making them open to reasoning, inference, deduction, and assorted abstract symbolic manipulations, which neural nets cannot do.

Yeah, this is clear to me now -- the grammar-learning part is kind of a given; the real question is how well this "image predicate" learning can go... This is a deep question, as no one is even sure why neural nets themselves work so well.
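[Editorial aside: one concrete reading of the Atomese example above, as a toy interpreter. The helper names are invented and the semantics are an assumption on my part -- in particular, taking the "lowest-order vertical Haar" response as mean(upper half) - mean(lower half) per channel, projecting that response onto a color direction, and thresholding it to get the one-bit answer. The real pipeline would be compiled to OpenCV by the AtomSpace, not evaluated like this.]

```python
# Toy evaluation of: (GreaterThan 0.5 (HueFilter (0 0 1) (HaarWavelet (0 1) x)))
# on an HxW grid of (r, g, b) floats in [0, 1].

def haar_vertical(image):
    """Per-channel lowest-order vertical Haar response: upper mean - lower mean."""
    half = len(image) // 2
    def mean(rows, c):
        vals = [px[c] for row in rows for px in row]
        return sum(vals) / len(vals)
    return tuple(mean(image[:half], c) - mean(image[half:], c) for c in range(3))

def hue_filter(weights, response):
    """Project an RGB response onto a color direction, e.g. (0, 0, 1) = blue only."""
    return sum(w * r for w, r in zip(weights, response))

def filter_tree(image, threshold=0.5):
    """One-bit result of the whole filter arrangement."""
    return hue_filter((0.0, 0.0, 1.0), haar_vertical(image)) > threshold

sky_over_grass = [[(0.1, 0.2, 0.9)] * 4] * 2 + [[(0.1, 0.8, 0.2)] * 4] * 2
filter_tree(sky_over_grass)  # blue above, not below -> True
```

The learner's job, in this picture, would be to propose different tree shapes and numeric arguments (the color weights, the wavelet orders, the threshold) and keep the ones whose output bits turn out to be statistically meaningful.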
What needs clarification is what the structure of this filter learning would be -- what is the algorithm, and what direct learning objective is it given? In the above example, where are all these filters and numerical arguments even coming from? The numerical part is especially difficult, given that you seemingly want to get some symbolic structure out of it.

Going back to neural nets, the obvious problem is that if we make one big neural "filter" then you don't know what is going on inside -- so the learning will be "shallower". The question is how much of a problem this really is. Is learning down to the low-level filtering operations a viable approach right now? An interesting research question is whether you could train a neural net that can be "queried", possibly in natural language or some simple formal one, so that the system on top of it can learn to "extract" various statements about an image from it -- these predicates would essentially be hooked to queries that get sent to the underlying model. Technically this probably falls somewhere in the Visual Question Answering field... the challenge is that these models are trained to answer questions about more abstract things like objects, not low-level features of the image.

The final big question is what you can really do once you have that grammar. What sort of inferences? How useful are they? The key thing here is that if you have, say, a system that classifies pictures, and being built on top of this whole grammar-and-filter learning pipeline means it doesn't achieve competitive performance with neural nets, then it's difficult to see what its comparative advantage is -- beyond the obvious advantage of interpretability, but that won't save the solution if its performance is considerably lower. Well, the problem is not really with grammars, which can definitely be useful; but if that "filter sequence" part works poorly, then it will bottleneck the performance of the entire system.
If that low-level layer outputs garbage, then all the upper layers get garbage, and we know what happens when you have garbage inputs in this field...

> > Also, a handcrafted segmentator is still a human-engineered solution -- I don't know if it's really simpler, because it requires more domain-specific knowledge to understand and modify, and is not going to be very robust.
>
> Right. So maybe these should be avoided.
>
> > Anyway, for now I don't see much more to discuss, actually; when you have decided what data to use, we can just move on to implementing the basic functionality.
> >
> > By the way, there is also research into unsupervised segmentation, with models like MONet or GENESIS that could be trained on arbitrary data to try to figure out what things to segment as "anonymous" objects.
>
> I saw that other email thread; I'll respond to it later (a few days, maybe). We do need a source of test images. This could be (for the above example) a collection of outdoor photos with blue skies in them. But maybe also a collection of children's toys on a table, with photos taken from many different angles -- the different angles would cause the filters to learn about objects in space. A second, more difficult training set would involve the same toys, rearranged in different locations.
>
> --linas
>
> > It is still in fairly early stages, though -- those models handle just 64x64-pixel images (now with colour, thankfully) and of course are not particularly cheap to train... on an easy dataset, perhaps using one of the findable checkpoints from such pre-trained models would work; that would have to be tested.
> >
> > On Wednesday, 15 September 2021 at 04:53:18 UTC+2 linas wrote:
> >>
> >> Trimming back the first part of the conversation...
> >> On Tue, Sep 14, 2021 at 9:09 AM Adrian Borucki <[email protected]> wrote:
> >> >
> >> > Here's the part I have questions about: how do you deal with the fact that the regions won't often be connected?
> >>
> >> I don't understand the question. What regions? Where did they come from? What do you mean by "region"?
> >>
> >> > I am familiar with the idea of using Region Connection Calculus, mentioned in places like "Symbol Grounding via Chaining of Morphisms" and chapter 17 on spatio-temporal inference from EGI vol. 2.
> >>
> >> I'm not familiar with this. What is a "region"?
> >>
> >> > And it seems you have to use fuzzy versions of these relationships because,
> >>
> >> Sorry, fuzzy versions of what relationship?
> >>
> >> > using the face-grammar example, you won't get a situation where, for instance, detected eye regions
> >>
> >> Detecting eyes will be very hard; that won't be possible until a rather large and complex software stack is working. That's why I suggested starting with something simple -- for vision, detecting tricolor flags in canonical position, or maybe a video camera aimed at a room or a sidewalk or street where there is low activity. For audio, perhaps shifts in volume and frequency distribution.
> >>
> >> I dunno -- do I need to try to think of other simple data streams? I guess so... Some people have recommended that video games be used as input, but I really don't like that... it seems too artificial. It's problematic, for multiple reasons. What other kind of visual input is simple enough to process, to be debuggable, as a proof-of-concept?
> >>
> >> > (like bounding boxes from an object detector) are exactly connected together -- there is going to be some distance in between.
> >>
> >> What bounding boxes? Why would bounding boxes be needed? What would you do with them?
> >>
> >> > So how do you deal with this?
> >>
> >> You splatted a bunch of questions without defining any of the terminology, so I don't know how to respond... you seem to be thinking of something very different from what I'm thinking of... but I can't tell what that is...
> >>
> >> > The STI chapter mentions certain computational difficulties with the fuzzy approach, and proposes that, using some crude assumptions, you could have something that could then be trained on a dataset to further improve it.
> >>
> >> What STI chapter? What fuzzy approach? Why do we need fuzzy-anything? I thought I spelled out a rather specific, precise algorithm; the word "fuzzy" did not appear in it...
> >>
> >> > Is this part of the "learn" project, or is there some other approach to it?
> >>
> >> The "learn" project has maybe 300+ pages of docs, but the basic ideas are spelled out in a bunch of READMEs and overviews. It is possible that these fail to communicate the ideas correctly, and... that's fixable, but will take some time. I'd rather exchange emails and take steps one at a time, rather than send you out to read hundreds of pages of stuff...
> >>
> >> > Well, we can reuse some of those for our purposes -- a generic object-detection model can be used to spot all sorts of things in an image,
> >>
> >> Sure, but it will take many years, if not a decade, to build a "generic object-detection model". I don't think this is something easy or quick -- that's the end-point, not the start point.
> >>
> >> > we just need to find one that was trained with a taxonomy that suits us.
> >>
> >> Learning the "taxonomy" would be a rather advanced stage of the project. I'm sort-of-ish exploring some basic aspects of something like that at the NLP level, but so far it's mostly ideas and very little functional code. It will be at least a year, and probably a lot more, before we can learn a taxonomy of visual or audio inputs.
> >> There's a huge amount of preliminaries that have to be gotten out of the way.
> >>
> >> > Using such models with OpenCog has been done already, by Alexei Potapov et al., if I remember correctly. It's mostly a matter of adapting that scheme to the specifics of this project.
> >>
> >> ? Alexei is working on something else entirely. I don't know what it is, but it's pretty much totally unrelated to what I'm working on... unless he's keeping some secrets from me...
> >>
> >> > The challenge is, as always, to find data that has a model that can detect the things we want -- with faces, for example, I can't find detectors for face parts, but I can find models detecting key points, which include mouths and eyes.
> >>
> >> OK, this is a misunderstanding. The goal is to NOT use some pre-trained, pre-built, human-engineered face detector trained on a corpus carefully curated by humans. If you are using systems that are hand-crafted and hand-curated by humans, it's not AGI any more. I'm very much trying to go in the exact opposite direction. The goal is to get the human data-engineering out of the loop.
> >>
> >> Detecting faces will be hard. It *might* be possible, maybe, once everything is wired up, tested, debugged, tuned, tweaked, re-designed and re-written a few times. I doubt face detection will be achievable any sooner than a year from now, and that's only if it's a year of full-time hard work and a whole lot of luck. Otherwise, I think face detection is probably out of reach for the short term. Lots of much more basic things have to come together first.
> >>
> >> I dunno, maybe one could get magically lucky, but I doubt it...
> >>
> >> -- linas
> >>
> >> --
> >> Patrick: Are they laughing at us?
> >> Sponge Bob: No, Patrick, they are laughing next to us.
> >
> > --
> > You received this message because you are subscribed to the Google Groups "opencog" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> > To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/4447730d-70d7-47ac-b225-a0c19c36d64fn%40googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/996644d8-ac1c-4e96-a227-3d0804905819n%40googlegroups.com.
