On Wednesday, 15 September 2021 at 13:59:19 UTC+2 Adrian Borucki wrote:

> Okay, some clarification is needed because there is a sentence
> > Now, if you have some filter collection that is able to detect eyes, 
> mouths and foreheads
>
> That suggests using some pre-existing (i.e. human-engineered) solution to 
> find things like eyes and mouths.
> That said, I’ve looked again into the README and see that you specifically 
> mention segmentation as an image processing step.
> That makes sense, as segmentation means assigning a label to each pixel of 
> the image. That means every region is going to be connected to something 
> (in the worst case, that something is the background).
>
Eh, sorry for the confusion: I shouldn’t have used the term “connected”. 
I conflated two different contexts that happen to use the same word; 
“adjacent” is the better term. Also, adjacency is obviously not required 
for directional relations: it’s not necessary for two things to be 
adjacent to calculate that, say, one of them is to the left of the other.
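To make that concrete, here is a minimal sketch of computing such a directional relation from two bounding boxes that are nowhere near adjacent (plain Python; the `(x_min, y_min, x_max, y_max)` box convention is just my assumption for illustration, not anything from the project):

```python
def left_of(a, b):
    """True if box `a` lies entirely to the left of box `b`.

    Boxes are (x_min, y_min, x_max, y_max); the two boxes may be
    arbitrarily far apart -- no adjacency is required.
    """
    return a[2] <= b[0]  # a's right edge at or before b's left edge

# Two detections with a clear gap between them: still clearly left/right.
eye  = (10, 20, 30, 35)
nose = (45, 30, 60, 55)
print(left_of(eye, nose))   # True
print(left_of(nose, eye))   # False
```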
 

>
> All in all, I think using a pre-existing segmentation model would for now 
> simplify the project but that is your call to make of course.
> I can’t really opine about anything related to doing the “classic” 
> Computer Vision — my only guess is that you can probably hook up OpenCV 
> with AtomSpace. Also a handcrafted segmenter is still a human-engineered 
> solution — I don’t know if it’s really simpler because it requires more 
> domain-specific knowledge to understand and modify and is not going to be 
> very robust.
> Anyway, for now I don’t see much more to discuss, actually; once you have 
> decided what data to use, we can just move on to implementing the basic 
> functionality.
>
> By the way, there is also research into unsupervised segmentation with 
> models like MONet or GENESIS that could be trained on arbitrary data to try 
> to figure out what things to segment as “anonymous” objects.
> It is still at a fairly early stage though: those models handle just 64x64 
> pixel images (now with colour, thankfully) and of course are not 
> particularly cheap to train… on an easy dataset, perhaps using one of the 
> findable checkpoints from such pre-trained models would work; that would 
> have to be tested.
>
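As a toy illustration of the per-pixel-label point above: once every pixel carries a label, an adjacency graph between regions falls out of a simple scan. This is a pure-Python sketch, not anything from the actual project code; 0 stands for the background label:

```python
def adjacency(labels):
    """Build the set of adjacent label pairs from a 2-D label map.

    `labels` is a list of rows of integer labels (0 = background).
    Two labels count as adjacent if any of their pixels are
    4-connected (share an edge).
    """
    pairs = set()
    h, w = len(labels), len(labels[0])
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):   # right and down neighbours
                ny, nx = y + dy, x + dx
                if ny < h and nx < w and labels[y][x] != labels[ny][nx]:
                    pairs.add(frozenset((labels[y][x], labels[ny][nx])))
    return pairs

# Regions 1 and 2 touch each other; both touch the background 0.
seg = [[0, 0, 0, 0],
       [0, 1, 2, 0],
       [0, 1, 2, 0],
       [0, 0, 0, 0]]
print(adjacency(seg))   # three pairs: {0, 1}, {1, 2}, {0, 2}
```

Since every non-background region borders at least the background, the graph is never empty, which is exactly the "everything is connected to something" observation.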
> On Wednesday, 15 September 2021 at 04:53:18 UTC+2 linas wrote:
>
>> Trimming back the first part of the conversation... 
>>
>> On Tue, Sep 14, 2021 at 9:09 AM Adrian Borucki <[email protected]> 
>> wrote: 
>>
>> > 
>> > Here’s the part I have questions about: how do you deal with the fact 
>> that the regions won’t often be connected? 
>>
>> I don't understand the question. What regions? Where did they come 
>> from? What do you mean by "region"? 
>>
>> > I am familiar with the idea of using Region Connection Calculus 
>> mentioned in places like “Symbol Grounding via Chaining of Morphisms” and 
>> chapter 17 on spatio-temporal inference from EGI vol. 2. 
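For readers who haven't met it: Region Connection Calculus classifies how two regions relate spatially (disconnected, touching, overlapping, contained, equal). For axis-aligned boxes, a crude crisp version reduces to a few interval comparisons. The sketch below is my own simplification, not the fuzzy variant the EGI chapter discusses:

```python
def rcc_boxes(a, b):
    """Classify two axis-aligned boxes (x0, y0, x1, y1) into a coarse
    RCC-style relation: DC (disconnected), EC (externally connected),
    PO (partial overlap), PP (proper part), PPi (its inverse), or EQ."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    if ax1 < bx0 or bx1 < ax0 or ay1 < by0 or by1 < ay0:
        return "DC"    # no contact at all
    if ax1 == bx0 or bx1 == ax0 or ay1 == by0 or by1 == ay0:
        return "EC"    # boundaries touch, interiors do not
    if a == b:
        return "EQ"
    if ax0 >= bx0 and ay0 >= by0 and ax1 <= bx1 and ay1 <= by1:
        return "PP"    # a inside b
    if bx0 >= ax0 and by0 >= ay0 and bx1 <= ax1 and by1 <= ay1:
        return "PPi"   # b inside a
    return "PO"        # interiors overlap partially

print(rcc_boxes((0, 0, 2, 2), (5, 5, 9, 9)))   # DC: regions with a gap
print(rcc_boxes((0, 0, 4, 4), (1, 1, 2, 2)))   # PPi: second box inside first
```

The "regions won't often be connected" worry is then just the observation that detector outputs usually land in the DC case.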
>>
>> I'm not familiar with this. What is a "region"? 
>>
>> > And it seems you have to use fuzzy versions of these relationships 
>> because, 
>>
>> Sorry, fuzzy version of what relationship? 
>>
>> > using the face grammar example, you won’t get a situation where, for 
>> instance, detected eye regions 
>>
>> Detecting eyes will be very hard; that won't be possible until a 
>> rather large and complex software stack is working. That's why I 
>> suggested starting with something simple -- for vision, detecting 
>> tricolor flags in canonical position. Or maybe a video camera aimed at 
>> a room or a sidewalk or street where there is low activity. For audio, 
>> perhaps shifts in volume and frequency distribution. 
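To illustrate how simple the tricolor-flag starting point could be: split the frame into three vertical bands and check each band for near-uniform colour. A pure-Python sketch on a nested-list image (the tolerance value is made up, and a learned system would of course discover rather than hard-code this test):

```python
def is_tricolor(img, tol=30):
    """Return True if `img` (rows of (r, g, b) pixels) looks like a
    vertical tricolor: three equal-width bands, each near-uniform."""
    h, w = len(img), len(img[0])
    band_w = w // 3
    for band in range(3):
        cols = range(band * band_w, (band + 1) * band_w)
        pixels = [img[y][x] for y in range(h) for x in cols]
        mean = [sum(p[c] for p in pixels) / len(pixels) for c in range(3)]
        # every pixel in the band must stay close to the band's mean colour
        if any(abs(p[c] - mean[c]) > tol for p in pixels for c in range(3)):
            return False
    return True

# A 2x6 "flag": blue | white | red
row = [(0, 0, 255)] * 2 + [(255, 255, 255)] * 2 + [(255, 0, 0)] * 2
print(is_tricolor([row, row]))   # True
```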
>>
>> I dunno -- do I need to try to think of other simple data streams? I 
>> guess so ... Some people have recommended that video games be used as 
>> input, but I really don't like that ... it seems too artificial. It's 
>> problematic, for multiple reasons. What other kind of visual input is 
>> simple enough to process, to be debuggable, as a proof-of-concept? 
>>
>> > (like bounding boxes from an object detector) are exactly connected 
>> together — there is going to be some distance in between. 
>>
>> What bounding boxes? Why would bounding boxes be needed? What would 
>> you do with them? 
>>
>> > So how do you deal with this? 
>>
>> You splatted a bunch of questions without defining any of the 
>> terminology, so I don't know how to respond... you seem to be thinking 
>> of something very different from what I'm thinking of ... but I can't 
>> tell what that is ... 
>>
>> > The STI chapter mentions certain computational difficulties with the 
>> fuzzy approach and proposes that using some crude assumptions you could 
>> have something that could then be trained on a dataset to further improve 
>> it. 
>>
>> What STI chapter? What fuzzy approach? Why do we need fuzzy-anything? 
>> I thought I spelled out a rather specific, precise algorithm; the word 
>> "fuzzy" did not appear in it ... 
>>
>> > Is this part of the “learn” project or is there some other approach to 
>> it? 
>>
>> The "learn" project has maybe 300+ pages of docs, but the basic ideas 
>> are spelled out in a bunch of README's and overviews. It is possible 
>> that these fail to communicate the ideas correctly, and ... that's 
>> fixable, but will take some time. I'd rather exchange emails and 
>> take steps one-at-a-time, rather than send you out to read hundreds of 
>> pages of stuff... 
>>
>> > Well, we can reuse some of those for our purposes — a generic object 
>> detection model can be used to spot all sorts of things on an image, 
>>
>> Sure, but it will take many years if not a decade to build a "generic 
>> object detection model". I don't think this is something easy or 
>> quick -- that's the end-point, not the start point. 
>>
>> > we just need to find one that was trained with a taxonomy that suits 
>> us. 
>>
>> Learning the "taxonomy" would be a rather advanced stage of the 
>> project. I'm sort-of-ish exploring some basic aspects of something 
>> like that at the NLP level, but so far, it's mostly ideas and very 
>> little functional code. It will be at least a year and probably a lot 
>> more, before we can learn taxonomy of visual or audio inputs. There's 
>> a huge amount of preliminaries that have to be gotten out of the way. 
>>
>> > Using such models with OpenCog has been done already by Alexei Potapov 
>> et al. if I remember correctly. It’s mostly a matter of adapting that 
>> scheme to the specifics of this project. 
>>
>> ? Alexei is working on something else entirely. I don't know what it 
>> is, but it's pretty much totally unrelated to what I'm working on... 
>> unless he's keeping some secrets from me ... 
>>
>> > The challenge is, as always, to find data that has a model that can 
>> detect things we want — with faces for example I can’t find detectors for 
>> face parts but I can find models detecting key points, which includes 
>> mouths and eyes. 
>>
>> OK, this is a misunderstanding. The goal is to NOT use some 
>> pre-trained, pre-built, human-engineered face detector trained on a 
>> corpus carefully curated by humans. If you are using systems that are 
>> hand-crafted, hand-curated by humans, it's not AGI any more. I'm very 
>> much trying to go in the exact opposite direction. The goal is to get 
>> the human data-engineering out of the loop. 
>>
>> Detecting faces will be hard. It *might* be possible, maybe, once 
>> everything is wired up, tested, debugged, tuned, tweaked, re-designed 
>> and re-written a few times. I doubt face detection will be achievable 
>> any sooner than a year from now, and that's only if it's a year of 
>> full-time hard work and a whole lot of luck. Otherwise, I think face 
>> detection is probably out of reach, for the short term. Lots of much 
>> more basic things have to come together, first. 
>>
>> I dunno, maybe one could get magically lucky, but I doubt it ... 
>>
>> -- linas 
>>
>>
>> -- 
>> Patrick: Are they laughing at us? 
>> Sponge Bob: No, Patrick, they are laughing next to us. 
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/d6b0d3b3-c047-4ad1-83d9-f65eb4be2205n%40googlegroups.com.
