Hi Ben,

By RCC, I guess you mean the "region connection calculus"? This isn't that. This is more like moses-for-images. Except it's unsupervised. So more like a "pattern miner for images". Except it's not using the pattern-miner infrastructure; it's using the vector+matrix infrastructure.
--linas

On Tue, Sep 21, 2021 at 8:25 PM 'Ben Goertzel' via opencog <[email protected]> wrote:
>
> Hi, the RCC stuff was work done by me and Keyvan Sadeghi quite some years ago, which was paused not because it wasn't working but because Keyvan moved on to other stuff...
>
> Linas never dealt with that stuff, as far as I recall ...
>
> I think to make that sort of approach work scalably, you would need to use a hybrid inference engine that uses a specialized prover for fuzzy-RCC, interoperating with a general-purpose PLN prover for general conceptual relationships among the entities occupying the regions... But we never got there and shifted attention to other things...
>
> ben
>
> On Tue, Sep 14, 2021 at 7:10 AM Adrian Borucki <[email protected]> wrote:
> >
> > On Monday, 13 September 2021 at 19:53:55 UTC+2 linas wrote:
> >>
> >> On Mon, Sep 13, 2021 at 6:49 AM Adrian Borucki <[email protected]> wrote:
> >> >
> >> > On Sunday, 12 September 2021 at 18:55:23 UTC+2 linas wrote:
> >> >>
> >> >> On Sun, Sep 12, 2021 at 8:29 AM Adrian Borucki <[email protected]> wrote:
> >> >> >
> >> >> >> ----
> >> >> >> As to divine intervention vs. bumbling around: I'm still working on unsupervised learning, which I hope will someday be able to learn the rules of (common-sense) inference. I think I know how to apply it to audio and video data, and am looking for anyone who is willing to get neck-deep in both code and theory. In particular, for audio and video, I need someone who knows GPU audio/video processing libraries, and is willing to learn how to wrap them in Atomese. For starters.
> >> >> >
> >> >> > I might have some time to help with this - I've only done a bit of video/audio processing for ML, but I have some familiarity with the AtomSpace, so that part should be easier.
> >> >>
> >> >> Wow! That would be awesome!
> >> >>
> >> >> I thought some more about the initial steps. A large part of this would be setting up video/audio filters to run on GPUs, with the goal of being able to encode the filtering pipeline in Atomese -- so that expressions like "apply this filter, then that filter, then combine this and that" are stored as expressions in the AtomSpace.
> >> >>
> >> >> The research program would then be to look for structural correlations in the data. Generate some "random" filter sequences (building on previously "known good" filter structures) and see if they have "meaningful" correlations in them. Build up a vocabulary of "known good" filter sequences.
> >> >>
> >> >> One tricky part is finding something simple to start with. I imagined the local webcam feed: it should be able to detect when I'm in front of the keyboard, and when not, and rank that as an "interesting" fact.
> >> >
> >> > Sounds like something that would be processed with a library like OpenCV — it's important to distinguish between loading the video data and running GPU-accelerated operations on it. My experience with the latter is limited, as such operations are usually wrapped by some library like PyTorch or RAPIDS. Also, there is a difference between running something online vs. batch-processing a dataset — you mostly gain from GPU acceleration with the latter, unless it's something computationally expensive that has to run in real time.
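To make the "boxes and wires" idea concrete, here is a minimal sketch, in plain Python with OpenCV, of the kind of filter pipeline described above. The filter functions, their parameters, and the webcam usage are illustrative assumptions rather than project code; in the actual proposal, each box and the sequencing between boxes would be represented as Nodes and Links in the AtomSpace, not hard-coded as a Python list.

    import cv2
    import numpy as np

    # Each "box" in the wiring diagram is a named operation with a few control
    # parameters; a pipeline is just an ordered sequence of boxes.
    def to_gray(frame):
        return cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    def blur(frame, ksize=5):
        return cv2.GaussianBlur(frame, (ksize, ksize), 0)

    def edges(frame, lo=50, hi=150):
        return cv2.Canny(frame, lo, hi)

    def run_pipeline(frame, boxes):
        """Apply each filter box, in order, to a single video frame."""
        out = frame
        for box in boxes:
            out = box(out)
        return out

    if __name__ == "__main__":
        cap = cv2.VideoCapture(0)               # local webcam feed
        ok, frame = cap.read()
        if ok:
            result = run_pipeline(frame, [to_gray, blur, edges])
            print("edge pixels:", int(np.count_nonzero(result)))
        cap.release()

Putting the same structure into the AtomSpace is what would let an automated search rearrange the sequence and re-tune the parameters, which is exactly what a hard-coded script like this cannot do.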
> >> >
> >> > First, we need to elucidate what the actual "filters" are supposed to be — when we have a list, I can think about how the operations would be run.
> >> > Second, if you don't have an existing dataset that we can use, then we have to build one; that is probably the most time- and resource-consuming task here… it probably should be done first, actually.
> >> > There are existing video datasets that might be useful; it's worth looking into those.
> >>
> >> Good. Before that, though, I think we need to share a general vision of what the project "actually is", because that will determine datasets, libraries, etc. I tried to write this down in a file, https://github.com/opencog/learn/blob/master/README-Vision.md -- but it is missing important details, so let me try an alternate sketch.
> >>
> >> So here's an anecdote from Sophia the Robot: she had this habit of trying to talk through an audience clapping. Basically, she could not hear, and didn't know to pause when the audience clapped. (Yes, almost all her performances are scripted. Some small fraction are ad-libbed.) A manual operator in the audience would have to hit a pause button to keep her from rambling on. So I thought: "How can I build a clap detector?" Well, it would have to be some kind of audio filter -- some level of white noise (broad-spectrum noise), but with that peculiar clapping sound (so, not pure white noise, but dense shot noise), elevated above a threshold T for some time period S at least one second long. It is useful to think of this as a wiring diagram: some boxes connected with lines; each box might have some control parameters: length, threshold, time, frequency.
> >>
> >> So how do I build a clap detector? Well, download some suitable audio library, get some sound samples, and start trying to wire up some threshold detector *by hand*. Oooof. Yes, you can do it that way: classical engineering. After that, you have a dozen different other situations: booing. Laughing. Tense silence. Chairs scraping. And after that, a few hundred more... it's impossible to hand-design a filter set for every interesting case. So, instead: unleash automated learning. That is, represent the boxes and wires as Nodes and Links in the AtomSpace (the audio stream itself would be an AudioStreamValue) and let some automated algo rearrange the wiring diagram until it finds a good one.
> >>
> >> But what is a "good wiring diagram"? Well, the current very fashionable approach is to develop a curated, labelled training set, and train on that. "Curated" means "organized by humans" (Ooof-dah, humans in the loop again!) and "labelled" means each snippet has a tag: "clapping" - "cheering" - "yelling". (Yuck. What kind of yelling? Happy? Hostile? Asking for help? Are the labels even correct?) This might be the way people train neural nets, but really, it's the wrong approach for AGI. I don't want to do supervised training. (I mean, we could do supervised training in the opencog framework, but I don't see any value in that, right now.) So, let's do unsupervised training.
> >>
> >> But how? Now for a conceptual leap. This leap is hard to explain in terms of audio filters (it's rather abstract), so I want to switch to vision, before getting back to audio.
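Before the jump to vision, the audio example can be made concrete. Here is a minimal numpy sketch of the kind of hand-wired clap/applause detector described above: broadband, noise-like energy above a threshold, sustained for at least a second. The window length, the two thresholds, and the synthetic test signal are all made-up illustrative values; the point of the project is precisely that this wiring should eventually be discovered automatically rather than coded by hand like this.

    import numpy as np

    def clap_like_segments(samples, rate, energy_thresh=0.1,
                           flatness_thresh=0.3, min_dur=1.0, win=0.05):
        """Crude hand-built detector: flag stretches where short-time energy is
        high AND the spectrum is noise-like (flat), sustained for min_dur seconds."""
        n = int(win * rate)
        hits = []
        for i in range(0, len(samples) - n, n):
            frame = samples[i:i + n]
            rms = np.sqrt(np.mean(frame ** 2))
            power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
            flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
            hits.append(rms > energy_thresh and flatness > flatness_thresh)
        need = int(min_dur / win)               # frames the condition must persist
        segments, run, start = [], 0, 0
        for j, h in enumerate(hits + [False]):
            if h:
                if run == 0:
                    start = j
                run += 1
            else:
                if run >= need:
                    segments.append((start * win, (start + run) * win))
                run = 0
        return segments

    if __name__ == "__main__":
        rate = 16000
        quiet = 0.01 * np.random.randn(2 * rate)     # two seconds of near-silence
        applause = 0.5 * np.random.randn(2 * rate)   # two seconds of loud broadband noise
        print(clap_like_segments(np.concatenate([quiet, applause]), rate))

Every constant here (window, thresholds, minimum duration) is exactly the kind of control parameter that would sit on a box in the wiring diagram.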
> >> For vision, I claim there exists something called a "shape grammar". I hinted at this in the last email. A human face has a shape to it - a pair of eyes, symmetrically arranged above a mouth, in good proportion, etc. This shape has a "grammar" that looks like this:
> >>
> >>     left-eye: (connects-to-right-to-right-eye) and (connects-below-to-mouth) and (connects-above-to-forehead);
> >>     forehead: (connects-below-to-left-eye) and (connects-below-to-right-eye) and (connects-above-to-any-background);
> >>
> >> Now, if you have some filter collection that is able to detect eyes, mouths and foreheads, you can verify whether you have detected an actual face by checking against the above grammar. If all of the connectors are satisfied, then you have a "grammatically correct
> >
> > Here's the part I have questions about: how do you deal with the fact that the regions won't often be connected? I am familiar with the idea of using the Region Connection Calculus, mentioned in places like "Symbol Grounding via Chaining of Morphisms" and chapter 17 on spatio-temporal inference from EGI vol. 2. And it seems you have to use fuzzy versions of these relationships because, using the face-grammar example, you won't get a situation where, for instance, detected eye regions (like bounding boxes from an object detector) are exactly connected together — there is going to be some distance in between.
> >
> > So how do you deal with this? The spatio-temporal inference chapter mentions certain computational difficulties with the fuzzy approach and proposes that, with some crude assumptions, you could have something that could then be trained on a dataset to further improve it. Is this part of the "learn" project, or is there some other approach to it?
> >
> >> description of a face". So, although your filter collection was plucking eye-like and mouth-like features out of an image, the fact that they could be arranged into a grammatically-correct arrangement raises your confidence that you are seeing a face.
> >>
> >> Those people familiar with Link Grammar will recognize the above as a peculiar variant of a Link-Grammar dictionary. (And thus I am cc'ing the mailing list.)
> >>
> >> But where did the grammar come from? For that matter, where did the eye and mouth filters come from? It certainly would be a mistake to have an army of grad students writing shape grammars by hand. The grammar has to be learned automatically, in an unsupervised fashion... and that is what the opencog/learn project is all about.
> >>
> >> At this point, things become highly abstract very quickly, and I will cut this email short. Very roughly, though: one looks for pair-wise correlations in the data. Having found good pairs, one then draws maximum spanning trees (or maximum planar graphs) with those pairs, and extracts frequently-occurring vertex-types and their associated connectors. That gives you a raw grammar. Generalization requires clustering specific instances of this into general forms. I'm working on those algos now.
> >>
> >> The above can learn (should be able to learn) both a "shape grammar" and also a "filter grammar" ("meaningful" combinations of processing filters; meaningful, in that they extract correlations in the data).
> >>
> >> So that is the general idea. Now, to get back to your question: what sort of video (or audio) library? What sort of dataset? I dunno. Beats me.
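On Adrian's question above about regions that do not literally touch: nothing in the thread settles the mechanism, but one plausible, purely illustrative treatment is to read each connector as a fuzzy spatial predicate with a tolerance, so that bounding boxes coming out of an off-the-shelf detector can still satisfy the grammar even when there is a gap between them. The class, predicate names, tolerances and example coordinates below are all invented for the sketch.

    from dataclasses import dataclass

    @dataclass
    class Box:
        label: str
        x: float   # center x, pixels
        y: float   # center y, pixels
        w: float   # width
        h: float   # height

    def roughly_right_of(a, b, tol=4.0):
        """Fuzzy 'connects-to-right': b lies to the right of a, at roughly the
        same height, and no farther away than tol widths of a."""
        return b.x > a.x and abs(b.y - a.y) < a.h and (b.x - a.x) < tol * a.w

    def roughly_below(a, b, tol=6.0):
        """Fuzzy 'connects-below': b lies below a, roughly aligned horizontally,
        and no farther away than tol heights of a."""
        return b.y > a.y and abs(b.x - a.x) < b.w and (b.y - a.y) < tol * a.h

    def face_grammar_satisfied(left_eye, right_eye, mouth):
        """All connectors of the toy face grammar are (fuzzily) satisfied."""
        return (roughly_right_of(left_eye, right_eye)
                and roughly_below(left_eye, mouth)
                and roughly_below(right_eye, mouth))

    # Plausible detections from some keypoint or bounding-box model, in pixels.
    le = Box("left-eye", 100, 120, 30, 15)
    re = Box("right-eye", 170, 122, 30, 15)
    mo = Box("mouth", 135, 200, 60, 25)
    print(face_grammar_satisfied(le, re, mo))   # True

Whether such tolerances are hand-set, as here, or themselves learned (closer to the fuzzy-RCC route Ben mentions at the top of the thread) is essentially the question Adrian is raising.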
> >> Best to start small: find some incredibly simple problem, and prove that the general idea works on that. Scale up from there. You get to pick that problem, according to taste.
> >>
> >> One idea was to build a "French flag detector": this should be "easy": it's just three color bars, side by side. The grammar is very simple. The training set might be a bunch of French flags. Now, if the goal is to ONLY learn the shape grammar, then you have to hack up, by hand, some ad-hoc color and hue and contrast filters. If you want to learn the filter grammar, then ... well, that's a lot harder for vision, because almost all images are extremely information-rich. The training corpus would have to be selected to be very simple: only those flags in canonical position (not draped). Then, either one has extremely simple backgrounds, or one has a very large corpus, as otherwise you risk training on something in the background, instead of the flags.
> >>
> >> For automated filter-grammars, perhaps audio is simpler? Because most audio samples are not as information-rich as video/photos?
> >>
> >> I dunno. This is where it becomes hard. Even before all the fancy theory and what-not, finding a suitable toy problem that is solvable without a hopeless amount of CPU processing and practical stumbling blocks... that's hard. Even worse is that state-of-the-art neural-net systems have billions of CPU-hours behind them, computed with well-written, well-debugged, highly optimized software, created by armies of salaried PhDs working at the big tech companies. Any results we get will look pathetic, compared to what those systems can do.
> >
> > Well, we can reuse some of those for our purposes — a generic object-detection model can be used to spot all sorts of things in an image; we just need to find one that was trained with a taxonomy that suits us. Using such models with OpenCog has already been done by Alexei Potapov et al., if I remember correctly. It's mostly a matter of adapting that scheme to the specifics of this project.
> >
> > The challenge is, as always, to find data for which there is a model that can detect the things we want — with faces, for example, I can't find detectors for face parts, but I can find models that detect key points, which include mouths and eyes (like this library: https://github.com/open-mmlab/mmpose with this dataset: https://github.com/jin-s13/COCO-WholeBody).
> >
> >> The reason I find it promising is this: all those neural-net systems do is supervised training. They don't actually "think"; they don't need to. They don't need to find relationships out of thin air. So I think this is something brand new that we're doing, that no one else does. Another key difference is that we are working explicitly at the symbolic level. By having a grammar, we have an explicit part-whole relationship. This is something the neural-net guys cannot do. (Hinton, I believe, has a paper on how, one day in the distant future, neural nets might be able to solve the part-whole relationship problem. By contrast, we've already solved it, more or less from day one.)
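Coming back to the French-flag toy problem above for a moment, here is a minimal sketch, using only numpy, of what the hand-hacked color filters plus the trivial "three bands in the right order" grammar might look like. The color thresholds, the minimum-area fraction and the synthetic test image are all invented for illustration; a draped or shaded flag would defeat this immediately, which is the point of wanting the filters to be learned instead.

    import numpy as np

    def looks_like_tricolour(img_bgr, min_frac=0.05):
        """Toy 'French flag' check: a blue, a white and a red band, left to right."""
        b = img_bgr[..., 0].astype(int)
        g = img_bgr[..., 1].astype(int)
        r = img_bgr[..., 2].astype(int)
        masks = {
            "blue":  (b > 150) & (g < 120) & (r < 100),
            "white": (b > 180) & (g > 180) & (r > 180),
            "red":   (r > 150) & (g < 120) & (b < 100),
        }
        cols = np.arange(img_bgr.shape[1])
        centers = {}
        for name, mask in masks.items():
            if mask.mean() < min_frac:        # each band must cover some area
                return False
            centers[name] = (mask.sum(axis=0) * cols).sum() / mask.sum()
        # the whole "grammar": blue left of white, white left of red
        return centers["blue"] < centers["white"] < centers["red"]

    if __name__ == "__main__":
        flag = np.zeros((60, 90, 3), dtype=np.uint8)   # synthetic flag image, BGR
        flag[:, :30] = (200, 50, 40)                   # blue band
        flag[:, 30:60] = (255, 255, 255)               # white band
        flag[:, 60:] = (40, 50, 200)                   # red band
        print(looks_like_tricolour(flag))              # True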
> >> We've also "solved" the "symbol grounding problem" -- from day one. This is another problem that AI researchers have been wringing their hands about, from the 1960s onwards. Our symbols are grounded from the start: our symbols are the filter sets, the grammatical dictionary entries, and we "know what they mean" because they work with explicit data.
> >>
> >> Another very old AI problem is the "frame problem", and I think that we've got that one licked, too, although this is a far more tenuous claim. The "frame problem" is one of selecting only those things that are relevant to a particular reasoning problem, and ignoring all of the rest. Well, hey: this is exactly what grammars do: they tell you exactly what is relevant, and they ignore the rest. The grammars have learned to ignore the background features that don't affect the current situation. But whatever... this gets abstract and can lead to an endless spill of words. I am much more interested in creating software that actually works.
> >>
> >> So ... that's it. What are the next steps? How can we do this?
> >>
> >> -- Linas
> >>
> >> --
> >> Patrick: Are they laughing at us?
> >> Sponge Bob: No, Patrick, they are laughing next to us.
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "He not busy being born is busy dying" -- Bob Dylan

--
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.
