Just to clarify: by “performance” I mean the rate of success on a given task, not necessarily speed.
Anyway: I’m afraid I can’t help with the visual processing part, then — I know nothing about using wavelets for image analysis, so I can’t really say anything further until the question of how this is supposed to work is fully sorted out.

On Friday, 17 September 2021 at 22:19:47 UTC+2 linas wrote:
> Hi Adrian,
>
> On Thu, Sep 16, 2021 at 3:02 PM Adrian Borucki <[email protected]> wrote:
> >
> > Yeah, this is clear to me now — the grammar learning part is kind of
> > a given, the real question is how well this “image predicate” learning can
> > go…
>
> Yes, that is a question. Based on current experience, I'll say "very
> far", or at least, "much farther than anyone else has gone". But that
> is rather speculative: it's based on what I've been learning in a 1D
> setting, and so any doubters or skeptics in the audience are
> justified in doubting. Basically, I'm proposing this because it looks
> promising.
>
> It does not help that I am just one person proposing a rather novel,
> radical, counter-cultural idea that flies in the face of conventional
> wisdom. I'm quite aware of this. My burden of proof is much higher,
> and I am trying to supply it as best as I can. Keep asking doubtful
> questions; this is maybe the most useful thing you can do right now.
> So I like how this is going. I'm only irritated that you can't read my
> mind :-)
>
> > This is a deep question, as no one is even sure why neural nets
> > themselves work so well.
>
> Well, again, this goes in a very different direction. Here, the
> reason that it would "work so well" is much more obvious: we ourselves
> are very good at spotting part-whole structure. Why, in just a few
> minutes, I can write down the obvious grammar for stop lights: glowing
> red above yellow above green, surrounded by a painted yellow or black
> harness. This is "obvious", and detecting this in images seems like it
> should be pretty easy.
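[Editor's note: the stop-light grammar described above can be written down concretely. The sketch below is purely illustrative — the relation names, part names, and attributes are my own assumptions, not taken from any OpenCog or grammar-learning code.]

```python
# A minimal sketch of the "obvious" stop-light grammar described above,
# expressed as symbolic part-whole links. All names are illustrative.

# Each rule links one part to another via a spatial relation.
STOPLIGHT_GRAMMAR = [
    ("red-light",    "above",         "yellow-light"),
    ("yellow-light", "above",         "green-light"),
    ("light-stack",  "surrounded-by", "harness"),
]

# Attribute constraints on the parts themselves.
PART_ATTRIBUTES = {
    "red-light":    {"color": "red",    "glows": True, "shape": "round"},
    "yellow-light": {"color": "yellow", "glows": True, "shape": "round"},
    "green-light":  {"color": "green",  "glows": True, "shape": "round"},
    "harness":      {"color": ("yellow", "black"), "painted": True},
}

def parts_related_by(relation):
    """Return all (part, other) pairs connected by the given relation."""
    return [(a, b) for (a, rel, b) in STOPLIGHT_GRAMMAR if rel == relation]

print(parts_related_by("above"))
# [('red-light', 'yellow-light'), ('yellow-light', 'green-light')]
```

The point of writing it this way is that every link is a labelled, inspectable fact ("red above yellow"), in contrast to weight vectors hidden inside a neural net.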
>
> This is in very sharp contrast to what neural nets do: you are right:
> when a neural net picks out a stoplight from an image, we have no idea
> how it is doing that. Perhaps somewhere in there are some weight
> vectors for red, yellow, green, but where are they? Where are they
> hiding? How do neural nets handle part-whole relationships? There is
> a paper (from Hinton?) stating that the part-whole relationship for
> neural nets is the grand challenge of the upcoming decades. By
> contrast, the part-whole relationship for grammars is "obvious".
>
> > What needs clarification is what the structure of this filter learning
> > would be — what is the algorithm, and what direct learning objective is it
> > given?
>
> The exact same algo as in the existing grammar-learning code, modulo
> needed tweaks. That code is debugged and works well. Getting it going
> on images does pose some serious challenges and open questions, but I
> think the general ideas survive.
>
> To recap that algo: given a set of inputs, one explores the parameter
> space and looks for high mutual-information correlations between
> pairs. Once high-MI pairs are discovered, the dataset is passed over a
> second time, this time creating maximal spanning trees. The tree
> edges are then cut to give the grammar components.
>
> The above yields extremely high-dimensional sparse vectors: a
> dimension of a million. By comparison, the highest dimension that
> neural nets go up to is about a thousand. So this is one of the big
> differences between the two approaches. The other, of course, is that
> the basis is labelled symbolically: you can see exactly which basis
> element attaches to what ("red above yellow", etc.)
>
> I'm currently working on the best ways to cluster these vectors into
> groupings. Early results look pretty good, but also show that these
> can be made much better. I can say much more on this.
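[Editor's note: the two-pass algorithm recapped above — score pairs by mutual information, then build maximal spanning trees, then cut edges — can be sketched roughly as below. Everything here is an illustrative assumption: the toy data, the naive MI normalization, and all function names are mine, not the actual grammar-learning code.]

```python
import math
from collections import Counter

def mutual_information(pair_counts, item_counts, n_pairs):
    """Naive MI(a, b) = log2( p(a,b) / (p(a) * p(b)) ) per observed pair."""
    n_items = sum(item_counts.values())
    mi = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n_pairs
        p_a = item_counts[a] / n_items
        p_b = item_counts[b] / n_items
        mi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return mi

def max_spanning_tree(nodes, weights):
    """Greedy Prim-style maximum spanning tree over weighted pairs."""
    nodes = list(nodes)
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        # Edges with exactly one endpoint inside the growing tree.
        crossing = [(w, a, b) for (a, b), w in weights.items()
                    if (a in in_tree) != (b in in_tree)]
        if not crossing:
            break  # disconnected graph
        w, a, b = max(crossing)
        edges.append((a, b, w))
        in_tree.update((a, b))
    return edges

# Pass 1: count co-occurring pairs and compute MI.
observations = [("red", "yellow"), ("yellow", "green"),
                ("red", "yellow"), ("red", "green")]
pair_counts = Counter(tuple(sorted(p)) for p in observations)
item_counts = Counter(x for p in observations for x in p)
mi = mutual_information(pair_counts, item_counts, len(observations))

# Pass 2: maximal spanning tree over the high-MI pairs. Cutting the
# lowest-weight tree edges would then yield the grammar components.
tree = max_spanning_tree(item_counts, mi)
```

In this toy run, the frequently co-occurring pair ("red", "yellow") gets the highest MI and is the first edge drawn into the tree.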
>
> > Like in the above example, where are all these filters and numerical
> > arguments even coming from?
>
> Randomly generated. With or without some sampling bias.
>
> > The numerical part is especially difficult, given that you seemingly
> > want to get some symbolic structure out of it.
>
> I don't understand this statement.
>
> > Going back to neural nets, the obvious problem is that if we make one
> > big neural “filter” then you don’t know what is going on inside —
>
> That's correct.
>
> > so the learning will be “shallower”. The question is how much of a
> > problem this really is.
>
> Well, the leading lights of the neural-net world claim that this is one
> of the grand challenges of the upcoming decades, and I won't argue with
> them about that.
>
> > Is learning down to the low-level filtering operations a viable approach
> > right now?
>
> Yes, absolutely, I think so. Obviously, I haven't convinced you yet.
> That is in part because I have not fully (clearly?) communicated the
> general idea just yet.
>
> > An interesting research question is if you could train a neural net that
> > can be “queried”, possibly in natural language or some simple formal one,
> > so that the system on top of it can learn to “extract” various statements
> > about an image out of it — so these predicates would be essentially hooked
> > to some queries that get sent to the underlying model.
>
> Sure, there are hundreds of people working on this, and they are
> making progress. You can go to seminars; new results are regularly
> presented on this.
>
> > Technically this probably falls somewhere in the Visual Question
> > Answering field… the challenge is that these models are trained to answer
> > questions about more abstract things like objects, not some low-level
> > features of the image.
>
> Yes. The lack of a symbolic structure in neural nets impedes desirable
> applications, such as symbolic reasoning.
>
> > The final big question is what can you really do after you get that
> > grammar?
> > What sort of inferences? How useful are they?
>
> Well, for starters, if the system recognizes a stop light, you can ask
> it: "How do you know it's a stop light?" and get an answer: "Because
> red above yellow above green." You can ask "And what else?" and get
> the answer "On a painted black or yellow background." -- "And what
> else?" "The colors glow in the dark." "And what else?" "They are
> round." "And what else?" "Only one comes on at a time." "And what
> else?" "The cycle time varies from 30 seconds to three minutes."
> "What is a cycle time?" "The parameter on the time filter by which
> repetition repeats." "What do you mean by round?" "The image area of
> the light is defined via a circular aperture filter."
>
> Good luck getting a neural net to answer even one of those questions,
> never mind all of them.
>
> > The key thing here is that if you, say, have a system that classifies
> > pictures, if it being built on top of this whole grammar and filter
> > learning pipeline means it doesn’t achieve competitive performance with
> > neural nets, then it’s difficult to see what the comparative advantage of
> > it is — beyond the obvious advantage of interpretability, but that won’t
> > save that solution if its performance is considerably lower.
>
> Really? The ability to do symbolic reasoning is valueless if it is
> slow? If the filter that recognizes that lights are round also
> appears in other grammatically meaningful situations, you can ask a
> question "What else is round?" "The sun, the moon, billiard balls,
> bowling balls, baseballs, basketballs." I think we are very, very far
> away from having a neural net do that kind of question answering. I
> think this is well within reach of grammatical systems.
>
> The association between symbols and the things they represent is the
> famous "symbol grounding problem", considered to be a very difficult,
> unsolved problem in AI. I'm sketching a technique that solves this
> problem.
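[Editor's note: the "what else is round?" query above is, in a grammatical system, just a lookup over shared symbolic filters. A toy sketch, with an illustrative object list that is my own, not learned output:]

```python
# If the same symbolic filter (e.g. "round", via a circular-aperture
# test) appears in the grammars of many objects, answering "what else
# is round?" is a simple shared-feature lookup.

OBJECT_FILTERS = {
    "stop-light lens": {"round", "glows"},
    "the sun":         {"round", "glows"},
    "the moon":        {"round"},
    "billiard ball":   {"round"},
    "bowling ball":    {"round"},
    "traffic harness": {"painted"},
}

def what_else_is(feature, known):
    """Objects sharing `feature`, excluding the one already known."""
    return sorted(obj for obj, feats in OBJECT_FILTERS.items()
                  if feature in feats and obj != known)

print(what_else_is("round", "stop-light lens"))
# ['billiard ball', 'bowling ball', 'the moon', 'the sun']
```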
> I think this is unique in the history of AI research. I don't
> see that anyone else has ever proposed a plausible solution to the
> symbol grounding problem.
>
> > Well, the problem is not really with grammars, which can definitely be
> > useful, but if that “filter sequence” part works poorly then it will
> > bottleneck the performance of the entire system.
>
> Learning it, or running it once learned? Clearly, running it can be
> super-fast .. even 1980s-era DSPs did image processing quite well.
> Even single-threaded CPUs have no particular problem; these days we
> have multi-core CPUs and oodles of GPUs.
>
> The learning algo is .. something else. There are two steps. Step one:
> can we get it to work, at any speed? (I think we can.) Step two: can we
> get it to work fast? (Who knows -- compare to deep learning, which
> took decades of basic research spanning hundreds of PhD theses before
> it started running fast. You and I and whatever fan-base might
> materialize are not going to replicate a few thousand man-years of
> basic research into performance.)
>
> > If that low-level layer outputs garbage, then all the upper layers get
> > garbage, and we know what happens when you have garbage inputs in this
> > field...
>
> Don't feed it garbage!
>
> --linas
>
> --
> Patrick: Are they laughing at us?
> Sponge Bob: No, Patrick, they are laughing next to us.

--
You received this message because you are subscribed to the Google Groups "opencog" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/959e9352-cc5b-481d-9a85-a4fd0a587578n%40googlegroups.com.
