Again, there's a misunderstanding here. Yes, PCA is not composable, and sheaves are; I'm using sheaves. The reason I looked at PCA was to use a thresholded, sparse PCA for CLUSTERING, and NOT for similarity; compositionality does not matter for clustering. It's really a completely different concept, quite unrelated, which just happens to have the three letters PCA in it. Perhaps I should have called it a "sigmoid-thresholded eigenvector classifier" instead, because that's what I'm actually trying to talk about.
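To make that name concrete: below is a minimal Python sketch of what a sigmoid-thresholded eigenvector classifier could look like. The component count, sigmoid gain, and cutoff are illustrative guesses, not settings from the actual pipeline.

```python
# Sketch: classify words by sigmoid-thresholding their projections onto
# the leading principal components of a word-by-feature count matrix.
# Hypothetical parameters: n_components, gain, and cut are placeholders.
import numpy as np

def sigmoid(x, gain=1.0):
    return 1.0 / (1.0 + np.exp(-gain * x))

def eigenvector_classes(counts, n_components=8, gain=4.0, cut=0.5):
    """counts: (n_words, n_features) array of observation counts."""
    X = counts - counts.mean(axis=0)        # center each feature column
    cov = X.T @ X / (X.shape[0] - 1)        # sample covariance of features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, -n_components:]        # leading principal directions
    proj = X @ top                          # project each word onto them
    # Squash each coordinate through a sigmoid, then threshold: every word
    # gets a binary code, one bit per principal direction.
    return (sigmoid(proj, gain) > cut).astype(int)
```

Words that receive the same binary code fall into the same class, so the codes act as discrete cluster labels rather than as similarity scores.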
--linas

On Mon, Jun 19, 2017 at 2:11 PM, Hugo Latapie (hlatapie) <[email protected]> wrote:

Hi everyone... I have a lot of ramping-up to do here.

Following this interesting thread, thinking about optimal clustering of various distributed representations led me to this paper:

Ferrone, Lorenzo, and Fabio Massimo Zanzotto. "Symbolic, Distributed and Distributional Representations for Natural Language Processing in the Era of Deep Learning: a Survey." arXiv preprint arXiv:1702.00764 (2017).

It emphasizes the importance of semantic composability, as we were discussing, Ben. The authors also show that PCA is not composable in this sense, and that random indexing solves some of these problems when compacting distributional semantic vectors.

Holographic reduced representations look promising.

BTW, if we can help with some of the grunge work of creating that Jupyter notebook (or a suitable equivalent), Karthik may be able to help, of course with your guidance.

Cheers,

Hugo

On Mon, Jun 19, 2017 at 10:24 AM, Linas Vepstas <[email protected]> wrote, re: [Link Grammar] Cosine similarity, PCA, sheaves (algebraic topology):

OK, well, some quick comments:

-- Sparsity is a good thing, not a bad thing. It's one of the big indicators that we're on the right track: instead of seeing that everything is like everything else, we're seeing that only one out of every 2^15 or 2^16 possibilities is actually being observed! So that's very, very good! The sparser, the better! Seriously, this alone is a major achievement, I think.

-- The reason I was trumpeting about hooking up EvaluationLinks to R was precisely because this opens up many avenues of data analysis. Right now, the data is trapped in the AtomSpace, and it's a lot of work, for me, to get it out to where I can apply interesting algorithms to it. (Personally, I have no plans to do anything with R. Just that making this hookup is the right thing to do, in principle.)

The urgent problem for me is not that I'm lacking algorithms; the problem is that I don't have any easy, effective, quick way of applying the algos to the data. There's no Jupyter notebook where you punch the monkey and your data gets analyzed. That is where all my time, all the heavy lifting, is going.
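To make the sparsity figure concrete, here is a minimal stand-alone sketch of the two measurements in question: the fraction of (word, feature) pairs ever observed, and the cosine similarity between two sparse count vectors. It assumes the counts have already been exported from the AtomSpace into a SciPy sparse matrix, and the toy numbers are made up for illustration.

```python
# Sketch: sparsity and cosine similarity on word-by-feature count vectors.
# The counts below are invented toy data; real counts would be exported
# from the AtomSpace first (that export step is not shown here).
import numpy as np
from scipy.sparse import csr_matrix

rows = csr_matrix(np.array([
    [3, 0, 0, 1, 0, 0, 0, 2],   # toy counts for "he"
    [2, 0, 0, 1, 0, 0, 0, 3],   # toy counts for "there"
]))

# Fraction of (word, feature) cells ever observed: the "1 in 2^15" figure.
sparsity = rows.nnz / (rows.shape[0] * rows.shape[1])
print(f"observed fraction: {sparsity:.4f}")

def cosine(a, b):
    """Cosine similarity between two sparse row vectors."""
    num = a.multiply(b).sum()
    den = np.sqrt(a.multiply(a).sum()) * np.sqrt(b.multiply(b).sum())
    return num / den

print("cos(he, there) =", cosine(rows[0], rows[1]))
```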
-- Don't get hung up on point samples.

"He was going to..." "There was going to..."

There was a tool house, plenty of
There isn’t any half
There is more, the sky is
There was almost no breeze.
There he had
There wasn’t a thing said about comin’
There was a light in the kitchen, but Mrs.
There was a rasping brush against the tall, dry swamp
There was a hasty consultation, and this program was
There was a bob of the flat boat
There was time only for
There was a crash,
There was the low hum of propellers, and the whirr of the
There was no rear entrance leading
There was a final quick dash down the gully road,
There came a time when the
There ye said it.
There came to him faintly the sound of a voice
There may be and probably is some exaggeration
There he took a beautiful little mulatto slave as his
There flew the severed hand and dripped the bleeding heart.
There must sometimes be a physical
There remains then a kind of life of
There are principally three things moving us to choice and three

He had not yet seen the valuable
He was slowly
He may be able to do what you want, and he may not. You may
He lit a cigar, looked at his watch, examined Bud in the
He was heating and
He stammered
He looked from one lad to the other.
He answered angrily in the same language.
He was restless and irritable, and every now
He had passed beyond the
He was at least three hundred
He could not even make out the lines of the fences beneath
He had thoughtlessly started in
He was over a field of corn in the shock.
He had surely gone a mile! In the still night air came a
He fancied he heard the soft lap of water just ahead. That
He had slept late, worn
He was a small man, with
He ain’t no gypsy, an’ he ain’t no
He was dead, too, then. The place was yours because
He knew he had enough fuel to carry
He meant to return to the fair, give the advertised exhibition
He returned to his waiting friends

On Mon, Jun 19, 2017 at 11:23 AM, Ben Goertzel <[email protected]> wrote:

On Tue, Jun 20, 2017 at 12:07 AM, Linas Vepstas <[email protected]> wrote:
> So again, this is not where the action is. What we need is accurate,
> high-performance, non-ad-hoc clustering. I guess I'm ready to accept
> agglomerative clustering, if there's nothing else that's simpler or better.

We don't need just clustering; we need clustering together with sense disambiguation...

I believe that we will get better clustering (and better clustering-coupled-with-disambiguation) results out of the vectors AdaGram produces than out of the sparse vectors you're now trying to cluster... But this is an empirical issue; we can try both and see...

As for the corpus size: I mean, in a bigger corpus, "He" and "There" (with caps) would also not come out as so similar...

But yes, the list of "very similar word pairs" you give is cool and impressive...

It would be interesting to try EM clustering, or maybe a variant like this one, on your feature vectors:

https://cran.r-project.org/web/packages/HDclassif/index.html

We will try this on features we export ourselves, if we can get the language-learning pipeline working correctly... (I know we could just take the feature vectors you have produced and play with them, but I would really like us to be able to get the language-learning pipeline working adequately in Hong Kong -- obviously, as you know, this is an important project and we can't have it in "it works on my machine" status...)

I would like to try EM and variants on both your raw feature vectors and on the reduced/disambiguated feature vectors that modified-AdaGram spits out based on your MST parse trees... It will be interesting to compare the clusters obtained from these two approaches...

-- Ben

Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the boundary, I am the peak." -- Alexander Scriabin
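For reference, here is a short sketch of what such an EM clustering run over exported feature vectors could look like. scikit-learn's GaussianMixture (a plain Gaussian mixture fit by expectation-maximization) stands in for the R HDclassif package linked above, which fits a more specialized high-dimensional variant; the cluster count, reduced dimensionality, and covariance type are placeholder guesses.

```python
# Sketch: EM clustering of exported word-feature vectors, using a plain
# Gaussian mixture as a stand-in for HDclassif's high-dimensional EM.
# n_clusters, n_dims, and the covariance type are placeholder guesses.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

def em_cluster(vectors, n_clusters=50, n_dims=100, seed=0):
    """vectors: (n_words, n_features) array, with n_features > n_dims."""
    # Reduce dimension first: raw disjunct vectors are too sparse and
    # high-dimensional for a Gaussian fit to be well-conditioned.
    reduced = TruncatedSVD(n_components=n_dims,
                           random_state=seed).fit_transform(vectors)
    gmm = GaussianMixture(n_components=n_clusters,
                          covariance_type="diag",  # keep the fit tractable
                          random_state=seed)
    hard_labels = gmm.fit_predict(reduced)
    # Soft responsibilities: a word can belong fractionally to several
    # clusters, i.e. carry several senses.
    soft_labels = gmm.predict_proba(reduced)
    return hard_labels, soft_labels
```

The soft memberships are the relevant output for the point about clustering coupled with sense disambiguation, since an ambiguous word shows up with weight spread across more than one cluster.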
