Hi Ivan,

You're completely correct that we're increasing the amount of information coming in, but that is precisely what we want.
If you just want to treat words as abstract labelled symbols, then we already have methods for that. In that case all you're doing is learning the sequence of objects, without any "understanding" of *why* the sequence is as it is. You could just as easily be learning sequences of notes (and not learning anything about the rules or "quality" of music), or, as we used to watch on BBC's Generation Game, a sequence of consumer goods passing by on a conveyor belt. Each successive object gives the CLA no information about what the next object should be.

When you add all this semantic information using the CEPT data, you're learning what *kinds of words* fit together in a sequence. Humans learn language based on categorisation, inference and generalisation, so the stream of words must contain structural information which allows this to happen. In a simple English sentence, for example, the first object is usually a noun (the subject of the sentence), and the second is usually a verb which agrees in number with the subject. If the verb is transitive, it is usually followed by another noun, the object of the sentence. Variations on this pattern exist, but each one operates under its own very strict set of rules, again providing a structure which is contained in the (generalised) grammar of the language, the details of which are contained in the experienced streams of words.

As we learn English (or any language), we learn the semantic category and grammatical settings for each word, along with its meaning and sound. We also learn the rules for fitting these things together in grammatically correct and semantically sensible ways. We can generalise and innovate within the constraints of these rules, and we can detect and error-correct when the rules are broken. The language itself is learned in this holistic, holographic way, extracted only from what we hear and learn to think and produce (there is likely some hardwired "Universal Grammar" in our brains, probably implemented in the way that various regions are wired together). Pinker's *The Language Instinct* is a great survey of these ideas.

So, this exercise is a first set of steps in exploring how we can use the CLA to interface with this semantic structure of natural language. We're going to see how well a single-layer, single-region, small CLA deals with the structural and semantic information about sentences which the CEPT encoding provides.

By the way, this is also a test of the usefulness of the CEPT concept and implementation, which itself was motivated by Francisco's awareness of the work Jeff et al. have been doing on the CLA. We'll be seeing how well the CEPT data embodies the kind of semantic information we believe is needed for use in NLP. CEPT have already begun adapting their software in response to feedback from the NuPIC/Grok side, and they're clearly very interested in using our work and the hackathon to improve the CEPT SDR system and its power as an NLP encoding scheme.

In terms of capacity, there's not much to worry about, IMHO. The CEPT SDRs are 16k bitmaps, and a 2k CLA should be able to deal with that. The input here is really an emulation of a (hypothetical) high-level language-processing region being fed with a lower- (but still high-) level data stream in the language of the neocortex: an SDR. The 16k of data is really a set of sub-SDRs, each of which encodes a different semantic aspect of the underlying word.
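To put some rough numbers on this, here's a toy Python sketch. To be clear, the sizes and the fake "word" construction below are my own illustrative assumptions, not the actual CEPT encoding or any NuPIC API. It builds two 16,384-bit word SDRs which share some semantic bits, shows that their intersection is exactly the semantic-similarity signal Ivan is worried about, and then counts how many distinct ~2%-sparse codes a 2,048-column layer can represent:

    import math
    import random

    import numpy as np

    # Toy stand-ins for CEPT word fingerprints: 16,384-bit SDRs.
    # Real CEPT SDRs are derived from text corpora; here we just fake
    # two words which share part of their meaning, to show that
    # semantic similarity appears directly as bit overlap.
    SDR_SIZE = 16384        # CEPT fingerprint width (16k bitmap)
    ACTIVE_BITS = 328       # ~2% sparsity - my assumption, not CEPT's figure

    rng = random.Random(42)
    shared_bits = rng.sample(range(SDR_SIZE), 120)  # bits for the shared meaning
    shared_set = set(shared_bits)
    free_bits = [i for i in range(SDR_SIZE) if i not in shared_set]

    def toy_word_sdr():
        """A fake word SDR: shared semantic bits plus word-specific bits."""
        specific = rng.sample(free_bits, ACTIVE_BITS - len(shared_bits))
        sdr = np.zeros(SDR_SIZE, dtype=np.uint8)
        sdr[shared_bits] = 1
        sdr[specific] = 1
        return sdr

    dog, wolf = toy_word_sdr(), toy_word_sdr()

    # The "words intersection" Ivan mentions is exactly this overlap,
    # and it's what lets the CLA generalise over kinds of words:
    overlap = int(np.count_nonzero(dog & wolf))
    print(f"overlap: {overlap} of {ACTIVE_BITS} active bits")

    # Capacity of a 2k CLA layer: with 2,048 columns and ~2% (about 40)
    # active at once, the number of distinct codes is C(2048, 40):
    print(f"distinct codes in a 2k layer: {math.comb(2048, 40):.3e}")

That last count comes out around 10^84, so the code space of even a small CLA dwarfs any vocabulary we could feed it; the practical limit is the learning, not the representation.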
This composition of sub-SDRs is similar (structurally) to the hierarchical organisation which we know exists in the brain for handling language, and indeed for most of the corticocortical hierarchy. We'll find out how well this works when we start the exploration, and we can easily increase the size of the CLA if we want to try that. Personally, I think that a 2k CLA has an enormous capacity - much higher than you would think - because of the way it efficiently deploys connections and predictor cells based solely on the actual content and structure in the data.

Regards,
Fergal Byrne

On Sat, Sep 14, 2013 at 7:52 AM, Ivan Sytenko <[email protected]> wrote:

> Jeff,
>
> I see a methodological problem here.
>
> CEPT representation of the word increases the size of the code with
> semantic information. This increases the size of the text presentation.
>
> The semantic information is not required in the task of remembering. This
> information will overload your system.
>
> On the other hand, you increase the words intersection. This strongly
> complicates the semantic analysis too.
>
> Do not get me wrong. I want to save people time and computers time only.
>
> Please do not explain if I'm wrong.
>
> Ivan

--
Fergal Byrne

ExamSupport/StudyHub
[email protected]
http://www.examsupport.ie

Dublin in Bits
[email protected]
http://www.inbits.com
+353 83 4214179

Formerly of Adnet
[email protected]
http://www.adnet.ie
