James, that's great!

A next step would be to calculate some statistical characteristics of the 
Collection, typically:

- Size in bytes of the Collection
- Size in bytes of each Document
- Word count of the Collection (punctuation marks should count as words too)
- Word count of each Document (likewise)
- Wordlist of the Collection (one entry per occurring word)
- Wordlist of each Document (likewise)
- Vocabulary coverage of each Document as a percentage of the Collection 
vocabulary (maybe also the vocabulary unique to each Document)

The last item will tell us whether the coverage is evenly distributed over the 
different Documents. We might eliminate some of them from the list if they are 
outliers.
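
Below is a rough sketch of how these numbers could be computed in Python (the 
file names are just placeholders, and the tokeniser simply treats every 
punctuation mark as a word of its own, as proposed above):

    import re
    from collections import Counter

    # Placeholder file names -- one plain-text file per tale.
    DOCUMENTS = ["the_ugly_duckling.txt", "the_fir_tree.txt"]

    # A "word" is a run of letters/apostrophes or a single punctuation mark.
    TOKEN_RE = re.compile(r"[A-Za-z']+|[^\w\s]")

    def tokenize(text):
        return [t.lower() for t in TOKEN_RE.findall(text)]

    doc_stats = {}
    collection_counter = Counter()
    collection_bytes = 0

    for name in DOCUMENTS:
        with open(name, "rb") as f:
            raw = f.read()
        tokens = tokenize(raw.decode("utf-8"))
        doc_stats[name] = {
            "bytes": len(raw),              # size in bytes of the Document
            "word_count": len(tokens),      # word count, punctuation included
            "wordlist": Counter(tokens),    # wordlist of the Document
        }
        collection_bytes += len(raw)
        collection_counter.update(tokens)

    collection_vocab = set(collection_counter)

    # In how many Documents does each word occur? (for the "unique vocabulary")
    doc_freq = Counter()
    for stats in doc_stats.values():
        doc_freq.update(set(stats["wordlist"]))

    for name, stats in doc_stats.items():
        vocab = set(stats["wordlist"])
        stats["coverage_pct"] = 100.0 * len(vocab) / len(collection_vocab)
        stats["unique_vocab"] = sorted(w for w in vocab if doc_freq[w] == 1)

    print("Collection: %d bytes, %d words, %d distinct words" %
          (collection_bytes, sum(collection_counter.values()),
           len(collection_vocab)))
    for name, stats in sorted(doc_stats.items()):
        print("%s: %d bytes, %d words, %.1f%% of Collection vocabulary" %
              (name, stats["bytes"], stats["word_count"], stats["coverage_pct"]))

Documents whose coverage comes out far below the rest would be the candidates 
to drop.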

In the end we could write a script that gives each of the calculated items a 
descriptive name, casts it as a constant and generates an include file. That 
would make it easy to write the evaluation code later.
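
Continuing from the sketch above, something along these lines could generate 
the include file (the constant names and the output module name are, of 
course, just suggestions):

    def emit_constants(path, collection_bytes, collection_counter, doc_stats):
        """Write the computed statistics as named constants into a Python module."""
        lines = ["# Auto-generated corpus statistics -- do not edit by hand."]
        lines.append("COLLECTION_SIZE_BYTES = %d" % collection_bytes)
        lines.append("COLLECTION_WORD_COUNT = %d" % sum(collection_counter.values()))
        lines.append("COLLECTION_VOCAB_SIZE = %d" % len(collection_counter))
        for name, stats in sorted(doc_stats.items()):
            prefix = name.rsplit(".", 1)[0].upper()   # e.g. THE_UGLY_DUCKLING
            lines.append("%s_SIZE_BYTES = %d" % (prefix, stats["bytes"]))
            lines.append("%s_WORD_COUNT = %d" % (prefix, stats["word_count"]))
            lines.append("%s_COVERAGE_PCT = %.2f" % (prefix, stats["coverage_pct"]))
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")

    emit_constants("corpus_constants.py", collection_bytes,
                   collection_counter, doc_stats)

The evaluation code would then simply import corpus_constants and refer to the 
names.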

Francisco


On 28.08.2013, at 18:47, James Tauber wrote:

> I've actually moved the texts to a full-blown GitHub repo:
> 
> https://github.com/jtauber/nupic-texts
> 
> so feel free to log issues against it if other changes are necessary and/or 
> fork and do pull requests if you want to change/add anything.
> 
> James
> 
> 
> On Tue, Aug 27, 2013 at 1:54 PM, James Tauber <jtau...@jtauber.com> wrote:
> All done:
> 
> https://gist.github.com/jtauber/6347309
> 
> 
> 
> 
> On Tue, Aug 27, 2013 at 12:35 PM, James Tauber <jtau...@jtauber.com> wrote:
> yep, I'm working on it :-)
> 
> 
> On Tue, Aug 27, 2013 at 12:29 PM, Francisco Webber <f.web...@cept.at> wrote:
> Yes, James, that looks perfect.
> Great job!
> Now we need the other tales in the same format.
> 
> Francisco
> 
> On 27.08.2013, at 15:14, James Tauber wrote:
> 
>> Let me know if this is what you had in mind (just the ugly duckling):
>> 
>> https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
>> 
>> I put each paragraph on its own line and separated the sections (that 
>> formerly were separated by a row of asterisks) with a blank line. 
>> 
>> James
>> 
>> 
>> On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber 
>> <f.web...@cept.at> wrote:
>> James,
>> that's great!
>> I think some more preparation is necessary:
>> - All CRLFs should be removed, keeping one space after each full stop. (This 
>> makes it easier for most parsers.)
>> - Each line of asterisks should be replaced by a CRLF to mark the paragraphs. 
>> (We never know, but we might need the paragraph info at some point.)
>> - The file itself should be split into individual tales. (Whatever experiments 
>> we run, the results become more comparable if we can rerun them with 
>> different tales.)
>> - The titles should not be written in all caps. (A capital letter followed by 
>> a full stop is interpreted as an acronym or a middle initial rather than a 
>> sentence delimiter.)
>> 
>> Francisco
>> 
>> 
>> On 27.08.2013, at 00:22, James Tauber <jtau...@jtauber.com> wrote:
>> 
>>> I've removed the metadata, the vocab lists and the illustrations:
>>> 
>>> https://gist.github.com/jtauber/6347309
>>> 
>>> James
>>> 
>>> 
>>> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <jhawk...@numenta.org> wrote:
>>> I am sold on the kid’s story idea.  I looked at the link below and there is 
>>> a lot of metadata in this file.  It would have to be removed before feeding 
>>> it to the CLA.
>>> 
>>>  
>>> 
>>> My assumption is that we would need a CLA with more columns than the 
>>> standard 2048.  How many bits are in your word fingerprints?  Could we make 
>>> each bit a column and skip the SP?
>>> 
>>> Jeff
>>> 
>>>  
>>> 
>>> From: nupic [mailto:nupic-boun...@lists.numenta.org] On Behalf Of Francisco 
>>> Webber
>>> Sent: Monday, August 26, 2013 3:50 AM
>>> 
>>> 
>>> To: NuPIC general mailing list.
>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>> 
>>>  
>>> 
>>> Ian,
>>> 
>>> I also thought about something from the Gutenberg repository.
>>> 
>>> But I think we should start with something from the Kids Shelf.
>>> 
>>>  
>>> 
>>> There are several reasons in my opinion:
>>> 
>>>  
>>> 
>>> - We start experimentation with a full bag of unknown parameters, so 
>>> keeping the test material simple would allow us to detect the important 
>>> ones sooner. And it is quite some work to create a reliable evaluation 
>>> framework, so the size of the data set makes a difference.
>>> 
>>> - Keeping the text simple and short substantially reduces the overall 
>>> vocabulary. If we want people to also evaluate offline, matching 
>>> fingerprints can become a lengthy process without an efficient similarity 
>>> engine.
>>> 
>>> - Another reason is that we don't know how much information a given set of 
>>> columns (like the 2048 typically used) can absorb. In other words: what is 
>>> the optimal ratio between the size of the first layer of a text HTM and the 
>>> amount of text?
>>> 
>>> - Lastly, I believe that the order in which text is presented to the CLA 
>>> matters. After all, when humans learn by reading, they also progress from 
>>> simple to complex language. The amount of new vocabulary during training 
>>> should be relatively stable (the actual amount would probably be linked to 
>>> the ratio from my previous point).
>>> 
>>>  
>>> 
>>> So we should build progressively more complex training data sets, finally 
>>> ending up with "true" books like the ones you listed.
>>> 
>>>  
>>> 
>>> To start I would suggest something like:
>>> 
>>>  
>>> 
>>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
>>> 
>>> http://www.gutenberg.org/ebooks/7841
>>> 
>>>  
>>> 
>>> But there might still be better ones…
>>> 
>>>  
>>> 
>>> Francisco
>>> 
>>>  
>>> 
>>>  
>>> 
>>>  
>>> 
>>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>>> 
>>> I will make 3 suggestions. All are out of copyright, well known, 
>>> uncontroversial, and still taught in schools (at least in the US).
>>> 
>>>  
>>> 
>>> 1. Robinson Crusoe - Daniel Defoe
>>> 
>>>  
>>> 
>>> http://www.gutenberg.org/ebooks/521
>>> 
>>>  
>>> 
>>> 2. Great Expectations - Charles Dickens
>>> 
>>>  
>>> 
>>> http://www.gutenberg.org/ebooks/1400
>>> 
>>>  
>>> 
>>> 3. The Time Machine - H.G. Wells
>>> 
>>>  
>>> 
>>> http://www.gutenberg.org/ebooks/35
>>> 
>>>  
>>> 
>>> Ian
>>> 
>>>  
>>> 
>>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <f.web...@cept.at> wrote:
>>> 
>>> For those who don't want to use the API and for evaluation purposes, I 
>>> would propose that we choose some reference text and I convert it into a 
>>> sequence of SDRs. This file could be used for training.
>>> 
>>> I would also generate a list of all words contained in the text, together 
>>> with their SDRs, to be used as a conversion table.
>>> 
>>> As a simple test measure we could feed a sequence of SDRs into a trained 
>>> network and see if the HTM makes the right prediction about the following 
>>> word(s). 
>>> 
>>> The last file to produce for a complete framework would be a list of, let's 
>>> say, 100 word sequences with their correct continuations.
>>> 
>>> The word sequences could be, for example, the beginnings of phrases with 
>>> more than n words (n being the number of steps that the CLA can predict 
>>> ahead).
>>> 
>>> This could be the beginning of a measuring set-up that allows us to compare 
>>> different CLA implementation flavors.
>>> 
>>>  
>>> 
>>> Any suggestions for a text to choose?
>>> 
>>>  
>>> 
>>> Francisco
>>> 
>>>  
>>> 
>>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>>> 
>>>  
>>> 
>>> Very cool, Francisco. Here is where you can get cept API credentials: 
>>> https://cept.3scale.net/signup
>>> 
>>> 
>>> 
>>> ---------
>>> 
>>> Matt Taylor
>>> 
>>> OS Community Flag-Bearer
>>> 
>>> Numenta
>>> 
>>>  
>>> 
>>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <f.web...@cept.at> wrote:
>>> 
>>> Just a short post scriptum:
>>> 
>>> The public version of our API doesn't actually contain the generic 
>>> conversion function. But if people from the HTM community want to 
>>> experiment, just click the "Request for Beta-Program" button and I will 
>>> upgrade your accounts manually.
>>> 
>>> Francisco
>>> 
>>> 
>>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>>> 
>>> > Jeff,
>>> > I thought about this already.
>>> > We have a REST API where you can send a word in and get the SDR back, and 
>>> > vice versa.
>>> > I invite all who want to experiment to try it out.
>>> > You just need to get credentials at our website: www.cept.at.
>>> >
>>> > In the mid-term, it would be cool to create some sort of evaluation set 
>>> > that could be used to measure progress while improving the CLA.
>>> >
>>> > We are continuously improving our Retina but the version that is 
>>> > currently online works pretty well already.
>>> >
>>> > I hope that helps.
>>> >
>>> > Francisco
>>> >
>>> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>>> >
>>> >> Francisco,
>>> >> Your work is very cool.  Do you think it would be possible to make your 
>>> >> word SDRs (or a sufficient subset of them) available for experimentation?  
>>> >> I imagine there would be interest in the NuPIC community in training a 
>>> >> CLA on text using your word SDRs.  You might get some useful results more 
>>> >> quickly.  You could do this under a research-only license or something 
>>> >> like that.
>>> >> Jeff
>>> >>
>>> >> -----Original Message-----
>>> >> From: nupic [mailto:nupic-boun...@lists.numenta.org] On Behalf Of 
>>> >> Francisco
>>> >> Webber
>>> >> Sent: Wednesday, August 21, 2013 1:01 PM
>>> >> To: NuPIC general mailing list.
>>> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>> >>
>>> >> Hello,
>>> >> I am one of the founders of CEPT Systems and the lead researcher on our 
>>> >> Retina algorithm.
>>> >>
>>> >> We have developed a method to represent words by a bitmap pattern (a text 
>>> >> sensor) that captures most of their "lexical semantics". Our word-SDRs 
>>> >> fulfill all the requirements for "good" HTM input data:
>>> >>
>>> >> - Words with similar meaning "look" similar
>>> >> - If you drop random bits from the representation, the semantics remain 
>>> >> intact
>>> >> - Only a small number (up to 5%) of bits are set in a word-SDR
>>> >> - Every bit in the representation corresponds to a specific semantic 
>>> >> feature of the language used
>>> >> - The Retina (the sensory organ for an HTM) can be trained on any language
>>> >> - The Retina training process is fully unsupervised.
>>> >>
>>> >> We have found that word-SDRs by themselves (without using any HTM yet) 
>>> >> can improve on many NLP problems that are only poorly solved by the 
>>> >> traditional statistical approaches.
>>> >> We use the SDRs to:
>>> >> - Create fingerprints of text documents, which allows us to compare them 
>>> >> for semantic similarity using simple (Euclidean) similarity measures
>>> >> - Automatically detect polysemy and disambiguate multiple meanings
>>> >> - Characterize any text with context terms for automatic search-engine 
>>> >> query expansion.
>>> >>
>>> >> We hope to successfully link up our Retina to an HTM network to go beyond 
>>> >> lexical semantics into the field of "grammatical semantics".
>>> >> This would hopefully lead to improved abstracting, conversation, 
>>> >> question-answering and translation systems.
>>> >>
>>> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>>> >>
>>> >> I am interested in any form of cooperation to apply HTM technology to 
>>> >> text.
>>> >>
>>> >> Francisco
>>> >>
>>> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>>> >>
>>> >>>
>>> >>> Hello.
>>> >>>
>>> >>> Like many of you here, I am pretty new to HTM technology.
>>> >>>
>>> >>> I am a researcher in Brazil and I am going to start my PhD program soon. 
>>> >>> My field of interest is NLP and the extraction of knowledge from text. I 
>>> >>> am thinking of using the ideas behind the Memory Prediction Framework to 
>>> >>> investigate semantic information retrieval from the Web and answering 
>>> >>> questions in natural language. I intend to use the HTM implementation as 
>>> >>> the basis for this.
>>> >>>
>>> >>> I would appreciate it a lot if someone could answer some questions:
>>> >>>
>>> >>> - Is there any research related to HTM and NLP? Could you point me to it?
>>> >>>
>>> >>> - Is HTM well suited to this problem? Could it learn, without 
>>> >>> supervision, the grammar of a language, or just help with some aspects 
>>> >>> such as Named Entity Recognition?
>>> >>>
>>> >>>
>>> >>>
>>> >>> Regards,
>>> >>>
>>> >>> Christian
>>> >>>
>>> >>>
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> James Tauber
>>> http://jtauber.com/
>>> @jtauber on Twitter
>> 
>> 
>> 
>> 
>> -- 
>> James Tauber
>> http://jtauber.com/
>> @jtauber on Twitter
> 
> 
> 
> 
> -- 
> James Tauber
> http://jtauber.com/
> @jtauber on Twitter

_______________________________________________
nupic mailing list
nupic@lists.numenta.org
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
