Hi all,
The API converts everything to lowercase anyway, so there is no need for the
extra effort.
All punctuation is converted into a pattern too, so there is no need to filter
it out either.
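
For anyone who still wants to see how case folding changes the TOKEN and TYPE
counts in the quoted table (the list of unique words there contains both
'duckling' and 'Duckling'), here is a minimal Python sketch. It is neither the
API's implementation nor James's script: the function names tokenize and
token_type_counts and the regular expression are assumptions made up for this
illustration.

```python
import re


def tokenize(text, lowercase=True):
    """Split text into word tokens, dropping surrounding punctuation.

    Hypothetical helper for illustration only.
    """
    if lowercase:
        text = text.lower()
    # Match runs of letters/digits, allowing internal apostrophes ("don't").
    # A leading quote mark, as in 'No, is not part of the match.
    return re.findall(r"[^\W_]+(?:'[^\W_]+)*", text)


def token_type_counts(text, lowercase=True):
    """Return (number of tokens, number of distinct types)."""
    tokens = tokenize(text, lowercase)
    return len(tokens), len(set(tokens))


sample = "'No,' said the Duckling. Soon the duckling swam."
print(token_type_counts(sample, lowercase=False))  # (8, 7)
print(token_type_counts(sample, lowercase=True))   # (8, 6)
```

With case preserved, 'Duckling' and 'duckling' count as two types; after
lowercasing they collapse into one, so the type count drops while the token
count stays the same.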

Francisco

On 12.09.2013, at 23:29, Marek Otahal wrote:

> I have not yet read everything in this thread, so sorry if this is completely
> wrong. How about going all lowercase, and removing non-alphanumeric
> characters if necessary?
> 
> regards, breznak
> 
> 
> On Thu, Aug 29, 2013 at 5:05 AM, James Tauber <[email protected]> wrote:
> I pushed a Python 3 script to my repo that does a bunch of calculations.
> 
> Here are the results of that script. Let me know what you'd like to see next. 
> I can already see one problem in the tokenization where 'No was not split.
> 
> FILENAME                            BYTES TOKEN  TYPE
> -----------------------------------------------------
> 01_the_ugly_duckling.txt             3143   782   207
> 02_the_little_pine_tree.txt          1635   388   104
> 03_the_little_match_girl.txt         3065   701   218
> 04_little_red_riding_hood.txt        2168   509   159
> 05_the_apples_of_idun.txt            3923   934   244
> 06_how_thor_got_the_hammer.txt       5857  1373   318
> 07_the_hammer_lost_and_found.txt     4260  1010   258
> 08_the_story_of_the_sheep.txt        1265   304   129
> 09_the_good_ship_argo.txt             889   209   107
> 10_jason_and_the_harpies.txt         2187   495   173
> 11_the_brass_bulls.txt               3487   786   239
> 12_jason_and_the_dragon.txt          1867   427   180
> -----------------------------------------------------
> COLLECTION                          33746  7918   882
> 
> Unique to 01_the_ugly_duckling.txt:
> {'spring', 'hid', 'summer', 'dears', 'lake', 'swans', 'own', 'eggs', 'lay', 
> 'still', 'eating', 'pond', 'duckling', 'yard', 'Soon', 'egg', 'bug', 'cat', 
> 'bushes', 'does', 'those', 'fun', 'winter', 'duck', 'Ugly', 'lovely', 
> 'woman', 'hens', 'swim', 'While', 'swan', 'sang', 'nest', 'corner', 'bread', 
> 'Splash', 'because', 'mother', 'growl', 'ducks', 'An', 'Let', 'noise', 'hen', 
> 'ducklings', 'Only', 'Stay', 'Duckling'}
> 
> _______________________________________________
> nupic mailing list
> [email protected]
> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
