I haven't read the whole thread yet, so sorry if this is completely off.
How about lowercasing everything and, if necessary, removing
non-alphanumeric characters?
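
A rough sketch of that idea, assuming Python 3 like the script in the repo.
The name normalize_tokens and the sample sentence are made up for
illustration and are not taken from James's code. It lowercases the text and
keeps only runs of ASCII letters and digits, so 'Duckling' and 'duckling'
would count as one type and the stray quote in 'No would be dropped:

```python
import re

# Hypothetical helper (not from the repo): lowercase the text, then keep
# only runs of ASCII letters and digits as tokens.
def normalize_tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Made-up sample sentence, only for illustration.
sample = "Soon the Ugly Duckling said, 'No! I am a duckling.'"
tokens = normalize_tokens(sample)
types = set(tokens)

print(tokens)
print(len(tokens), "tokens,", len(types), "types")
```

One caveat: this also splits contractions (don't becomes don and t), so the
character class may need an apostrophe rule if that matters for the counts.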

regards, breznak


On Thu, Aug 29, 2013 at 5:05 AM, James Tauber <[email protected]> wrote:

> I pushed a Python 3 script to my repo that does a bunch of calculations.
>
> Here are the results of that script. Let me know what you'd like to see
> next. I can already see one problem in the tokenization where 'No was not
> split.
>
> FILENAME                            BYTES TOKEN  TYPE
> -----------------------------------------------------
> 01_the_ugly_duckling.txt             3143   782   207
> 02_the_little_pine_tree.txt          1635   388   104
> 03_the_little_match_girl.txt         3065   701   218
> 04_little_red_riding_hood.txt        2168   509   159
> 05_the_apples_of_idun.txt            3923   934   244
> 06_how_thor_got_the_hammer.txt       5857  1373   318
> 07_the_hammer_lost_and_found.txt     4260  1010   258
> 08_the_story_of_the_sheep.txt        1265   304   129
> 09_the_good_ship_argo.txt             889   209   107
> 10_jason_and_the_harpies.txt         2187   495   173
> 11_the_brass_bulls.txt               3487   786   239
> 12_jason_and_the_dragon.txt          1867   427   180
> -----------------------------------------------------
> COLLECTION                          33746  7918   882
>
> Unique to 01_the_ugly_duckling.txt:
> {'spring', 'hid', 'summer', 'dears', 'lake', 'swans', 'own', 'eggs',
> 'lay', 'still', 'eating', 'pond', 'duckling', 'yard', 'Soon', 'egg', 'bug',
> 'cat', 'bushes', 'does', 'those', 'fun', 'winter', 'duck', 'Ugly',
> 'lovely', 'woman', 'hens', 'swim', 'While', 'swan', 'sang', 'nest',
> 'corner', 'bread', 'Splash', 'because', 'mother', 'growl', 'ducks', 'An',
> 'Let', 'noise', 'hen', 'ducklings', 'Only', 'Stay', 'Duckling'}
>
>
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
