I intend to try it out as well, but I haven't seen any data dump from the
CEPT API yet (I know people were talking about it in the other thread). Let
me know if you make any progress on this! If you're interested, we could
even work on it together.


On Thu, Sep 12, 2013 at 1:56 PM, Matthew Taylor <m...@numenta.org> wrote:

> Looks like Chetan is using the nupic-texts project as a data source for
> his "linguist" project.
>
>
> http://lists.numenta.org/pipermail/nupic_lists.numenta.org/2013-August/001040.html
> https://github.com/chetan51/linguist
>
> It uses a category encoder on each letter and predicts the next letter
> within a sequence. To predict words in sequence instead of letters, I'm
> going to see how easy it is to get word SDRs out of the CEPT API and feed
> them into nupic. I'm not sure how much time I'll be able to spend on this,
> but if I get anything worthwhile, I'll put it up on GitHub.
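
For anyone new to the idea, here is a rough sketch of what encoding one letter per category could look like. It is plain Python and does not use nupic's encoder classes; the alphabet, the bit width, and the names encode_letter and encode_text are all my own assumptions for illustration, not what linguist actually does.

```python
# Hypothetical illustration only; this is NOT the nupic category encoder API.
import string

ALPHABET = string.ascii_lowercase + " "  # assumed set of 27 categories
BITS_PER_CATEGORY = 3                    # assumed width of each category's block


def encode_letter(ch):
    """Return a 0/1 list in which one contiguous block is active for ch."""
    index = ALPHABET.index(ch.lower())
    bits = [0] * (len(ALPHABET) * BITS_PER_CATEGORY)
    start = index * BITS_PER_CATEGORY
    for i in range(start, start + BITS_PER_CATEGORY):
        bits[i] = 1
    return bits


def encode_text(text):
    """Encode each known character of text in order, skipping anything else."""
    return [encode_letter(ch) for ch in text if ch.lower() in ALPHABET]
```

Each letter gets its own non-overlapping block of bits, so no two letters look similar to the model. Word SDRs would differ precisely in that related words are expected to share bits.
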
>
> Has anyone done something like this with the CEPT API yet?
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
>  Numenta
>
>
> On Thu, Aug 29, 2013 at 2:01 PM, Francisco Webber <f.web...@cept.at> wrote:
>
>> James,
>> This looks great!
>> Yes, the apostrophe tricked the parser…
>> We could simply edit this in the source file and recompute the stats. In
>> terms of punctuation, we should keep only the comma, full stop, question
>> mark, and exclamation mark. Semicolons should be changed into commas.
>> Even if they do not appear in these texts, it's always good to make the
>> code fail-safe in this respect.
>> Apostrophes and quotes are usually a mess. There must be something like
>> 250 character codes in UTF-8 that produce some character that can behave
>> like quotes…
>> It would be best to replace them with blanks. Contracted forms like
>> haven't should be taken as one word, including the apostrophe.
>> I will work out the next steps over the weekend and post my results on
>> Monday.
>> I also need to make some adaptations to the Retina for the CLA link-up.
>> Texts 2 and 9 seem to fall out of line a bit… maybe we should use them
>> only for unseen-text tests, as they have few exclusive words. I will
>> give it another thought…
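
The punctuation rules above could be sketched roughly like this. It is only one possible reading: the list of apostrophe-like characters is a short assumed sample (nowhere near the full set mentioned), the names APOSTROPHE_LIKE, TOKEN_RE and tokenize are made up for illustration, and this is not the script from James's repo.

```python
import re

# Characters treated as a plain apostrophe before tokenizing.
# Assumption: a small sample only; real text may need many more code points.
APOSTROPHE_LIKE = "\u2018\u2019\u0060\u00b4"

# A word is a run of letters, optionally joined by word-internal apostrophes
# (haven't, Brok's). The four punctuation marks to keep are tokens too.
# Everything else, including all quote characters, acts as a blank.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|[,.?!]")


def tokenize(text):
    """Split text into word and punctuation tokens under the rules above."""
    for ch in APOSTROPHE_LIKE:
        text = text.replace(ch, "'")
    text = text.replace(";", ",")  # semicolon becomes comma
    return TOKEN_RE.findall(text)
```

With this, tokenize("'No, I haven't; really!") gives ['No', ',', 'I', "haven't", ',', 'really', '!'], so the stray 'No case is handled. One side effect to be aware of: a trailing possessive apostrophe as in dwarfs' is dropped, and digits and hyphens are ignored entirely.
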
>>
>> Thanks for your support.
>>
>> Francisco
>>
>>
>> On 29.08.2013, at 05:05, James Tauber wrote:
>>
>> I pushed a Python 3 script to my repo that does a bunch of calculations.
>>
>> Here are the results of that script. Let me know what you'd like to see
>> next. I can already see one problem in the tokenization where 'No was
>> not split.
>>
>> FILENAME                            BYTES TOKEN  TYPE
>> -----------------------------------------------------
>> 01_the_ugly_duckling.txt             3143   782   207
>> 02_the_little_pine_tree.txt          1635   388   104
>> 03_the_little_match_girl.txt         3065   701   218
>> 04_little_red_riding_hood.txt        2168   509   159
>> 05_the_apples_of_idun.txt            3923   934   244
>> 06_how_thor_got_the_hammer.txt       5857  1373   318
>> 07_the_hammer_lost_and_found.txt     4260  1010   258
>> 08_the_story_of_the_sheep.txt        1265   304   129
>> 09_the_good_ship_argo.txt             889   209   107
>> 10_jason_and_the_harpies.txt         2187   495   173
>> 11_the_brass_bulls.txt               3487   786   239
>> 12_jason_and_the_dragon.txt          1867   427   180
>> -----------------------------------------------------
>> COLLECTION                          33746  7918   882
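
For reference, the three columns could be computed along these lines. This is a sketch under my own assumptions (a simple word regex, case-sensitive types, UTF-8 byte counts), not the actual script; WORD_RE and collection_stats are illustrative names. Note that the COLLECTION row sums bytes and tokens, but its type count is the size of the combined vocabulary, which is why 882 is far less than the sum of the per-file types.

```python
import re

# Assumed word pattern: letters with optional word-internal apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def collection_stats(texts):
    """Map each name in texts to (bytes, tokens, types), plus a COLLECTION row.

    texts maps a file name to that file's full text.
    """
    rows = {}
    vocabulary = set()
    total_bytes = 0
    total_tokens = 0
    for name, text in texts.items():
        tokens = WORD_RE.findall(text)
        n_bytes = len(text.encode("utf-8"))
        rows[name] = (n_bytes, len(tokens), len(set(tokens)))
        total_bytes += n_bytes
        total_tokens += len(tokens)
        vocabulary.update(tokens)  # types are counted once across all files
    rows["COLLECTION"] = (total_bytes, total_tokens, len(vocabulary))
    return rows
```
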
>>
>> Unique to 01_the_ugly_duckling.txt:
>> {'spring', 'hid', 'summer', 'dears', 'lake', 'swans', 'own', 'eggs',
>> 'lay', 'still', 'eating', 'pond', 'duckling', 'yard', 'Soon', 'egg', 'bug',
>> 'cat', 'bushes', 'does', 'those', 'fun', 'winter', 'duck', 'Ugly',
>> 'lovely', 'woman', 'hens', 'swim', 'While', 'swan', 'sang', 'nest',
>> 'corner', 'bread', 'Splash', 'because', 'mother', 'growl', 'ducks', 'An',
>> 'Let', 'noise', 'hen', 'ducklings', 'Only', 'Stay', 'Duckling'}
>>
>> Unique to 02_the_little_pine_tree.txt:
>> {'Tree', 'broken', 'bag', 'Pine', 'needles', "'No", 'green', 'Night',
>> 'pine', 'nor', 'glass', 'Again'}
>>
>> Unique to 03_the_little_match_girl.txt:
>> {'dead', 'money', 'another', 'bunch', 'star', 'death', 'step', 'matches',
>> 'O', 'papa', 'candle', 'Very', 'goes', "mama's", 'name', 'Match',
>> 'cooking', 'smelled', 'falls', 'more', 'stars', 'frozen', 'stove',
>> 'slippers', 'even', 'whip', 'froze', 'dying', 'running', 'curly', 'sweet',
>> 'match', 'houses', 'knife', 'rags', 'sell', 'herself', 'pile', 'snow',
>> 'lights', 'dish', 'buy', 'dishes', 'roast', 'Girl', 'apron', 'fork', 'Her',
>> 'street', 'bare', 'God', 'cloth', 'windows', 'year', 'lot', 'heaven',
>> 'Gretchen', 'room', 'colder', 'candles', 'Christmas'}
>>
>> Unique to 04_little_red_riding_hood.txt:
>> {'live', 'tapped', 'string', 'Pull', 'dear', 'pick', 'cap', 'hunter',
>> 'mill', 'Does', 'hug', 'open', 'voice', 'stopped', 'wood', "grandma's",
>> 'Red', 'Thank', 'Look', 'butter', 'Hood', 'Mama', 'lady', 'soft', 'red',
>> 'six', 'May', 'scream', 'ears', 'basket', 'Riding', 'hood', 'mama', 'wolf'}
>>
>> Unique to 05_the_apples_of_idun.txt:
>> {'minute', 'walls', 'beautiful', "eagle's", 'pale', 'stuck', 'breath',
>> 'Apples', 'stayed', 'pole', 'field', 'against', 'Idun', 'bumped', 'nut',
>> 'share', 'talking', 'Day', 'feathers', 'supper', 'changed', 'story',
>> 'apples', 'box', 'Those', 'ribs', 'cross', 'fast', 'eagle', 'blazed',
>> 'Please', 'gate', 'Once', 'gates', 'end', 'liked', 'cook', 'enough',
>> 'please', 'putting', 'meat', 'cattle', 'upon', 'journey', 'Bring', 'four'}
>>
>> Unique to 06_how_thor_got_the_hammer.txt:
>> {'say', 'else', 'along', 'lying', 'such', 'ring', 'Sif', 'pocket',
>> 'shining', 'pay', 'proud', 'than', "Brok's", 'mischief', 'dwarfs',
>> "dwarfs'", 'miss', 'getting', 'misses', 'blood', 'stop', 'mark', 'Did',
>> 'answer', 'same', "wife's", 'bellows', 'throw', 'dwarf', 'neck', 'Brok',
>> 'Sindre', 'pig', 'beads', 'touch', 'touching', 'fold', 'pigskin',
>> 'wonderful', 'hurried', 'Odin', 'spear', 'lump', 'crown', 'horse',
>> 'showed', 'Each', 'forehead', 'crying', 'busy', 'blow', 'Pretty', 'backs',
>> 'yet', 'working', 'crooked', 'nice', 'thumb', "Loki's", 'Their', 'burning',
>> "Sif's", 'standing', 'brush', 'cutting', 'journeys', 'sorry', 'worked',
>> 'brother', 'Blow', 'cannot', 'says', 'without', 'wait', 'Somebody',
>> 'tricks', 'Got', 'blowing', 'spoiled', 'anywhere'}
>>
>> Unique to 07_the_hammer_lost_and_found.txt:
>> {'while', "Giants'", 'taken', 'planned', 'laugh', 'everything', 'eight',
>> "Freyja's", 'salmon', 'Get', 'brought', "bride's", 'drank', 'servant',
>> 'Found', 'Giant', 'sing', 'lap', 'shook', 'lifted', 'Any', 'necklace',
>> 'dogs', 'whole', "Giant's", 'Thrym', 'clothes', 'thirsty', 'eaten',
>> 'barrels', 'dress', 'bite', 'comes', 'miles', 'kiss', 'Do', 'Put',
>> "hasn't", 'makes', 'braided', 'Go', "Thrym's", 'Old', 'nights', 'Freyja',
>> 'tore', 'play', 'floor', 'sit', "won't", 'collars', 'shone', 'others',
>> 'deep', 'drink', 'dressed', 'shine', 'Lost', 'bride', 'vail', 'buried',
>> 'Still', 'talked', 'mead', 'whirled', 'wagon'}
>>
>> Unique to 08_the_story_of_the_sheep.txt:
>> {'bad', 'Long', 'sister', 'lose', 'catch', 'Hold', 'Story', 'Helle',
>> 'ride', 'garden', 'sheep', 'played', 'boy', 'First', 'ago', 'nailed',
>> 'Sheep', 'pat', 'clouds', 'loved', "sheep's", 'tame', 'dizzy', 'sky',
>> 'Every', 'tight'}
>>
>> Unique to 09_the_good_ship_argo.txt:
>> {'creek', 'Ship', 'wade', 'strings', 'rained', 'shoe', 'To', 'wild',
>> 'bridge', 'party', 'invited'}
>>
>> Unique to 10_jason_and_the_harpies.txt:
>> {'dove', 'friends', 'wings', 'apart', 'thanked', 'close', 'skin',
>> 'drive', 'These', 'drowned', 'helping', 'bye', 'boat', 'past', 'scratched',
>> 'hill', 'blind', 'Row', 'waterlike', 'moved', 'sailed', 'fishes',
>> 'together', 'break', 'row', 'food', 'Harpies', 'On', 'icebergs'}
>>
>> Unique to 11_the_brass_bulls.txt:
>> {'Bulls', 'knees', 'should', 'Rub', 'burn', 'princess', 'plant',
>> 'pushed', 'planted', 'tied', 'face', 'slowly', 'seats', 'stronger', 'well',
>> 'place', 'wheat', 'smoke', 'hold', 'chains', 'kicked', 'run', 'plow',
>> 'Brass', "bulls'", 'marble', 'creeks', 'noses', 'snakes', 'mouths',
>> 'sword', 'noon', 'plowed', 'plants', 'boys', 'stone', 'evening', 'stall',
>> 'lie', 'heads', 'Early', 'larger', 'Nothing'}
>>
>> Unique to 12_jason_and_the_dragon.txt:
>> {'eats', 'sleeps', 'became', 'father', 'mouth', 'yourself', 'died',
>> 'nail', 'His', "Jason's", 'fond', 'ships', 'stick', 'cakes', 'nine',
>> 'Dragon'}
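
The "unique to" sets above can be derived by counting in how many files each type occurs and keeping the types with a count of one. A minimal sketch, assuming the per-file vocabularies are already available as sets (unique_types is an illustrative name, not taken from the repo):

```python
from collections import Counter


def unique_types(vocabularies):
    """Return, per file, the types that occur in no other file.

    vocabularies maps a file name to the set of types found in that file.
    """
    document_frequency = Counter()
    for types in vocabularies.values():
        document_frequency.update(types)  # each set adds 1 per type it holds
    return {
        name: {t for t in types if document_frequency[t] == 1}
        for name, types in vocabularies.items()
    }
```

The size of each resulting set is a quick way to see which texts contribute few exclusive words, as texts 2 and 9 do here.
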
>>
>>
>>  _______________________________________________
>> nupic mailing list
>> nupic@lists.numenta.org
>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
