I intend to try it out as well, but I haven't seen any data dump from the CEPT API yet (I know people were talking about it in the other thread). Let me know if you make any progress on this! If you're interested, we could even work on it together.
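In case it is useful as a starting point, here is a minimal sketch of the punctuation rules Francisco describes in the quoted message below (keep comma, full stop, question mark and exclamation mark; turn semicolons into commas; blank out quote-like characters but keep the apostrophe inside words such as haven't). It is not James's script. The names `is_quote_like`, `normalize` and `tokenize` are made up for illustration, and emitting the kept punctuation marks as separate tokens is an assumption on my part.

```python
import re
import unicodedata

# Punctuation Francisco suggests keeping; everything else becomes a blank.
KEPT_PUNCTUATION = ",.?!"


def is_quote_like(ch):
    """True for ASCII quotes/backtick and Unicode initial/final quote marks."""
    return ch in "'\"`" or unicodedata.category(ch) in ("Pi", "Pf")


def normalize(text):
    """Apply the punctuation rules character by character."""
    text = text.replace(";", ",")  # semicolon is changed into comma
    chars = []
    for i, ch in enumerate(text):
        if is_quote_like(ch):
            # Keep an apostrophe only when it sits between two letters
            # (haven't); leading or trailing quotes become blanks.
            inside_word = (
                0 < i < len(text) - 1
                and text[i - 1].isalpha()
                and text[i + 1].isalpha()
            )
            chars.append("'" if inside_word else " ")
        elif ch.isalnum() or ch.isspace() or ch in KEPT_PUNCTUATION:
            chars.append(ch)
        else:
            chars.append(" ")
    return "".join(chars)


# A word (optionally with internal apostrophes) or one kept punctuation mark.
TOKEN_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*|[,.?!]")


def tokenize(text):
    return TOKEN_RE.findall(normalize(text))


print(tokenize("'No, I haven't; said the tree."))
# ['No', ',', 'I', "haven't", ',', 'said', 'the', 'tree', '.']
```

One side effect of this rule: trailing possessive apostrophes (dwarfs', bulls') are stripped as well, so those would merge with the plain plural forms.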
On Thu, Sep 12, 2013 at 1:56 PM, Matthew Taylor <m...@numenta.org> wrote:
> Looks like Chetan is using the nupic-texts project as a data source for
> his "linguist" project.
>
> http://lists.numenta.org/pipermail/nupic_lists.numenta.org/2013-August/001040.html
> https://github.com/chetan51/linguist
>
> It is using a category encoder on each letter, and predicts the next
> letter within a sequence. In order to predict words in sequence instead of
> letters, I'm going to try to see how easy it will be to get word SDRs out
> of the CEPT API and input into nupic instead of letters. I'm not sure how
> much time I'll be able to spend on this, but if I get anything worthwhile,
> I'll put it up on github.
>
> Has anyone done something like this with the CEPT API yet?
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
>
> On Thu, Aug 29, 2013 at 2:01 PM, Francisco Webber <f.web...@cept.at> wrote:
>
>> James,
>> This looks great!
>> Yes, the apostrophe tricked the parser …
>> We could simply edit this in the source file and recompute the stats. In
>> terms of punctuation we should just keep comma, full stop, question mark,
>> exclamation mark. Semicolon should be changed into comma.
>> Even if they might not appear in these texts, it's always good to make the
>> code fail-safe in this respect.
>> Apostrophes and quotes are usually a mess. There must be something like
>> 250 character codes in UTF-8 that produce some character that can behave
>> like quotes…
>> Best would be to replace them with blanks. Words that are in the reduced
>> form like haven't should be taken as one word including the apostrophe.
>> I will work out the next steps over the weekend and post my achievements
>> on Monday.
>> I also need to make some adaptations on the Retina for the CLA link-up.
>> Looks like text 2 and 9 drop out the line a bit… maybe we should use them
>> just for doing unseen text tests, as they have few exclusive words. I will
>> give it another thought….
>>
>> Thanks for your support.
>>
>> Francisco
>>
>>
>> On 29.08.2013, at 05:05, James Tauber wrote:
>>
>> I pushed a Python 3 script to my repo that does a bunch of calculations.
>>
>> Here are the results of that script. Let me know what you'd like to see
>> next. I can already see one problem in the tokenization where 'No was
>> not split.
>>
>> FILENAME                           BYTES  TOKEN  TYPE
>> -----------------------------------------------------
>> 01_the_ugly_duckling.txt            3143    782   207
>> 02_the_little_pine_tree.txt         1635    388   104
>> 03_the_little_match_girl.txt        3065    701   218
>> 04_little_red_riding_hood.txt       2168    509   159
>> 05_the_apples_of_idun.txt           3923    934   244
>> 06_how_thor_got_the_hammer.txt      5857   1373   318
>> 07_the_hammer_lost_and_found.txt    4260   1010   258
>> 08_the_story_of_the_sheep.txt       1265    304   129
>> 09_the_good_ship_argo.txt            889    209   107
>> 10_jason_and_the_harpies.txt        2187    495   173
>> 11_the_brass_bulls.txt              3487    786   239
>> 12_jason_and_the_dragon.txt         1867    427   180
>> -----------------------------------------------------
>> COLLECTION                         33746   7918   882
>>
>> Unique to 01_the_ugly_duckling.txt:
>> {'spring', 'hid', 'summer', 'dears', 'lake', 'swans', 'own', 'eggs',
>> 'lay', 'still', 'eating', 'pond', 'duckling', 'yard', 'Soon', 'egg', 'bug',
>> 'cat', 'bushes', 'does', 'those', 'fun', 'winter', 'duck', 'Ugly',
>> 'lovely', 'woman', 'hens', 'swim', 'While', 'swan', 'sang', 'nest',
>> 'corner', 'bread', 'Splash', 'because', 'mother', 'growl', 'ducks', 'An',
>> 'Let', 'noise', 'hen', 'ducklings', 'Only', 'Stay', 'Duckling'}
>>
>> Unique to 02_the_little_pine_tree.txt:
>> {'Tree', 'broken', 'bag', 'Pine', 'needles', "'No", 'green', 'Night',
>> 'pine', 'nor', 'glass', 'Again'}
>>
>> Unique to 03_the_little_match_girl.txt:
>> {'dead', 'money', 'another', 'bunch', 'star', 'death', 'step', 'matches',
>> 'O', 'papa', 'candle', 'Very', 'goes', "mama's", 'name', 'Match',
>> 'cooking', 'smelled', 'falls', 'more', 'stars', 'frozen', 'stove',
>> 'slippers', 'even', 'whip', 'froze', 'dying',
>> 'running', 'curly', 'sweet',
>> 'match', 'houses', 'knife', 'rags', 'sell', 'herself', 'pile', 'snow',
>> 'lights', 'dish', 'buy', 'dishes', 'roast', 'Girl', 'apron', 'fork', 'Her',
>> 'street', 'bare', 'God', 'cloth', 'windows', 'year', 'lot', 'heaven',
>> 'Gretchen', 'room', 'colder', 'candles', 'Christmas'}
>>
>> Unique to 04_little_red_riding_hood.txt:
>> {'live', 'tapped', 'string', 'Pull', 'dear', 'pick', 'cap', 'hunter',
>> 'mill', 'Does', 'hug', 'open', 'voice', 'stopped', 'wood', "grandma's",
>> 'Red', 'Thank', 'Look', 'butter', 'Hood', 'Mama', 'lady', 'soft', 'red',
>> 'six', 'May', 'scream', 'ears', 'basket', 'Riding', 'hood', 'mama', 'wolf'}
>>
>> Unique to 05_the_apples_of_idun.txt:
>> {'minute', 'walls', 'beautiful', "eagle's", 'pale', 'stuck', 'breath',
>> 'Apples', 'stayed', 'pole', 'field', 'against', 'Idun', 'bumped', 'nut',
>> 'share', 'talking', 'Day', 'feathers', 'supper', 'changed', 'story',
>> 'apples', 'box', 'Those', 'ribs', 'cross', 'fast', 'eagle', 'blazed',
>> 'Please', 'gate', 'Once', 'gates', 'end', 'liked', 'cook', 'enough',
>> 'please', 'putting', 'meat', 'cattle', 'upon', 'journey', 'Bring', 'four'}
>>
>> Unique to 06_how_thor_got_the_hammer.txt:
>> {'say', 'else', 'along', 'lying', 'such', 'ring', 'Sif', 'pocket',
>> 'shining', 'pay', 'proud', 'than', "Brok's", 'mischief', 'dwarfs',
>> "dwarfs'", 'miss', 'getting', 'misses', 'blood', 'stop', 'mark', 'Did',
>> 'answer', 'same', "wife's", 'bellows', 'throw', 'dwarf', 'neck', 'Brok',
>> 'Sindre', 'pig', 'beads', 'touch', 'touching', 'fold', 'pigskin',
>> 'wonderful', 'hurried', 'Odin', 'spear', 'lump', 'crown', 'horse',
>> 'showed', 'Each', 'forehead', 'crying', 'busy', 'blow', 'Pretty', 'backs',
>> 'yet', 'working', 'crooked', 'nice', 'thumb', "Loki's", 'Their', 'burning',
>> "Sif's", 'standing', 'brush', 'cutting', 'journeys', 'sorry', 'worked',
>> 'brother', 'Blow', 'cannot', 'says', 'without', 'wait', 'Somebody',
>> 'tricks', 'Got', 'blowing', 'spoiled', 'anywhere'}
>>
>> Unique to 07_the_hammer_lost_and_found.txt:
>> {'while', "Giants'", 'taken', 'planned', 'laugh', 'everything', 'eight',
>> "Freyja's", 'salmon', 'Get', 'brought', "bride's", 'drank', 'servant',
>> 'Found', 'Giant', 'sing', 'lap', 'shook', 'lifted', 'Any', 'necklace',
>> 'dogs', 'whole', "Giant's", 'Thrym', 'clothes', 'thirsty', 'eaten',
>> 'barrels', 'dress', 'bite', 'comes', 'miles', 'kiss', 'Do', 'Put',
>> "hasn't", 'makes', 'braided', 'Go', "Thrym's", 'Old', 'nights', 'Freyja',
>> 'tore', 'play', 'floor', 'sit', "won't", 'collars', 'shone', 'others',
>> 'deep', 'drink', 'dressed', 'shine', 'Lost', 'bride', 'vail', 'buried',
>> 'Still', 'talked', 'mead', 'whirled', 'wagon'}
>>
>> Unique to 08_the_story_of_the_sheep.txt:
>> {'bad', 'Long', 'sister', 'lose', 'catch', 'Hold', 'Story', 'Helle',
>> 'ride', 'garden', 'sheep', 'played', 'boy', 'First', 'ago', 'nailed',
>> 'Sheep', 'pat', 'clouds', 'loved', "sheep's", 'tame', 'dizzy', 'sky',
>> 'Every', 'tight'}
>>
>> Unique to 09_the_good_ship_argo.txt:
>> {'creek', 'Ship', 'wade', 'strings', 'rained', 'shoe', 'To', 'wild',
>> 'bridge', 'party', 'invited'}
>>
>> Unique to 10_jason_and_the_harpies.txt:
>> {'dove', 'friends', 'wings', 'apart', 'thanked', 'close', 'skin',
>> 'drive', 'These', 'drowned', 'helping', 'bye', 'boat', 'past', 'scratched',
>> 'hill', 'blind', 'Row', 'waterlike', 'moved', 'sailed', 'fishes',
>> 'together', 'break', 'row', 'food', 'Harpies', 'On', 'icebergs'}
>>
>> Unique to 11_the_brass_bulls.txt:
>> {'Bulls', 'knees', 'should', 'Rub', 'burn', 'princess', 'plant',
>> 'pushed', 'planted', 'tied', 'face', 'slowly', 'seats', 'stronger', 'well',
>> 'place', 'wheat', 'smoke', 'hold', 'chains', 'kicked', 'run', 'plow',
>> 'Brass', "bulls'", 'marble', 'creeks', 'noses', 'snakes', 'mouths',
>> 'sword', 'noon', 'plowed', 'plants', 'boys', 'stone', 'evening', 'stall',
>> 'lie', 'heads', 'Early', 'larger', 'Nothing'}
>>
>> Unique to 12_jason_and_the_dragon.txt:
>> {'eats', 'sleeps', 'became', 'father', 'mouth',
>> 'yourself', 'died',
>> 'nail', 'His', "Jason's", 'fond', 'ships', 'stick', 'cakes', 'nine',
>> 'Dragon'}
>>
>>
>> _______________________________________________
>> nupic mailing list
>> nupic@lists.numenta.org
>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
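
The figures in James's quoted output (tokens, types, and the words unique to one file) come down to counting and a document-frequency check. A rough sketch of that calculation follows. It is not James's actual script; `corpus_stats`, the toy file names and the whitespace split are hypothetical stand-ins. Types are compared case-sensitively, which matches the quoted lists, where 'duckling' and 'Duckling' both appear.

```python
from collections import Counter


def corpus_stats(texts):
    """texts maps a filename to its list of tokens.

    Returns {filename: (token_count, type_count, unique_types)}, where
    unique_types are the types that occur in no other file.
    """
    types = {name: set(tokens) for name, tokens in texts.items()}
    # Count in how many files each type occurs.
    doc_freq = Counter(t for file_types in types.values() for t in file_types)
    stats = {}
    for name, tokens in texts.items():
        unique = {t for t in types[name] if doc_freq[t] == 1}
        stats[name] = (len(tokens), len(types[name]), unique)
    return stats


# Toy input, not the nupic-texts data.
texts = {
    "a.txt": "the duck swam in the pond".split(),
    "b.txt": "the wolf ran in the wood".split(),
}
for name, (n_tokens, n_types, unique) in corpus_stats(texts).items():
    print(name, n_tokens, n_types, sorted(unique))
# a.txt 6 5 ['duck', 'pond', 'swam']
# b.txt 6 5 ['ran', 'wolf', 'wood']
```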