OK so I got curious over the weekend and decided to do some testing. The short of it is: don't bother with trees! Without being able to run native code extensions that can create efficient nodes, there is too much overhead involved in creating a new Python object for each node.
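For reference, the kind of dict-backed trie I was testing looks roughly like this (a minimal sketch, not the exact code I ran). Every character of every unique prefix gets its own fresh Python dict, which is where the per-node overhead comes from:

```python
# Minimal trie sketch: one Python dict per node, so memory cost grows
# with every unique prefix character, not just with every word.
END = object()  # sentinel key marking the end of a complete word

def trie_insert(root, word, value):
    node = root
    for ch in word:
        # setdefault allocates a brand new dict for each unseen prefix char
        node = node.setdefault(ch, {})
    node[END] = value

def trie_lookup(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get(END)

root = {}
trie_insert(root, "wiki", 1)
trie_insert(root, "wikipedia", 2)
print(trie_lookup(root, "wiki"))       # 1
print(trie_lookup(root, "wikipedia"))  # 2
print(trie_lookup(root, "wik"))        # None (prefix only, not a word)
```

Multiply those tiny dicts by millions of prefix characters and you can see why the process blew up.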
I ran a test with a 4,336,347-word dictionary (a word file from processing a Wikipedia dump that includes mixed casing, spelling mistakes and made-up words like user names), and the Python dictionary weighed in at 100,663,432 bytes, including a 12-byte integer value for each key. Meanwhile, I eventually had to terminate the process when I tried loading the same data into a simple trie that used a Python dictionary to hold the branches of each node. Then again, assuming you probably don't need to keep case information, a 50 MB dictionary would load pretty fast and solve your lookups with minimal code.

-regards Shane

On Feb 5, 12:53 am, Norlesh <[email protected]> wrote:
> Sofia, a radix tree builds a tree of all the prefixes common between
> words and then attaches a leaf node with the new suffix when the word
> doesn't fit in the existing structure. The upshot of this is that as your
> word count increases, the storage cost per word decreases. As for
> actual performance numbers, data is scarce, but this Stack Overflow post
> (http://goo.gl/TpPAK) mentions getting a trie with 1.1 million words
> down to 139M (implemented in C#).
>
> My understanding is we have around a 280M limit on App Engine
> instances, and I don't know about Portuguese, but according to Wikipedia
> there's only around 600,000 words in the English language. Besides, you
> could always just use it to cache however many of the most popular
> words 60M or so would get you, then fall back to the datastore
> to check less common/unprocessed words (if you did this then you
> could fit even more words into a read-only dictionary using a DAWG).
>
> Check out the Wikipedia articles on Radix Tree, Trie and Patricia Trie;
> they're all variations on the theme.
> Unfortunately you will probably
> have to roll your own implementation, since all the Python libraries I
> have seen use compiled code for performance, but they're far from the
> scariest thing to implement, and if you have access to 'The Art of
> Computer Programming' there's a chapter on them in there.
>
> -regards, Shane
>
> On Feb 4, 1:06 am, sofia <[email protected]> wrote:
> >
> > @Nick I'm in Portugal, so for now I don't have access to the Prediction API.
> > Maybe when there's worldwide access I'll switch, but for now it's a no-go.
> > Any estimates on when it will be opened to the rest of the world? :)
> >
> > @Shane Not sure how this could be implemented. There could be
> > thousands/millions of classified words which would have to be checked against
> > the new ones, so how would I load them all into memory on App Engine without
> > hitting CPU/memory limits? I'm not familiar with radix trees, so I could be
> > missing something.
> >
> > @Albert Good to know!

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
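For anyone who wants to experiment anyway, the radix tree idea from my quoted message can be sketched like this. It's a rough illustration, not a tuned implementation; the node layout (`{"edges": ..., "value": ...}`) and function names are my own invention. Edges carry string fragments, and an edge is only split when a new word diverges partway along it, so common prefixes are stored once:

```python
# Rough radix-tree sketch: edges are labelled with string fragments;
# splitting happens lazily, so shared prefixes are stored only once.

def _common_prefix_len(a, b):
    # Length of the longest common prefix of two strings.
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

def radix_insert(node, word, value):
    if not word:
        node["value"] = value
        return
    for frag, child in list(node["edges"].items()):
        k = _common_prefix_len(frag, word)
        if k == 0:
            continue
        if k < len(frag):
            # Word diverges partway along this edge: split it, keeping the
            # shared prefix and pushing the old remainder down a level.
            mid = {"edges": {frag[k:]: child}, "value": None}
            del node["edges"][frag]
            node["edges"][frag[:k]] = mid
            radix_insert(mid, word[k:], value)
        else:
            # Whole edge label matched; continue with the leftover suffix.
            radix_insert(child, word[k:], value)
        return
    # No edge shares a prefix: attach the remaining suffix as a new leaf.
    node["edges"][word] = {"edges": {}, "value": value}

def radix_lookup(node, word):
    if not word:
        return node["value"]
    for frag, child in node["edges"].items():
        if word.startswith(frag):
            return radix_lookup(child, word[len(frag):])
    return None

root = {"edges": {}, "value": None}
for i, w in enumerate(["test", "tester", "team"]):
    radix_insert(root, w, i)
print(radix_lookup(root, "tester"))  # 1
print(radix_lookup(root, "team"))    # 2
print(radix_lookup(root, "tea"))     # None (not a stored word)
```

After the three inserts the tree holds "te" once, with "st", "am" and "st" → "er" hanging off it, which is the storage-per-word saving described above. You'd still be paying Python object overhead per node, just for far fewer nodes than a plain trie.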
