Hi, I'm about to rework an old program, and I am pondering what data structure to use. Basically, I want to store an index of fixed-length words (q-grams) with positions, i.e. given a word, I can quickly find its position in the data.
The data set may be large (gigabytes would be nice), and I also want the option of storing a sparse set of the words in the data (i.e. every other word, every third, etc). I *think* I want to use a Map from words to positions, words packed to fit in Int(32|64|eger) (depending on word size), and positions as Int(32|64). I've used a similar approach with FiniteMaps previously, and it worked nicely, but for smaller data sets. However: Using a 'Map key [pos]' will eat a lot of space for the lists. 'Map key (UArray Int pos)' will be more efficient, but every insertion and deletion will destroy and create a new array -- and I would like the Map to live outside a monad (does there even exist an (IO|ST)Map?) I've considered lists of small(ish -- cache line compatible, probably) arrays. Another possibility is a suffix array, and a Map from words to regions in it. How will a hashtable work in this setting? Any thoughts? -kzm PS: it seems from a recent thread in comp.lang.functional (see <[EMAIL PROTECTED]>) that array performance has improved in GHC6.4. Impressively so. -- If I haven't seen further, it is by standing in the footprints of giants _______________________________________________ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users