Concordances are fun!

> This also hits a particular itch of mine, namely concordance generation

Looks not unlike a KWIC generator -- Key Word In Context indexing. Look up
modules for that! (Or the ACM Algorithms.) Uri's comments about the fit of
your suggestions with Ack are right on -- your 1.2.3. ideas should be logged
as Enhancement Requests on http://code.google.com/p/ack/issues/list .

> My (not very good) solution to enabling search across line boundaries
> was to slurp the files wholesale. slurping a file incurs the
> risk of running into a large file that will consume all your memory.

Works for even novel-sized files on today's machines, and anything larger
will have been input as chapter or volume files. Uri's buffering technique
-- criticized by Charles for bypassing charset patching -- would also solve
this for you.

> In NLP it's common to have large multidocument files, mostly as a quick
> and dirty way to avoid some of the filesystem costs of opening many, many
> small files, so slurping files is more likely to run into memory issues.

How large is large? If you bunch small files and keep large files single, it
should be OK on any modern machine, unless you're running Vista on a
"Designed for Win NT" Pentium 1.

> I think the idea of managing a buffer is very promising,

See Charles's discussion on-list, as the Unicode compatibility of buffering
impacts your intended use!

> And, lastly, I see in my notes that doing word context is around 120x slower
> than doing character context (and I tried a bunch of ways of doing both), but
> it may very well be that there are better ways...

I fear trying to do char, word, and line context in one code path will undo
the optimization we're trying to do.

[EMAIL PROTECTED] @ $DayJob

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
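For anyone who hasn't met KWIC before, here is a minimal sketch of the idea in
Perl. The sub name `kwic_lines` and the `$width` parameter are mine, purely
illustrative -- real CPAN modules do this with far more polish:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# KWIC sketch: for each occurrence of $keyword in $text, return a line
# showing a fixed-width character window on either side, with the
# keyword aligned in a bracketed column.
sub kwic_lines {
    my ($text, $keyword, $width) = @_;
    my @hits;
    while ($text =~ /\Q$keyword\E/g) {
        my $pos   = pos($text) - length($keyword);   # start of this hit
        my $start = $pos - $width;
        $start = 0 if $start < 0;
        my $left  = substr($text, $start, $pos - $start);
        my $right = substr($text, $pos + length($keyword), $width);
        s/\n/ /g for ($left, $right);   # flatten newlines in the window
        push @hits, sprintf("%*s[%s]%s", $width, $left, $keyword, $right);
    }
    return @hits;
}
```

Calling `kwic_lines("the cat sat on the mat with a cat", "cat", 8)` gives two
lines, each with `[cat]` in a fixed column so the hits line up vertically.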
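For reference, slurping in Perl is just a matter of undefining the input
record separator; once the whole file is one string, a single regex can match
across line boundaries. A sketch (the sub name `slurp` is mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read an entire file into one scalar so regexes can span newlines.
# Assumes the file fits comfortably in memory (fine for novel-sized text).
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "can't open $path: $!";
    local $/;              # undef $/: one <$fh> read returns everything
    my $text = <$fh>;
    close $fh;
    return $text;
}
```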
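The buffering idea, as I understand it (this is my reconstruction, not Uri's
actual code), is to read fixed-size chunks and carry a tail of the previous
chunk forward, so a match can straddle a chunk boundary without slurping the
whole file. Note Charles's caveat baked in here: `sysread` on a `:raw` handle
bypasses PerlIO encoding layers, so any charset handling is on you:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count matches of $regex in $path without slurping: append each chunk to
# a buffer, count matches that start before the overlap tail, then keep
# only the tail. $overlap must be at least the longest possible match,
# or boundary-straddling matches will be lost.
sub grep_buffered {
    my ($path, $regex, $chunk_size, $overlap) = @_;
    open my $fh, '<:raw', $path or die "can't open $path: $!";
    my ($buf, $count) = ('', 0);
    while (sysread($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        my $keep = length($buf) > $overlap ? length($buf) - $overlap : 0;
        while ($buf =~ /$regex/g) {
            last if $-[0] >= $keep;   # defer matches starting in the tail
            $count++;
        }
        $buf = substr($buf, $keep);   # carry the tail into the next round
    }
    $count++ while $buf =~ /$regex/g; # matches left in the final tail
    close $fh;
    return $count;
}
```

Counted matches always start before the cut point, so their start bytes are
discarded before the next round and nothing is double-counted.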
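On the 120x figure: I can't speak to the poster's exact code, but the shape of
the cost difference is easy to see. Character context is one `substr` per hit;
word context has to tokenize around the hit. An illustrative sketch of both
(sub names and parameters are mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Character context: $n chars on each side of the match at $pos/$len.
# Constant-time substring extraction, no scanning.
sub char_context {
    my ($text, $pos, $len, $n) = @_;
    my $start = $pos - $n;
    $start = 0 if $start < 0;
    return substr($text, $start, $pos - $start),
           substr($text, $pos + $len, $n);
}

# Word context: $n words on each side. Splitting on whitespace means
# scanning text proportional to the distance to the hit -- one source
# of the big slowdown relative to char_context.
sub word_context {
    my ($text, $pos, $len, $n) = @_;
    my @left  = split ' ', substr($text, 0, $pos);
    my @right = split ' ', substr($text, $pos + $len);
    splice(@left, 0, @left - $n) if @left  > $n;  # keep last $n words
    splice(@right, $n)           if @right > $n;  # keep first $n words
    return "@left", "@right";
}
```

A smarter word-context version would only tokenize a bounded window around the
hit rather than the whole string, which is one of the "better ways" hinted at.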

