Concordances are fun!

> This also hits a particular itch of mine, namely concordance
> generation

This looks not unlike a KWIC generator -- Key Word In Context indexing.
Look up modules for that! (Or ACM Algorithms.)
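For flavor, the core of KWIC is tiny. Here's a quick Python sketch of
the idea (function name and window width are mine, not from any
published module):

```python
def kwic(text, keyword, width=20):
    """Yield (left, keyword, right) context triples for each
    case-insensitive occurrence of `keyword` in `text`."""
    lower = text.lower()
    key = keyword.lower()
    start = 0
    while (i := lower.find(key, start)) != -1:
        left = text[max(0, i - width):i]
        right = text[i + len(keyword):i + len(keyword) + width]
        yield left, text[i:i + len(keyword)], right
        start = i + 1

# align the keyword column, KWIC-style
for left, kw, right in kwic("the cat sat on the mat", "the"):
    print(f"{left:>20} [{kw}] {right}")
```

A real KWIC index would also sort the output by keyword, but the
extract-and-align step above is the heart of it.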

Uri's comments about fit of your suggestions with Ack are right on --
your 1.2.3. ideas should be logged as Enhancement Requests on
http://code.google.com/p/ack/issues/list .


> My (not very good) solution to enabling search across line boundaries
> was to slurp the files wholesale. Slurping a file incurs the
> risk of running into a large file that will consume all your memory.

That works even for novel-sized files on today's machines, and anything
larger will have been split into chapter or volume files. Uri's
buffering technique -- criticized by Charles for bypassing charset
patching -- would also solve this for you.

> In NLP it's common to have large multidocument files, mostly as a
> quick and dirty way to avoid some of the filesystem costs of opening
> many, many small files, so slurping files is more likely to run into
> memory issues.

How large is large?  If you bunch small files and keep large files
single, it should be OK on any modern machine, unless you're running
Vista on a "Designed for Win NT" Pentium 1.

> I think the idea of managing a buffer is very promising, 

See Charles's discussion on-list, as Unicode compatibility of buffering
impacts your intended use!

> And, lastly, I see in my notes that doing word context is around 120x
> slower than doing character context (and I tried a bunch of ways of
> doing both), but it may very well be that there are better ways...

I fear that trying to do char, word, and line context in one piece of
code will undo the optimization we're trying to do.
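For what it's worth, the cost gap is easy to see in miniature: char
context is a constant-time slice, while word context has to tokenize
around the match. A toy Python sketch (names mine; this illustrates
the extra work, not the 120x measurement above):

```python
import re

def char_context(text, i, j, n=30):
    """Character context around match text[i:j]: two cheap slices."""
    return text[max(0, i - n):i], text[j:j + n]

def word_context(text, i, j, n=5):
    """Word context around match text[i:j]: must tokenize the
    surrounding text, which is where the extra cost comes from."""
    left = re.findall(r"\S+", text[:i])[-n:]
    right = re.findall(r"\S+", text[j:])[:n]
    return left, right
```

Keeping the two paths separate (rather than one generic context
routine) is what lets the char case stay a pair of slices.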


[EMAIL PROTECTED] @ $DayJob

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
