On Wednesday 16 July 2008, Ricker, William wrote:

> Concordances are fun!
Concordances are not only fun, but can be surprisingly enlightening for such a simple device.

> Looks not unlike KWIC generator -- Key Word In Context indexing.
> Look up modules for that! (Or ACM Algorithms.)

My concordance program is lower-level than typical concordancers. It's a bare-bones, fast, simple way to get a compact view of a raw corpus. It doesn't have common niceties like part-of-speech filtering, or silly convenience operators like "equals", "starts with", "ends with", "contains", etc. The former would require external resources (a dictionary or annotations of some sort), but one of the design goals was that it should be useful with just perl (and the power of regexes), the text itself, and nothing else. The latter would require making assumptions about how the text is tokenized. One can easily (or, well, sometimes not so easily...) craft regexes that roughly emulate most simple tokenization styles and at the same time match like the operators above (and almost any other imaginable one). As I said: rough and quick. I do like the facilities offered by other concordancers/indexers, don't get me wrong, but they are different beasts.

> Uri's comments about fit of your suggestions with Ack are right on --
> your 1.2.3. ideas should be logged as Enhancement Requests on
> http://code.google.com/p/ack/issues/list .

I'll have to look at the code in more detail to see whether it's reasonable to adapt Ack for that. In my mind Ack does indeed fall in the realm of a fast, pattern-based text search tool with a well-balanced set of options, rather than a utility that has accreted a billion barnacles of dubious utility and certain frustration.

> Works for even Novel sized files on today's machines, and anything
> larger will have been input in chapter or volume files. Uri's buffering
> technique -- criticized by Charles for bypassing charset patching --
> would also solve this for you.
The other day I was more optimistic, but now I think any buffering technique will fail for arbitrary patterns as long as Perl's regex engine can't tell us whether, while attempting a match, it hit the end of the buffer. That kind of notification would let us extend the buffer and retry the match, and thus avoid losing (longer) matches that would have been found had we been able to look past the current buffer end. If you make the buffer large enough, you can reduce such boundary cases to a usefully low fraction for most applications, but you can't eliminate them completely.

> How large is large? If you bunch small files and keep large files
> single, it should be ok on any modern machine, unless you're running
> VISTA on a "Designed for Win NT" Pentium 1.

Hmm, reasonable people will make them tens of MB. But files in the hundreds of MB are not uncommon, and occasionally, braindead GB files have "happened" to me, sigh.

> See Charles's discussion on-list, as Unicode compatibility of buffering
> impacts your intended use!

It'd be all right (though not optimal) if one could at least be sure that the text is UTF-8. And it looks like sysread has special provisions for that? From "perldoc -f sysread" (for perl 5.10):

    "Note that if the filehandle has been marked as ":utf8" Unicode
    characters are read instead of bytes (the LENGTH, OFFSET, and the
    return value of sysread() are in Unicode characters). The
    ":encoding(...)" layer implicitly introduces the ":utf8" layer.
    See "binmode", "open", and the "open" pragma, open."

So, OK, not exactly a panacea, but I guess one could preconvert to UTF-8 if necessary, then apply the ':utf8' layer when reading the converted file. Annoying and probably a speed killer, but doable...

> I fear trying to do Char, Word, and Line context in one code will undo
> the optimization we're trying to do.

Currently, that's my feeling too.
If, as I said before, the regex engine would tell us whether it reached the end of the string while attempting a match, the situation would be different. Until that glorious day, perhaps having completely separate code for each case would be a solution; which code to use could be an option to Ack. Rather than having a completely different set of options for each mode, some of the options (e.g., -A, -B, -C) would have different semantics depending on which text unit is requested. -c would count... matches? Not matching lines, I suppose. -i, --smart-case and -w could work the same.

Bernardo

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm

