On Wednesday 16 July 2008, Ricker, William wrote:
> Concordances are fun!

Concordances are not only fun, but can be surprisingly enlightening for such a 
simple device.

> Looks not unlike KWIC generator -- Key Word In Context indexing.
> Look up modules for that! (Or ACM Algorithms.)

My concordance program is lower-level than typical concordancers. It's a 
barebones, fast, simple way to get a compact view of a raw corpus. It doesn't 
have common niceties like part-of-speech filtering or silly convenience 
operators like "equals", "starts with", "ends with", "contains", etc. The 
former would require external resources (a dictionary or annotations of some 
sort), but one of the design goals was that it should be useful just with 
perl (and the power of regexes), the text itself, and nothing else. The 
latter would require making assumptions about how the text is tokenized. One 
can easily (or, well, sometimes not so easily...) craft regexes that roughly 
emulate most simple tokenization styles and at the same time match like the 
operators above (and almost any other imaginable one).
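
For instance, those operators can be faked with plain regexes. A rough sketch, assuming a simple \w-based notion of "word" (just one of many possible tokenization styles):

```perl
use strict;
use warnings;

# Rough emulation of the convenience operators with plain regexes.
# Assumes \w-based "words" -- only one of many tokenization styles.
my %op = (
    equals      => sub { qr/\b\Q$_[0]\E\b/   },
    starts_with => sub { qr/\b\Q$_[0]\E\w*/  },
    ends_with   => sub { qr/\w*\Q$_[0]\E\b/  },
    contains    => sub { qr/\w*\Q$_[0]\E\w*/ },
);

my $text = 'Concordances are fun';
my $re   = $op{starts_with}->('Concord');
my ($hit) = $text =~ /($re)/;    # $hit is 'Concordances'
```

Change the regexes and you've changed the tokenization, which is the whole point.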

As I said, rough and quick. I do like the facilities offered by other 
concordancers/indexers, don't get me wrong, but they are different beasts.

>
> Uri's comments about fit of your suggestions with Ack are right on --
> your 1.2.3. ideas should be logged as Enhancement Requests on
> http://code.google.com/p/ack/issues/list .

I'll have to look at the code in more detail to see if it's reasonable to 
adapt Ack for that. In my mind Ack does indeed fall in the realm of a fast, 
pattern-based text search tool, with a well balanced set of options, rather 
than a utility that has accreted a billion barnacles of dubious utility and 
certain frustration.

> Works for even Novel sized files on today's machines, and anything
> larger will have been input in chapter or volume files.  Uri's buffering
> technique -- criticized by Charles for bypassing charset patching --
> would also solve this for you.

The other day I was more optimistic, but now I think any buffering technique 
will fail for arbitrary patterns as long as Perl's regex engine can't inform 
us whether, while trying to match, it hit the end of the buffer. That kind of 
notification would allow us to extend the buffer and retry the match, and 
thus avoid losing (longer) matches that would've been found had we been able 
to look past the current buffer end.

Now, if you make the buffer large enough, you can minimize such boundary cases 
to a usefully low fraction for most applications, but not eliminate them 
completely.
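
To make the failure mode concrete, here is a minimal sketch of a chunk-and-carry-tail reader (not Uri's actual code; names are mine). The $overlap cap encodes exactly the assumption that arbitrary patterns can violate:

```perl
use strict;
use warnings;

# Sketch of buffered matching with a carried-over tail (not Uri's code).
# Correct only if no match is longer than $overlap -- the guarantee the
# regex engine can't give us for arbitrary patterns.
sub grep_chunked {
    my ($fh, $re, $chunk_size, $overlap) = @_;
    my @hits;
    my $buf = '';
    while (read($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        my $last_end = 0;
        while ($buf =~ /$re/g) {
            push @hits, $&;
            $last_end = pos($buf);
        }
        # carry the unmatched tail forward, capped at $overlap bytes,
        # so a match straddling the chunk boundary may still be found
        my $keep = length($buf) - $last_end;
        $keep = $overlap if $keep > $overlap;
        $buf = $keep ? substr($buf, -$keep) : '';
    }
    return @hits;
}
```

With a generous $overlap this catches a match that straddles a chunk boundary; with a too-small one, the match is silently lost, and no error tells you so.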

> How large is large?
> If you bunch small files and keep large files 
> single, it should be ok on any modern machine, unless you're running
> VISTA on a "Designed for Win NT" Pentium 1.

Hmm, reasonable people will make them tens of MB. But files in the hundreds of 
MB are not uncommon, and occasionally, braindead GB files have "happened" to 
me, sigh.

> See Charles's discussion on-list, as Unicode compatibility of buffering
> impacts your intended use!

It'd be all right (but not optimal) if one could at least be sure that the text 
is UTF-8. And it looks like sysread has special provisions for that? 
From "perldoc -f sysread" (for perl 5.10):

"Note that if the filehandle has been marked as ":utf8" Unicode characters are 
read instead of bytes (the LENGTH, OFFSET, and the return value of sysread() 
are in Unicode characters).  The ":encoding(...)" layer implicitly introduces 
the ":utf8" layer.  See "binmode", "open", and the "open" pragma, open."

So, OK, not exactly a panacea, but I guess one could preconvert to UTF-8 if 
necessary, then apply the ':utf8' layer when reading the converted file. 
Annoying and probably a speed killer, but doable...
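
Something like this, say (file names and the Latin-1 source encoding are made up for illustration; I've used buffered read() here, which likewise counts LENGTH in characters once an encoding layer is in place):

```perl
use strict;
use warnings;

# Sketch of the preconvert-then-reread idea. File names and the source
# encoding (Latin-1) are assumptions; buffered read() counts LENGTH in
# characters, not bytes, under an :encoding layer.
sub preconvert_to_utf8 {
    my ($src, $dst) = @_;
    open my $in,  '<:encoding(iso-8859-1)', $src or die "$src: $!";
    open my $out, '>:encoding(UTF-8)',      $dst or die "$dst: $!";
    print {$out} $_ while <$in>;
    close $out or die $!;
}

sub read_chars {
    my ($file, $len) = @_;
    open my $fh, '<:encoding(UTF-8)', $file or die "$file: $!";
    read($fh, my $buf, $len);    # $len is in characters here
    return $buf;
}
```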

> I fear trying to do Char, Word, and Line context in one code will undo
> the optimization we're trying to do.

Currently, that's my feeling too. If, as I said before, the regex engine would 
tell us whether it reached the end of a string while attempting a match, the 
situation would be different.

Until that glorious day, perhaps having completely separate code for each 
case would be a solution; which one to use could be an option to Ack. Rather 
than having a completely different set of options for each mode, some of the 
options (e.g., -A, -B, -C) would take on different semantics depending on 
which text 'unit' is requested. -c would count... matches? Not matching 
lines, I suppose. -i, --smart-case and -w could work the same.
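
A toy illustration of how -c's semantics could fork by unit (nothing here is ack's actual code, and the unit names are hypothetical):

```perl
use strict;
use warnings;

# Toy illustration of mode-dependent -c semantics; not ack's code,
# and the 'unit' names are hypothetical.
sub count_for_unit {
    my ($unit, $re, $text) = @_;
    if ($unit eq 'line') {          # classic grep -c: matching lines
        return scalar grep { /$re/ } split /\n/, $text;
    }
    if ($unit eq 'char') {          # char mode: individual matches
        my $n = 0;
        $n++ while $text =~ /$re/g;
        return $n;
    }
    die "unknown unit '$unit'";
}
```

On "foo\nbox\nzzz" with qr/o/, line mode reports 2 (matching lines) while char mode reports 3 (matches), which is the ambiguity in question.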


Bernardo

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
