>>>>> "BR" == Bernardo Rechea <[EMAIL PROTECTED]> writes:
BR> 1. Ack's context is heavily line-oriented. Would it be too
BR> difficult to make the fixes generic enough to parameterize the
BR> type of context unit that is shown (i.e., not only lines, but
BR> also words or characters, etc.)?

BR> 2. Another issue with exclusive line-orientation is that ack
BR> can't search across line boundaries (I'm not 100% sure, so if
BR> I'm wrong, please correct me).

ack is a grep replacement and it has always been line-oriented. the
problem with character contexts is how to track the context while you
are reading lines. maybe this can be handled in the print_context sub,
given the lines buffer as input? but i don't want to address functional
changes here. that is andy's area, as he owns ack. let's focus on the
core loop rewrite, and if we do it well, then we can try to add
features like the ones you want.

BR> 3. When wading through a corpus of billions of words, even very
BR> specific patterns are likely to return too many hits to review by
BR> hand. A useful feature of concordance.pl is to do sampling. I.e.,
BR> you can tell it "show me only 1 out of every 10 (or 100 or 1000)
BR> hits". This is useful to ensure that the search covers the corpus
BR> uniformly while reducing the data one has to eyeball.

that one seems easy. just add a decimating counter to the context
printer and print 1 of every N hits. but again, do it later.

BR> There are other features that would be interesting to have in ack
BR> (such as custom output formats), but I think the ones above are
BR> the biggies.

please address these to andy, or hold off until we finish this project
first.

BR> My (not very good) solution to enabling search across line
BR> boundaries was to slurp the files wholesale. As a side effect,
BR> this also automatically makes for very fast pattern matching. On
BR> the other hand, slurping a file incurs the risk of running into a
BR> large file that will consume all your memory.
BR> In NLP it's common to have large multidocument files, mostly as a
BR> quick and dirty way to avoid some of the filesystem costs of
BR> opening many, many small files, so slurping files is more likely
BR> to run into memory issues.

slurping is good (hey, i like it), but grepping and slurping are not
compatible, since grep can handle logs and other huge files and
slurping can't.

BR> And, lastly, I see in my notes that doing word context is around
BR> 120x slower than doing character context (and I tried a bunch of
BR> ways of doing both), but it may very well be that there are
BR> better ways...

dunno, as i don't have the code. let's get this done first and see what
we can do later.

thanx,

uri

-- 
Uri Guttman ------ [EMAIL PROTECTED] -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
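[Editor's note: the "decimating counter" sampling Uri describes (print 1
of every N hits) could be sketched as below. ack itself is Perl; this is
a hedged Python illustration, and the function name and interface are
invented for the example, not taken from ack or concordance.pl.]

```python
import re

def sample_matches(lines, pattern, every_n=10):
    """Yield only 1 out of every `every_n` lines matching `pattern`.

    A decimating counter: count hits, and emit a hit only when the
    counter lands on the start of each group of N. This keeps the
    sample spread uniformly across the corpus while cutting the
    output a reader must eyeball by a factor of N.
    """
    regex = re.compile(pattern)
    hits = 0
    for line in lines:
        if regex.search(line):
            hits += 1
            # keep the 1st, (N+1)th, (2N+1)th, ... matching line
            if (hits - 1) % every_n == 0:
                yield line

# toy corpus: 100 lines, all of which match
corpus = [f"word{i} token" for i in range(100)]
kept = list(sample_matches(corpus, r"token", every_n=10))
```

With `every_n=10` and 100 matching lines, `kept` holds 10 lines, one
from each run of 10 consecutive hits.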
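[Editor's note: the slurp-vs-line-by-line trade-off in the thread can be
shown concretely. A pattern spanning a newline can never match when the
input is read one line at a time, but matches against a slurped string.
This is a generic Python illustration, not code from ack.]

```python
import io
import re

text = "foo bar\nbaz qux\n"
# the pattern spans the line boundary between "bar" and "baz"
pattern = re.compile(r"bar\s+baz")

# line-by-line (grep-style): no single line contains the whole match
line_hits = [ln for ln in io.StringIO(text) if pattern.search(ln)]

# slurped: matching against the whole file at once crosses the newline
slurped_hit = pattern.search(text) is not None
```

`line_hits` comes back empty while `slurped_hit` is true, which is
exactly why slurping enables cross-line search, and also why it can't
replace line-at-a-time reading for logs and other huge files that won't
fit in memory.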

