>>>>> "BR" == Bernardo Rechea <[EMAIL PROTECTED]> writes:
BR> 1. Ack's context is heavily line-oriented. Would it be too
BR> difficult to make the fixes generic enough to parameterize the
BR> type of context unit that is shown (i.e., not only lines, but
BR> also words or characters, etc.)?

BR> 2. Another issue with exclusive line-orientation is that ack
BR> can't search across line boundaries (I'm not 100% sure, so if
BR> I'm wrong, please correct me).

ack is a grep replacement and it has always been line-oriented. the
problem with character contexts is how to track the context while you
are reading lines. maybe this can be handled in the print_context sub,
given the lines buffer as input? but i don't want to address functional
changes here. that is andy's area, as he owns ack. let's focus on the
core loop rewrite, and if we do it well, then we can try to add
features like the ones you want.

BR> 3. When wading through a corpus of billions of words, even very
BR> specific patterns are likely to return too many hits to review by
BR> hand. A useful feature of concordance.pl is to do sampling. I.e.,
BR> you can tell it "show me only 1 out of every 10 (or 100 or 1000)
BR> hits". This is useful to ensure that the search covers the corpus
BR> uniformly while reducing the data one has to eyeball.

that one seems easy. just add a decimating counter to the context
printer and print 1 of every N hits. but again, do it later.

BR> There are other features that would be interesting to have in ack
BR> (such as custom output formats), but I think the ones above are
BR> the biggies.

please address these to andy, or hold off until we finish this project
first.

BR> My (not very good) solution to enabling search across line
BR> boundaries was to slurp the files wholesale. As a side effect,
BR> this also automatically makes for very fast pattern matching. On
BR> the other hand, slurping a file incurs the risk of running into a
BR> large file that will consume all your memory.
BR> In NLP it's common to have large multidocument files, mostly as a
BR> quick and dirty way to avoid some of the filesystem costs of
BR> opening many, many small files, so slurping files is more likely
BR> to run into memory issues.

slurping is good (hey, i like it), but grepping and slurping are not
compatible, since grep can handle logs and other huge files and
slurping can't.

BR> And, lastly, I see in my notes that doing word context is around
BR> 120x slower than doing character context (and I tried a bunch of
BR> ways of doing both), but it may very well be that there are
BR> better ways...

dunno, as i don't have the code. let's get this done first and see what
we can do later.

thanx,

uri

-- 
Uri Guttman ------ [EMAIL PROTECTED] -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
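[Editor's note: the "decimating counter" sampling Uri describes (print 1
of every N hits) could be sketched as below. ack itself is Perl; this is
a hedged Python illustration, and the function name and interface are
invented for the example, not taken from ack or concordance.pl.]

```python
import re

def sample_matches(lines, pattern, every_n=10):
    """Yield only 1 out of every `every_n` lines matching `pattern`.

    A decimating counter: count hits, and emit a hit only when the
    counter lands on the start of each group of N. This keeps the
    sample spread uniformly across the corpus while cutting the
    output a reader must eyeball by a factor of N.
    """
    regex = re.compile(pattern)
    hits = 0
    for line in lines:
        if regex.search(line):
            hits += 1
            # keep the 1st, (N+1)th, (2N+1)th, ... matching line
            if (hits - 1) % every_n == 0:
                yield line

# toy corpus: 100 lines, all of which match
corpus = [f"word{i} token" for i in range(100)]
kept = list(sample_matches(corpus, r"token", every_n=10))
```

With `every_n=10` and 100 matching lines, `kept` holds 10 lines, one
from each run of 10 consecutive hits.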
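[Editor's note: the slurp-vs-line-by-line trade-off in the thread can be
shown concretely. A pattern spanning a newline can never match when the
input is read one line at a time, but matches against a slurped string.
This is a generic Python illustration, not code from ack.]

```python
import io
import re

text = "foo bar\nbaz qux\n"
# the pattern spans the line boundary between "bar" and "baz"
pattern = re.compile(r"bar\s+baz")

# line-by-line (grep-style): no single line contains the whole match
line_hits = [ln for ln in io.StringIO(text) if pattern.search(ln)]

# slurped: matching against the whole file at once crosses the newline
slurped_hit = pattern.search(text) is not None
```

`line_hits` comes back empty while `slurped_hit` is true, which is
exactly why slurping enables cross-line search, and also why it can't
replace line-at-a-time reading for logs and other huge files that won't
fit in memory.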

