Hi, all,

I just barely digested all the ackathon mails. I think it's a brilliant idea. 
I've been using ack for a few months now, and it's a worthy piece of code to 
spend time improving.

This also hits a particular itch of mine, namely concordance generation from a 
corpus of text. In natural language processing, it's common to deal with 
large corpora (bodies of text, for those whose Latin is a bit rusty), for 
training algorithms of many kinds. One of the first things one must do when 
dealing with new data is to eyeball it, but because it's typical for corpus 
sizes to be in the millions of documents, going file by file, while useful, 
gets one only so far. Concordances are a very compact view of the text, 
focused on a particular textual unit (word, phrase, string, whatever).

As an example, a few years ago I wrote a simple program that, when invoked like this:

    perl concordance.pl -w 80 -o 30 -re '(?i:palabra)' -fl gnews-es-small.filelist -af

produces something like this:

|de gobierno. Ni siquiera, en >>palabra<<s de Blanco, el PSOE augura un adela|
|Blanco hizo referencia a las >>palabra<<s que ayer pronunció Maragall en Bar|
|IV><STRONG>Zapatero niega la >>palabra<< a la Comunidad de Madrid</STRONG></|
|te del Gobierno ha negado la >>palabra<< a la Comunidad de Madrid. El vicepr|
| todo, otra vez, ¶    fueron >>palabra<<s altisonantes, descalificaciones e |
|;. Fueron estas las primeras >>palabra<<s de quien se erigi&#243; en ¶    ga|
|ifiesto. Porque frente a las >>palabra<<s de De ¶    Madre o de Carod, que a|

The 80 and 30 in the command line tell concordance.pl that we want 80 
characters around the match, and that the match should be centered slightly 
left of the middle of the context window (which would be at 40). 
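For the curious, the window extraction at the heart of that output is simple. 
A minimal sketch, assuming a plain-text input string (the sub name and the 
padding behavior are mine, not the original script's):

```perl
use strict;
use warnings;

# Hypothetical sketch of concordance window extraction: for each match
# of $re in $text, show a $w-character window with the match start
# placed $o characters from the left edge (cf. -w 80 -o 30 above).
sub concordance_lines {
    my ( $text, $re, $w, $o ) = @_;
    my @out;
    while ( $text =~ /$re/g ) {
        my $start = $-[0] - $o;              # $-[0] is the match start
        $start = 0 if $start < 0;            # clamp at start of text
        my $chunk = substr $text, $start, $w;
        $chunk .= ' ' x ( $w - length $chunk );   # pad short windows
        push @out, "|$chunk|";
    }
    return @out;
}
```

The real script also marks the match with >>...<< and handles newlines inside 
the window; that part is omitted here.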

I would love to ditch my concordance generator and use ack instead, but a 
couple of features would be necessary for that to happen:

1.  Ack's context is heavily line-oriented. Would it be too difficult to make 
the context handling generic enough to parameterize the type of context unit 
that is shown (i.e., not only lines, but also words or characters, etc.)?

2.  Another issue with exclusive line-orientation is that ack can't search 
across line boundaries (I'm not 100% sure, so if I'm wrong, please correct 
me).

3.  When wading through a corpus of billions of words, even very specific 
patterns are likely to return too many hits to review by hand. A useful 
feature of concordance.pl is to do sampling. I.e., you can tell it "show me 
only 1 out of every 10 (or 100 or 1000) hits". This is useful to ensure that 
the search covers the corpus uniformly while reducing the data one has to 
eyeball.
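The sampling itself is cheap to implement; here is a sketch of the 1-in-N 
filter (the sub name is illustrative):

```perl
use strict;
use warnings;

# Hypothetical 1-in-N sampler: keep the first of every $n hits, so a
# huge result set still covers the corpus uniformly but stays small
# enough to eyeball.
sub sample_hits {
    my ( $n, @hits ) = @_;
    my $count = 0;
    return grep { $count++ % $n == 0 } @hits;
}
```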

There are other features that would be interesting to have in ack (such as 
custom output formats), but I think the ones above are the biggies.

My (not very good) solution to enabling search across line boundaries was to 
slurp the files wholesale. As a side effect, this also makes for very fast 
pattern matching. On the other hand, slurping runs the risk of hitting a 
large file that consumes all your memory. In NLP it's common to have large 
multi-document files, mostly as a quick-and-dirty way to avoid some of the 
filesystem costs of opening many, many small files, so slurping is that much 
more likely to run into memory issues.
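Concretely, the slurp approach with a crude memory guard might look like this 
(the sub name and the 64 MB cap are arbitrary choices of mine):

```perl
use strict;
use warnings;

# Hypothetical sketch: slurp a file whole so patterns can cross line
# boundaries, but skip files above a size cap to bound memory use.
sub slurp_match {
    my ( $path, $pat, $max_bytes ) = @_;
    $max_bytes ||= 64 * 1024 * 1024;      # arbitrary 64 MB cap
    return if -s $path > $max_bytes;      # punt on oversized files
    open my $fh, '<', $path or die "open $path: $!";
    local $/;                             # slurp mode
    my $text = <$fh>;
    close $fh;
    return $text =~ /$pat/s;              # /s lets . cross newlines
}
```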

I think the idea of managing a buffer is very promising, and will probably 
provide the best of both worlds: control of memory usage, and reasonable 
speed.

And, lastly, I see in my notes that doing word context is around 120x slower 
than doing character context (and I tried a bunch of ways of doing both), but 
it may very well be that there are better ways...

So there is some food for thought.

Bernardo



On Tuesday 15 July 2008, Uri Guttman wrote:
> hi all,
>
> here are some design and coding notes for my vision of a new core loop
> for ack. i divided it up into three parts, reading lines, matching and
> tracking, and printing results. i don't go over all the possible options
> as that is too detailed for this email. we will discuss them as we hack
> the code tomorrow.
>
> the current ack uses a <$fh> to read each line from the input. if you
> want raw speed in perl i/o you have to bypass its i/o system (whether it
> is perlio or the older stdio). i have done this in 2 cpan modules noted
> for their speed, File::Slurp and File::ReadBackwards. the latter has
> code that is very close to what we would want for the ack read line. it
> needs to be modified to only read forward (rip out all the seek stuff)
> but the guts are close to usable as is. the other issue is what api do
> we use inside the loop. a simple call to a readline is slow as it means
> a sub call per line. it doesn't matter how that sub is written, the call
> overhead adds up. so a better way is to manage a buffer of lines and get
> a line from it in the core loop if we can. else we call the sub to
> refill the buffer with lines. then the sub calls will be negligible but
> we need to code the condition to get a line quickly.
>
> here is a sample code version of the buffer/readline stuff:
>
>       refill_lines_buffer( $fh ) if @lines <= 2 ;
>       my $line = shift @lines ;
>       last unless defined $line ;
>
> the reason for <= 2 is that if you have one 'line' left in the buffer it
> could be a partial line with the rest of it in the next block to be
> read. refill_lines_buffer will read blocks and split lines until it
> finds more than one line or eof (then the last line is full even if no
> trailing \n).
>
> here is the code from File::ReadBackwards that is the readbuffer stuff:
>
> # see if there is more than 1 line in the buffer
>
>               if ( @{$lines_ref} > 1 ) {
>
> # we have a complete line so return it
> # and convert those damned cr/lf lines to \n
>
>                       $lines_ref->[-1] =~ s/\015\012/\n/
>                                       if $self->{'is_crlf'} ;
>
>                       return( pop @{$lines_ref} ) ;
>               }
>
> # we don't have a complete line, so we have to read blocks until we do
>
>               my $seek_pos = $self->{'seek_pos'} ;
>
> # see if we are at the beginning of the file
>
>               if ( $seek_pos == 0 ) {
>
> # the last read never made more lines, so return the last line in the
> # buffer; if no lines left then undef will be returned
> # and convert those damned cr/lf lines to \n
>
>                       $lines_ref->[-1] =~ s/\015\012/\n/
>                                       if @{$lines_ref} && $self->{'is_crlf'} ;
>
>                       return( pop @{$lines_ref} ) ;
>               }
>
> # we have to read more text so get the handle and the current read size
>
>               my $handle = $self->{'handle'} ;
>               my $read_size = $self->{'read_size'} ;
>
> # after the first read, always read the maximum size
>
>               $self->{'read_size'} = $max_read_size ;
>
> # seek to the beginning of this block and save the new seek position
>
>               $seek_pos -= $read_size ;
>               sysseek( $handle, $seek_pos, SEEK_SET ) ;
>               $self->{'seek_pos'} = $seek_pos ;
>
> # read in the next (previous) block of text
>
>               my $read_cnt = sysread( $handle, $read_buf, $read_size ) ;
>
> # prepend the read buffer to the leftover (possibly partial) line
>
>               my $text = $read_buf ;
>               $text .= shift @{$lines_ref} if @{$lines_ref} ;
>
> # split the buffer into a list of lines
> # this may want to be $/ but reading files backwards assumes plain text and
> # newline separators
>
>               @{$lines_ref} = ( $self->{'sep_is_regex'} ) ?
>                       $text =~ /(.*?$self->{'rec_sep'}|.+)/gs :
>                       $text =~ /(.*?\Q$self->{'rec_sep'}\E|.+)/gs ;
>
> #print "Lines \n=>", join( "<=\n=>", @{$lines_ref} ), "<=\n" ;
>
>       }
>
> much of that can be removed (backwards seeking stuff). and you can
> change the partial line merge stuff to go forward (append to
> $lines_ref[0] and not prepend!)
>
> also the separator stuff can be simplified assuming only text files will
> be acked. but that can be handled later.
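A minimal forward-only refill along these lines might look like the following 
sketch (names and block size are illustrative; unlike the original, the 
trailing partial line is kept in a separate scalar, so every entry in @lines 
is complete):

```perl
use strict;
use warnings;

my @lines;          # buffer of complete lines
my $partial = '';   # trailing partial line from the last block

# Hypothetical forward-reading counterpart of the refill logic above:
# sysread a block, append it to the leftover partial line, and resplit
# into complete lines. Assumes plain text with \n separators.
sub refill_lines_buffer {
    my ( $fh, $read_size ) = @_;
    $read_size ||= 8192;
    while ( @lines <= 1 ) {
        my $cnt = sysread( $fh, my $buf, $read_size );
        if ( !$cnt ) {                    # eof: flush any partial line
            push @lines, $partial if length $partial;
            $partial = '';
            last;
        }
        $partial .= $buf;
        my @split = $partial =~ /(.*?\n|.+)/gs;
        $partial = ( @split && $split[-1] !~ /\n\z/ ) ? pop @split : '';
        push @lines, @split;
    }
}
```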
>
>
> -------------------------
>
> the main loop of ack has to track two main types of data, the lines
> being read and buffered for context (the range of lines before/after a
> matched line) and the info about the matched lines (i call them hits)
> themselves. ack does this with a convoluted set of data and slow copying
> of the full lines. my idea is to use two shift registers (arrays with
> push and shift) to hold the lines and the hit info. another important
> trick is to only store references to the lines and move those
> around. that will save tons of ram and copying costs. also there is a
> need to handle prefixing and marking lines (highlighting) them where
> hits are found.
>
> the line context buffer will hold a ref to each line as it comes in
> and shifts off lines when its count is greater than the context
> size. this is a shift register with a max size of the context size.
>
> the hit buffer will have a hit structure (hash ref) pushed onto it when
> the current line matches the regex. and for each line coming in, the
> first element in the hit buffer will be checked to see if it is ready to
> print (the 'after' number context lines have been read in). the hit
> contains stuff like the line number of the hit, the marked up line
> (highlight the match, etc.), the special hit prefix (the context lines
> get simpler prefixes), etc.
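A sketch of those two shift registers working together (structure and names 
are mine, not ack's; a single $context_size stands in for the before/after 
counts):

```perl
use strict;
use warnings;

my $context_size = 2;   # context lines kept before and after a hit
my @line_buf;           # shift register of refs to recent lines
my @hit_buf;            # queue of pending hits (hash refs)

# Hypothetical sketch: feed one line in, age the pending hits, and
# return a hit once its 'after' context lines have all arrived.
sub feed_line {
    my ( $line_no, $line, $re ) = @_;
    push @line_buf, \$line;
    shift @line_buf while @line_buf > 2 * $context_size + 1;
    $_->{after_left}-- for @hit_buf;     # one more line has arrived
    push @hit_buf, { line_no => $line_no, after_left => $context_size }
        if $line =~ $re;
    return ( @hit_buf && $hit_buf[0]{after_left} <= 0 )
        ? shift @hit_buf
        : undef;
}
```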
>
> now that i mentioned prefixes, this is a good time to cover them a
> bit. when a context is printed, each context line is marked with some
> form of prefix designated by command line options. this can be line
> numbers, file names, etc. a null prefix of a blank or : is possible i
> think. the matched line gets its own special prefix. my idea is to store
> context line prefixes with the line buffer. i don't want to prepend the
> prefixes as that will entail more copying of line text. instead the line
> buffer can hold a list of lists with each list having the prefix (or a
> ref to it?) and the ref to the line.
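That list-of-lists layout might be as simple as this sketch (names are 
illustrative; the prefix is only joined to the line text at print time, so 
the lines themselves are never copied):

```perl
use strict;
use warnings;

my @line_buf;   # holds [ $prefix, \$line ] pairs

# Hypothetical sketch: store a prefix alongside a ref to each line
# instead of prepending it, deferring the string work to print time.
sub buffer_line {
    my ( $line_no, $line_ref ) = @_;
    push @line_buf, [ "$line_no-", $line_ref ];   # "N-" context prefix
}

sub render_buffer {
    return join '', map { $_->[0] . ${ $_->[1] } } @line_buf;
}
```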
>
> how the hit structure tracks the after context count to determine when
> to print needs to be designed. it has several possible solutions and i
> haven't decided on one. we can talk about this at the meeting and pick
> one. it is easily changed if we don't like it.
>
> there are more options to deal with but that covers many of them and the
> rest can be cleanly bolted on to this set of structures. there are open
> detail issues and some possible design changes but this is a pretty good
> start for a fast and clean core loop.
>
> ------------------------
>
> printing a context
>
> this happens when a matched line (a hit) has been found and enough
> 'after' context lines have been read in. so this involves printing the
> line buffer but with one major difference: the hit line must be printed
> with a different prefix and with optional markup. one idea is to make
> those special hit strings in the hit
> structure. then replace the LOL for the hit line in the line buffer (save
> the original context line for reuse). now you can print the entire line
> buffer (dereferencing as needed) in one print call.
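A sketch of that swap-and-print step (names are illustrative; the print is 
replaced by a return so the result can be checked):

```perl
use strict;
use warnings;

# Hypothetical sketch: temporarily swap the hit line's [ prefix, ref ]
# pair for a marked-up version, render the whole buffer in one go,
# then restore the original pair so the buffer can be reused.
sub print_context {
    my ( $line_buf, $hit_idx, $hit_prefix, $marked_ref ) = @_;
    my $saved = $line_buf->[$hit_idx];
    $line_buf->[$hit_idx] = [ $hit_prefix, $marked_ref ];
    my $out = join '', map { $_->[0] . ${ $_->[1] } } @$line_buf;
    $line_buf->[$hit_idx] = $saved;   # put the plain line back
    return $out;                      # real code would print this
}
```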
>
> -----------------
>
> hopefully the design and notes will allow us to redo the core loop and
> get most of it working. it is broken down into three parts and we can
> split up the work (including merging, testing, etc) into several
> teams. i expect to be sticking my large schnozz into each team. :)
>
> thanx,
>
> uri



_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
