[ragel-users] Ruby buffer code for streaming scanner

Seamus Abshere Mon, 13 Jun 2011 08:53:47 -0700

hi,

The Ragel Guide has an excellent set of guidelines for how to "take on some buffer management functions" when using the longest-match operator (for scanners):

\begin{itemize}
\setlength{\parskip}{0pt}
\item Read a block of input data.
\item Run the execute code.
\item If \verb|ts| is set, the execute code will expect the incomplete
token to be preserved ahead of the buffer on the next invocation of the execute
code.
\begin{itemize}
\item Shift the data beginning at \verb|ts| and ending at \verb|pe| to the
beginning of the input buffer.
\item Reset \verb|ts| to the beginning of the buffer.
\item Shift \verb|te| by the distance from the old value of \verb|ts|
to the new value. The \verb|te| variable may or may not be valid.  There is
no way to know if it holds a meaningful value because it is not kept at null
when it is not in use. It can be shifted regardless.
\end{itemize}
\item Read another block of data into the buffer, immediately following any
preserved data.
\item Run the scanner on the new data.
\end{itemize}
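
The bookkeeping in the inner list can be sketched in plain Ruby, independent of Ragel. This is only a minimal sketch under stated assumptions: `carry_over` is a hypothetical helper name (not part of Ragel), the buffer is assumed to be an Array of byte values, and `pe` is assumed to equal `data.length`.

```ruby
# Hypothetical helper (not part of Ragel): works out what to carry into the
# next invocation of the execute code after one pass over `data`.
#   data - Array of byte values that was just scanned (pe == data.length)
#   ts   - index where the unfinished token starts, or nil if none is open
#   te   - token-end index, possibly stale, or nil
# Returns [prefix, new_ts, new_te].
def carry_over(data, ts, te)
  return [[], nil, te] if ts.nil?   # no open token: nothing to preserve

  prefix = data[ts..-1]             # bytes from ts up to the end of the buffer
  new_te = te ? te - ts : nil       # shift te by the same distance as ts
  [prefix, 0, new_te]               # ts now sits at the start of the buffer
end

data = "xxSTART_FOO ab".unpack("c*")
prefix, ts, te = carry_over(data, 2, 5)

# The next block goes immediately after the preserved bytes, and scanning
# would resume at the first new byte rather than at index 0.
next_data = prefix + "c STOP_FOO".unpack("c*")
resume_at = prefix.length
```

Keeping the carry-over in a small pure function like this makes the index arithmetic easy to check on its own, before it is mixed with generated scanner code.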

I believe this is a correct implementation in Ruby (see the #scan! method for the buffering):

=begin
%%{
  machine foo_scanner;

  foo_open = 'START_FOO';
  foo_close = 'STOP_FOO';
  foo = foo_open any* :>> foo_close;

  main := |*
    foo => { emit data[ts...te].pack('c*') };
    any;
  *|;
}%%
=end

class FooScanner
  # read stuff in 1 meg at a time
  CHUNK_SIZE = 1_048_576

  attr_reader :target

  def initialize(target)
    @target = target
    %% write data;
  end

  def emit(foo_entity)
    puts "I found a foo entity!"
    puts foo_entity
  end

  def scan!
    # Set pe so that ragel doesn't try to get it from data.length
    pe = -1
    eof = File.size(target)

    %% write init;

    prefix = []
    File.open(target) do |f|
      while chunk = f.read(CHUNK_SIZE)
        # \item Read a block of input data.
        data = prefix + chunk.unpack("c*")

        # \item Run the execute code.
        # Resume after any preserved prefix; the machine has already consumed it.
        p = prefix.length
        pe = data.length
        %% write exec;

        # \item If \verb|ts| is set, the execute code will expect the
        # incomplete token to be preserved ahead of the buffer on the next
        # invocation of the execute code.
        unless ts.nil?
          # \begin{itemize}
          # \item Shift the data beginning at \verb|ts| and ending at \verb|pe|
          # to the beginning of the input buffer.
          prefix = data[ts...pe]
          # \item Shift \verb|te| by the distance from the old value of
          # \verb|ts| to the new value. The \verb|te| variable may or may not
          # be valid. There is no way to know if it holds a meaningful value
          # because it is not kept at null when it is not in use. It can be
          # shifted regardless. [SWAPPED ORDER]
          if te
            te = te - ts
          end
          # \item Reset \verb|ts| to the beginning of the buffer. [SWAPPED ORDER]
          ts = 0
          # \end{itemize}
        else
          prefix = []
        end
        # \item Read another block of data into the buffer, immediately
        # following any preserved data.
        # \item Run the scanner on the new data.
      end
    end
  end
end


You can run it with

foo_scanner = FooScanner.new 'foo.txt'
foo_scanner.scan!

If that is good code, then perhaps it could be added as another example to the Ragel website?


Thanks,
Seamus

--
Seamus Abshere
123 N Blount St Apt 403
Madison, WI 53703
1 (201) 566-0130
