I've already been thinking for a while now that parsers need to be able to
operate in a streaming fashion (when the grammars lend themselves to it, by not
needing to look ahead much, if at all, to understand what they've already seen)
so that strings that don't fit in memory all at once can be parsed.
Any parser that returns results piecewise to the caller rather than all at once,
such as by supporting callbacks, already provides a streaming interface on that
end; it just needs to be lazy on the input end as well, and then one can parse
arbitrarily sized inputs while using little memory.
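A minimal sketch of that idea in Perl 6 (the grammar, record format, and file
name here are made up purely for illustration): read the input lazily, one line
at a time, and hand each parsed result to a callback, so memory use stays
bounded no matter how large the file is.

    grammar Record {
        token TOP   { <key> '=' <value> }
        token key   { \w+ }
        token value { \N+ }
    }

    sub parse-stream(IO::Handle $fh, &callback) {
        for $fh.lines -> $line {            # .lines is lazy; lines are read on demand
            my $m = Record.parse($line);    # only one line is held in memory at a time
            callback($m) if $m;             # results go back to the caller piecewise
        }
    }

    # e.g. parse-stream('big-input.txt'.IO.open, -> $m { say ~$m<key> });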
Christopher's example is a good one.
Another example I deal with is database dumps; the parsers in psql or mysql or
others can obviously handle SQL dump files that are many gigabytes, and so must
be parsing them in a streaming manner, yet SQL files are really just program
source code files.
-- Darren Duncan
On 2014-08-09, 3:09 PM, Fields, Christopher J wrote:
(accidentally sent to perl6-lang; apologies for cross-posting, but this seems
the more appropriate list)
I have a fairly simple question regarding the feasibility of using grammars
with commonly used biological data formats.
My main question: if I wanted to parse() or subparse() very large files (it is
not unheard of for FASTA/FASTQ or other similar data files to exceed hundreds
of GB), would a grammar be the best solution? Based on what I am reading, the
semantics appear to be greedy; for instance:
appears to be a convenient shorthand for:
since Grammar.parse() works on a Str, not an IO::Handle or Buf. Or am I
misunderstanding how this could be accomplished?
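Roughly, the kind of equivalence in question (the grammar and file names below
are placeholders, not the actual snippets):

    FASTA::Grammar.parsefile('seqs.fasta');

    # ...behaves much like slurping the whole file into one Str first:

    FASTA::Grammar.parse('seqs.fasta'.IO.slurp);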
(just to point out, I know I can use subparse() as well, but that also appears
to act on a string…)
As an example, I have a simple grammar for parsing FASTA, which is a
(deceptively) simple format for storing sequence data:
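A typical record is a '>' header line followed by one or more lines of
sequence (the data below is illustrative):

    >seq1 an example sequence
    ACGTACGTACGTACGT
    ACGTACGT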
I have a simple grammar here:
and tests here:
Tests pass with the latest Rakudo just fine.
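For illustration, a minimal Perl 6 grammar along these lines might look like
the following (a sketch only; the rule names and the letters-only alphabet are
assumptions, not the linked code):

    grammar FASTA::Sketch {
        token TOP         { <record>+ }
        token record      { <header> \n <sequence> }
        token header      { '>' <id> [ \h+ <description> ]? }
        token id          { \S+ }
        token description { \N+ }
        token sequence    { [ <seqline> \n? ]+ }
        token seqline     { <[A..Z a..z]>+ }   # letters only; real data may also contain '*' and '-'
    }

    # e.g. my $match = FASTA::Sketch.parsefile('seqs.fasta');
    #      say $match<record>.elems;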