Grammars and biological data formats

Fields, Christopher J Wed, 13 Aug 2014 01:19:25 -0700

I have a fairly simple question regarding the feasibility of using grammars 
with commonly used biological data formats.


My main question: if I wanted to parse() or subparse() very large files (it is 
not unheard of for FASTA/FASTQ or similar data files to exceed hundreds of GB), 
would a grammar be the best solution?  Based on what I am reading, the 
semantics appear to be greedy, in that the whole input is read up front; for 
instance:

    Grammar.parsefile($file)

appears to be a convenient shorthand for:

    Grammar.parse($file.slurp)

since Grammar.parse() works on a Str, not an IO::Handle or Buf.  Or am I 
misunderstanding how this could be accomplished?

(just to point out, I know I can subparse() as well but that also appears to 
act on a string…)
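
To make the concern concrete, the fallback I can imagine is splitting the file 
into records lazily and running a small grammar on one record at a time, so 
only a single record is ever held as a Str.  This is only a rough sketch: the 
FastaRecord grammar and the file name seqs.fa are hypothetical stand-ins (this 
is not my actual grammar), and it assumes every record starts with a '>' 
header line:

```raku
# Hypothetical per-record grammar: one header line, then sequence lines.
grammar FastaRecord {
    token TOP      { <header> <sequence> }
    token header   { '>' \N* \n }
    token sequence { [ \N* \n ]* }
}

# Lazily yield one record (header plus sequence lines) at a time.
# IO.lines reads the file lazily and strips the line endings.
sub fasta-records($path) {
    gather {
        my @buf;
        for $path.IO.lines -> $line {
            # A new header closes the record collected so far.
            if $line ~~ /^ '>'/ && @buf {
                take @buf.join("\n") ~ "\n";
                @buf = ();
            }
            @buf.push($line);
        }
        # Emit the final record, if any.
        take @buf.join("\n") ~ "\n" if @buf;
    }
}

# 'seqs.fa' is a made-up file name for illustration.
for fasta-records('seqs.fa') -> $rec {
    my $m = FastaRecord.parse($rec) or die "could not parse record";
    say $m<header>.Str.chomp;
}
```

That keeps memory bounded by the largest record, but it means doing the record 
splitting by hand outside the grammar, which is the part I was hoping a 
grammar could handle for me.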

As an example, I have a simple grammar for parsing FASTA, which is a 
(deceptively) simple format for storing sequence data:

    http://en.wikipedia.org/wiki/FASTA_format

The grammar is here:

    https://github.com/cjfields/bioperl6/blob/master/lib/Bio/Grammar/Fasta.pm6

and tests here:

    https://github.com/cjfields/bioperl6/blob/master/t/Grammar/fasta.t

Tests pass with the latest Rakudo just fine.

chris
