Grammars and biological data formats

2014-08-09 Thread Fields, Christopher J
(accidentally sent to perl6-lang, apologies for cross-posting but this seems 
more appropriate)

I have a fairly simple question regarding the feasibility of using grammars 
with commonly used biological data formats.  

My main question: if I wanted to parse() or subparse() very large files (it is not 
unheard of for FASTA/FASTQ or other similar data files to exceed hundreds of GB), 
would a grammar be the best solution?  Based on what I am reading, the semantics 
appear to be greedy; for instance:

   Grammar.parsefile($file)

appears to be a convenient shorthand for:

   Grammar.parse($file.slurp)

since Grammar.parse() works on a Str, not an IO::Handle or Buf.  Or am I 
misunderstanding how this could be accomplished?

(just to point out, I know I can subparse() as well but that also appears to 
act on a string…)

As an example, I have a simple grammar for parsing FASTA, which is a (deceptively) 
simple format for storing sequence data:

   http://en.wikipedia.org/wiki/FASTA_format

I have a simple grammar here:

   https://github.com/cjfields/bioperl6/blob/master/lib/Bio/Grammar/Fasta.pm6

and tests here:

   https://github.com/cjfields/bioperl6/blob/master/t/Grammar/fasta.t

Tests pass with the latest Rakudo just fine.
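For context, the linked grammar is roughly of this shape — a simplified sketch with illustrative rule names and character class, not the actual file:

```raku
# Sketch of a minimal FASTA grammar; rule names and the sequence
# character class are illustrative, not the linked implementation.
grammar Fasta::Sketch {
    token TOP         { <record>+ }
    token record      { '>' <description> \n <sequence> }
    token description { \N+ }                          # rest of the header line
    token sequence    { [ <[A..Z a..z * \-]>+ \n? ]+ } # one or more sequence lines
}

my $fasta = ">seq1 test\nMVLSPADKTNVKAAW\n>seq2\nACGTACGT\n";
my $match = Fasta::Sketch.parse($fasta);
say $match<record>.elems;  # 2
```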

chris



Re: Grammars and biological data formats

2014-08-09 Thread timo
(accidentally sent this privately only, now re-sending to the list)

Hello Christopher,

In the Perl 6 specification, there are plans for lazy and
memory-releasing ways to parse strings that are either too large to fit
into memory at once or that are generated lazily (for example, streamed
in over the network or coming from live data sources). Sadly, none of
those features are implemented in either of our backends.

The simplest thing we have is the cut rule, which should instruct the
grammar engine to deallocate the parts of the input data that are before
the current cursor. Sadly, this is not going to help you much at this stage.

Another unhelpful thing is that our lazy lists (such as the ones you can
generate with gather/take or the one lines() gives you) keep every item,
from the very first to the last you've requested, around until the whole
list becomes garbage and is collected.

It would seem you'll want to iterate through the data line by line using
get() rather than lines(), and parse the individual records manually;
the grammar seems sufficiently simple for that to work.
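To make that concrete, here is a rough sketch of the approach: buffer lines until the next header, then parse one record at a time, so only a single record is ever held in memory. The grammar, helper sub, and sample input are all illustrative.

```raku
# Sketch: read line by line, parse one record at a time.
# FastaRecord and process-record are illustrative stand-ins.
grammar FastaRecord {
    token TOP         { '>' <description> \n <sequence> }
    token description { \N+ }
    token sequence    { [ \w+ \n? ]+ }
}

sub process-record(@lines) {
    my $m = FastaRecord.parse(@lines.join("\n") ~ "\n");
    say "parsed: $m<description>" if $m;
}

# With a real file this would be: my $fh = open 'big.fasta'; ... $fh.get ...
my @input = ">seq1 test", "MVLSPADKTNVKAAW", ">seq2", "ACGT";
my @buffer;
for @input -> $line {
    if $line.starts-with('>') && @buffer {
        process-record(@buffer);   # emit the completed record
        @buffer = ();              # release it before reading on
    }
    @buffer.push($line);
}
process-record(@buffer) if @buffer;
```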

Something that does surprise me is that your tests seem to imply that :p
for subparse doesn't work. I'll look into that, because I believe it
ought to be implemented already. Perhaps not properly hooked up, though.

Hope to help!
- Timo




Re: Grammars and biological data formats

2014-08-09 Thread timo

On 08/10/2014 12:21 AM, t...@wakelift.de wrote:
 Something that does surprise me is that your tests seem to imply that :p
 for subparse doesn't work. I'll look into that, because I believe it
 ought to be implemented already. Perhaps not properly hooked up, though.

On #perl6 I got corrected quite quickly: subparse is anchored to the
start and end of the target string, so :pos doesn't make sense. In this
case, you want just .parse.

Another thing: if lines() does keep all the data around, that should be
considered a bug. We should be able to infer that the list itself is not
kept around, so its earlier values can never be referred to again; thus
we should free the memory for the earlier lines in the target string
once the loop is done with them.

I have not yet tested whether this is the case, though.

Hope that clears up a bit of potential confusion before it can arise.
  - Timo



Re: Grammars and biological data formats

2014-08-09 Thread Darren Duncan
I've already been thinking for a while now that parsers need to be able to 
operate in a streaming fashion (when the grammars lend themselves to it, by not 
needing to look ahead much, if at all, to understand what they've already seen), 
so that strings that don't fit in memory all at once can be parsed.


Any parser that returns results piecewise to the caller rather than all at once, 
such as by supporting callbacks, already makes for a streaming interface on that 
end, so it just needs to be lazy on the input end as well, and then one can 
parse arbitrarily sized inputs while using little memory.
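As a sketch of that shape of interface (all names here are hypothetical, not an existing API): a reader that hands each completed record to a callback as it walks the input, so memory use stays proportional to one record rather than the whole file.

```raku
# Hypothetical callback-style streaming reader: the caller receives each
# record as soon as it is complete; nothing earlier is retained.
sub stream-fasta($fh, &on-record) {
    my ($header, @seq);
    while defined(my $line = $fh.get) {
        if $line.starts-with('>') {
            on-record($header, @seq.join) if $header.defined;
            $header = $line.substr(1);   # drop the leading '>'
            @seq = ();
        }
        else {
            @seq.push($line);
        }
    }
    on-record($header, @seq.join) if $header.defined;
}

# Demo on a small temp file; a real caller would pass open('big.fasta').
my $path = $*TMPDIR.add('fasta-demo.fa');
$path.spurt(">seq1\nACGT\nTTAA\n>seq2\nGGCC\n");
stream-fasta(open($path), -> $id, $seq { say "$id: {$seq.chars} bp" });
# seq1: 8 bp
# seq2: 4 bp
```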


Christopher's example is a good one.

Another example I deal with is database dumps: the parsers in psql, mysql, and 
others can clearly handle SQL dump files that are many gigabytes in size, and are 
evidently parsing them in a streaming manner; yet SQL files are really just 
program source code files.


-- Darren Duncan

On 2014-08-09, 3:09 PM, Fields, Christopher J wrote:







Re: Grammars and biological data formats

2014-08-09 Thread Fields, Christopher J

 On Aug 9, 2014, at 5:25 PM, t...@wakelift.de wrote:
 
 
 On 08/10/2014 12:21 AM, t...@wakelift.de wrote:
 Something that does surprise me is that your tests seem to imply that :p
 for subparse doesn't work. I'll look into that, because I believe it
 ought to be implemented already. Perhaps not properly hooked up, though.
 
 On #perl6 I got corrected quite quickly: subparse is anchored to the
 start and end of the target string, so :pos doesn't make sense. In this
 case, you want just .parse

I mainly tested subparse() to see if it would find the second FASTA record 
(which works if using :p and not :pos).

Sorry, I should have updated that, but subparse() with :p works fine; the spec 
mentions :pos though (I plan on submitting a pull request on that).

 Another thing: if lines() does keep all the data around, that should be
 considered a bug. We should be able to infer that the list itself is not
 kept around, so its earlier values can never be referred to again; thus
 we should free the memory for the earlier lines in the target string
 once the loop is done with them.
 
 I have not yet tested whether this is the case, though.
 
 Hope that clears up a bit of potential confusion before it can arise
  - Timo

I can try that out.

Chris