> (I'm interested in XML processing as well -
> also large files, though not for bio stuff)
> can you show a test case (actual source code,
> XML input data, and your performance measurements)?
Probably - though the data file I used is a bit large (eight gigabytes),
so it's not ideal to ship around as a test case.
> what is meant by "the parsing is lazy" exactly?
I don't know, did I use that term?
> You want a BlastResult with a lazy list of results
> (containing BlastRecords with a lazy list of hits, etc)?
No - that is the case now, but I generally just discard the top
BlastResult "node", and extract the results -- as a lazy list.
> but you still want to accept valid files only?
I can live with getting an error message after partial processing. The
XML is machine generated, so any error is an upstream software error -
it has to be fixed, and then the whole thing re-run.
And tagsoup is lenient; I don't think it cares much about validity or
even well-formedness.
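For instance, feed it something that isn't even well-formed and it just
hands back the tags as they appear:

  ghci> :m + Text.HTML.TagSoup
  ghci> parseTags "<a><b>not closed"
  [TagOpen "a" [],TagOpen "b" [],TagText "not closed"]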
One thing that might work would be to replace the hierarchical
structure:
data BlastResult = BlastResult
  { results :: [BlastRecord], ... }
data BlastRecord = BlastRecord
  { query :: ..., hits :: [BlastHit], ... }
data BlastHit = BlastHit
  { target :: ..., matches :: [BlastMatch], ... }
data BlastMatch = BlastMatch
  { position :: ..., ... }
with a flat one, e.g.:
data BlastFlat = BlastFlat
  { query :: ..., target :: ..., position :: ..., ... }
This means you will repeat lots of information in subsequent records,
but it would probably avoid the spikes in memory use, and it certainly
avoids the lists of sub-elements.
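A sketch of the flattening, as a list comprehension over the nested
structure (self-contained, with stand-in field types and prefixed
accessor names to avoid record-field clashes):

  data BlastRecord = BlastRecord { recQuery   :: String
                                 , recHits    :: [BlastHit] }
  data BlastHit    = BlastHit    { hitTarget  :: String
                                 , hitMatches :: [BlastMatch] }
  data BlastMatch  = BlastMatch  { matchPos   :: Int }
  data BlastFlat   = BlastFlat   { query, target :: String
                                 , position :: Int } deriving Show

  flatten :: [BlastRecord] -> [BlastFlat]
  flatten recs =
    [ BlastFlat { query    = recQuery r   -- repeated per hit and match
                , target   = hitTarget h  -- repeated per match
                , position = matchPos m }
    | r <- recs, h <- recHits r, m <- hitMatches h ]

And since the comprehension is lazy, the flat list is produced
incrementally from the lazy list of records.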
-k
--
If I haven't seen further, it is by standing in the footprints of giants