> (I'm interested in XML processing as well -
> also large files, though not for bio stuff)
> can you show a test case (actual source code,
> XML input data, and your performance measurements)?
Probably - though the data file I used is a bit large (eight gigabytes),
so it's not ideal to ship around as a test case.
> what is meant by "the parsing is lazy" exactly?
I don't know, did I use that term?
> You want a BlastResult with a lazy list of results
> (containing BlastRecords with a lazy list of hits, etc)?
No - that is the case now, but I generally just discard the top
BlastResult "node", and extract the results -- as a lazy list.
> but you still want to accept valid files only?
I can live with getting an error message after partial processing. The
XML is machine generated, so any error is an upstream software error -
it has to be fixed, and then the whole thing re-run.
And tagsoup is lenient; I don't think it cares much about validity or
even well-formedness.
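For instance, feed it something that isn't even well-formed and it just
hands back the tags as they appear:

  ghci> :m + Text.HTML.TagSoup
  ghci> parseTags "<a><b>not closed"
  [TagOpen "a" [],TagOpen "b" [],TagText "not closed"]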
One thing that might work would be to replace the hierarchical
structure:
data BlastResult = BlastResult
  { results :: [BlastRecord], ... }
data BlastRecord = BlastRecord
  { query :: ..., hits :: [BlastHit], ... }
data BlastHit = BlastHit
  { target :: ..., matches :: [BlastMatch], ... }
data BlastMatch = BlastMatch
  { position :: ..., ... }
with a flat one, e.g.:
data BlastFlat = BlastFlat
  { query :: ..., target :: ..., position :: ..., ... }
This means you will repeat lots of information in subsequent records,
but it would probably avoid the spikes in memory use, and it certainly
avoids the lists of sub-elements.
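A sketch of the flattening, as a list comprehension over the nested
structure (self-contained, with stand-in field types and prefixed
accessor names to avoid record-field clashes):

  data BlastRecord = BlastRecord { recQuery   :: String
                                 , recHits    :: [BlastHit] }
  data BlastHit    = BlastHit    { hitTarget  :: String
                                 , hitMatches :: [BlastMatch] }
  data BlastMatch  = BlastMatch  { matchPos   :: Int }
  data BlastFlat   = BlastFlat   { query, target :: String
                                 , position :: Int } deriving Show

  flatten :: [BlastRecord] -> [BlastFlat]
  flatten recs =
    [ BlastFlat { query    = recQuery r   -- repeated per hit and match
                , target   = hitTarget h  -- repeated per match
                , position = matchPos m }
    | r <- recs, h <- recHits r, m <- hitMatches h ]

And since the comprehension is lazy, the flat list is produced
incrementally from the lazy list of records.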
-k
--
If I haven't seen further, it is by standing in the footprints of giants