Hi,

Dumb / noob question I am sure, but...

I am parsing the results of a GenBank query obtained using esearch / efetch:

http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

The XML looks like this:

http://pastebin.com/f3ef02d85

The only difference is that the real document has (possibly) millions of <Seq-entry> elements.

I decided to try XSLT to turn the XML into tabular output. This works fine on a sample of the data: I get one row of output per Seq-entry, which is exactly what I want. For reference, my XSLT style sheet is here:

http://pastebin.com/f3a512411

I am not sure how efficient that XSLT is (I have never used XSLT before), but that isn't the real problem. The real problem is that both XSLT processors I have tried (xsltproc and XML::XSLT) need to slurp up the whole XML document before they output any rows of text. This is far too memory intensive, especially as the data may well grow.

I figure that I can't be the first person to parse GenBank, so I was wondering what is 'out there' in terms of community consensus on how to do it.

I had a quick go with XML::Simple, but I rapidly got lost in the resulting data structure, which leads to very messy (hard to read / write) and generally unmaintainable code.

Are the various 'BioX' modules any good? That is, do they simplify the resulting data enough to make it easy to get tab-delimited dumps of the data?

Cheers,
Dan.

--
http://network.nature.com/profile/dan
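
To make the requirement concrete (one tab-delimited row per Seq-entry, written as soon as that entry has been read, with memory use that does not grow with the size of the file), here is a minimal sketch using an incremental parser. It is in Python with the standard library's xml.etree.ElementTree.iterparse, which is a swap from XSLT rather than anything used in the post above. The function name seq_entry_rows, the COLUMNS list and the sample data are made up for illustration, and the two child element names are assumptions about the efetch XML; they would need to be adjusted to match the real document.

```python
import io
import sys
import xml.etree.ElementTree as ET

# Assumed element names inside each <Seq-entry>; adjust to the real XML.
COLUMNS = [".//Textseq-id_accession", ".//Seq-inst_length"]

# Made-up sample data, including one nested <Seq-entry>.
SAMPLE = b"""<Bioseq-set><Bioseq-set_seq-set>
<Seq-entry><Textseq-id_accession>ACC0001</Textseq-id_accession>
  <Seq-inst_length>120</Seq-inst_length></Seq-entry>
<Seq-entry><Textseq-id_accession>ACC0002</Textseq-id_accession>
  <Seq-entry><Textseq-id_accession>ACC0003</Textseq-id_accession></Seq-entry>
</Seq-entry>
</Bioseq-set_seq-set></Bioseq-set>"""


def seq_entry_rows(source, columns, record_tag="Seq-entry"):
    """Yield one list of strings per outermost record element.

    Each finished record is detached from its parent, so the tree held
    in memory never contains more than one complete record.
    """
    stack = []        # elements that are currently open, outermost first
    open_records = 0  # number of record_tag elements currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            if elem.tag == record_tag:
                open_records += 1
            continue
        stack.pop()  # "end" event: elem is now complete
        if elem.tag != record_tag:
            continue
        open_records -= 1
        if open_records == 0:  # only outermost records produce a row
            yield [(elem.findtext(path) or "").strip() for path in columns]
            if stack:
                stack[-1].remove(elem)  # drop the record from the tree


if __name__ == "__main__":
    # Usage: python this_script.py < efetch_output.xml > table.tsv
    for row in seq_entry_rows(sys.stdin.buffer, COLUMNS):
        print("\t".join(row))
```

Nested Seq-entry elements are folded into the row of the outermost entry here (the first matching value wins); whether that is the right choice depends on how the real data uses nesting. Values containing tabs or newlines are not escaped in this sketch.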
