You are probably looking for a SAX parser:

http://en.wikipedia.org/wiki/Simple_API_for_XML

I have my own hard-coded C++ that I use for my string-processing rules source code, FDA AERA SGML parsing, SOAP utilities, etc. It outputs all the fields in a simple "label value" format, one per line, but there are SAX libraries in just about every language. Personally, I finally gave up on Perl because its speed, at least under Cygwin, was unpredictable and degraded quickly once you ran out of physical memory.

Mike Marchywka

> Date: Thu, 4 Sep 2008 15:29:20 +0100
> From: [EMAIL PROTECTED]
> To: [email protected]
> Subject: [BiO BB] Parsing GenBank XML?
>
> Hi,
>
> Dumb / noob question I am sure but... I am parsing the results of a
> GenBank query obtained using esearch / efetch:
>
> http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
>
> The XML looks like this...
>
> http://pastebin.com/f3ef02d85
>
> the only difference being that the real document has (possibly)
> millions of 's.
>
> I decided to try to use XSLT to turn the XML into tabular output. This
> is working fine on a sample of the data. I get one row of data per
> Seq-entry, which is exactly what I want. For reference, my XSLT style
> sheet is here:
>
> http://pastebin.com/f3a512411
>
> I am not sure how efficient that XSLT is (I never used XSLT before),
> however, that isn't the real problem. The real problem is that the
> XSLT parsers that I have tried (xsltproc and XML::XSLT) both need to
> slurp up the whole XML document before they output any rows of text.
> This is way too memory intensive, especially as the data may well grow.
>
> I figure that I can't be the first person to parse GenBank, so I was
> wondering what is 'out there' in terms of community consensus on how
> to do it...
>
> I had a quick go with XML::Simple, but I rapidly get lost in the
> resulting data structure, which I find leads to very messy (hard to
> read / write) and generally unmaintainable code.
>
> Are the various 'BioX' modules any good? i.e. do they simplify the
> resulting data to make it easy to get tab delimited dumps of the data?
>
> Cheers,
>
> Dan.
>
> --
> http://network.nature.com/profile/dan
>
> _______________________________________________
> BBB mailing list
> [email protected]
> http://www.bioinformatics.org/mailman/listinfo/bbb
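
For what it's worth, here is a rough sketch of the SAX idea in Python using the standard library's xml.sax. It streams the document and writes one "label value" line per leaf element, with a blank line after each record, so memory use should not grow with the number of records. This is not my C++ code, just an illustration. The record tag "Seq-entry" is taken from the question; the class name LabelValueHandler, the function dump_fields, and the element names in the demo (Bioseq-set, id, len) are made up for the example and are not the real GenBank schema.

```python
import io
import sys
import xml.sax


class LabelValueHandler(xml.sax.ContentHandler):
    """Write one "label value" line per leaf element while the parser streams."""

    def __init__(self, out, record_tag="Seq-entry"):
        super().__init__()
        self.out = out
        self.record_tag = record_tag  # element that closes one record
        self.text = []                # character chunks of the current element
        self.has_child = []           # per open element: has it seen a child?

    def startElement(self, name, attrs):
        if self.has_child:
            self.has_child[-1] = True  # the parent is not a leaf
        self.has_child.append(False)
        self.text = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        self.text.append(content)

    def endElement(self, name):
        is_leaf = not self.has_child.pop()
        value = "".join(self.text).strip()
        if is_leaf and value:
            self.out.write(f"{name} {value}\n")
        if name == self.record_tag:
            self.out.write("\n")  # blank line separates records
        self.text = []


def dump_fields(source, out, record_tag="Seq-entry"):
    """Parse `source` (a filename or binary stream) and write fields to `out`."""
    xml.sax.parse(source, LabelValueHandler(out, record_tag))


if __name__ == "__main__":
    # Hypothetical sample; the real efetch output has different element names.
    sample = (b"<Bioseq-set>"
              b"<Seq-entry><id>A1</id><len>10</len></Seq-entry>"
              b"<Seq-entry><id>B2</id><len>20</len></Seq-entry>"
              b"</Bioseq-set>")
    dump_fields(io.BytesIO(sample), sys.stdout)
```

For a real download you would pass the file name (or an open binary stream) instead of the in-memory sample. Turning the "label value" lines into one tab-delimited row per record is then a small post-processing step, or you could collect the values in the handler and write the row when the record tag closes.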
