VTD-XML is also worth mentioning: http://vtd-xml.sf.net
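
For comparison, the hybrid approach Mark describes in the quoted message below (SAX walks the file, each repeated element is handed to a small DOM, and XPath pulls out the interesting bits) can be sketched with nothing but the JDK. This is only a rough illustration under my own assumptions: the class name HybridSketch, the method extract, and the toy element names "hits", "Hit" and "id" are made up for the example and are not taken from BioJava or from the real BLAST DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of BioJava.
class HybridSketch {

    /**
     * Streams the input with SAX. Each element named recordName is copied into
     * its own small DOM, queried with one XPath expression, then dropped, so
     * only one record is held in memory at a time.
     */
    static List<String> extract(final InputSource in, final String recordName,
                                final String xpathExpr) throws Exception {
        final List<String> results = new ArrayList<String>();
        final XPath xpath = XPathFactory.newInstance().newXPath();
        final DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();

        DefaultHandler handler = new DefaultHandler() {
            private Document doc;  // DOM of the record being read; null between records
            private Node current;  // node that receives the next child

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if (doc == null) {
                    if (!qName.equals(recordName)) {
                        return;    // outside any record: ignore
                    }
                    doc = builder.newDocument();
                    current = doc;
                }
                Element e = doc.createElement(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    e.setAttribute(atts.getQName(i), atts.getValue(i));
                }
                current.appendChild(e);
                current = e;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (doc != null) {
                    current.appendChild(
                            doc.createTextNode(new String(ch, start, length)));
                }
            }

            @Override
            public void endElement(String uri, String local, String qName)
                    throws SAXException {
                if (doc == null) {
                    return;
                }
                current = current.getParentNode();
                if (current == doc) {  // the record element itself just closed
                    try {
                        results.add(xpath.evaluate(xpathExpr, doc));
                    } catch (XPathExpressionException ex) {
                        throw new SAXException(ex);
                    }
                    doc = null;        // let this record's DOM be collected
                    current = null;
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return results;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<hits><Hit><id>a</id></Hit><Hit><id>b</id></Hit></hits>";
        List<String> ids =
                extract(new InputSource(new StringReader(xml)), "Hit", "/Hit/id");
        System.out.println(ids);
    }
}
```

The parser here is left non-namespace-aware, so elements are matched on their raw qualified name; a real format with namespaces would probably want a namespace-aware factory and createElementNS instead.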
----- Original Message -----
From: "Mark Schreiber" <[EMAIL PROTECTED]>
To: "Andy Yates" <[EMAIL PROTECTED]>
Cc: "biojava-1 mailing list" <[email protected]>
Sent: Thursday, November 29, 2007 6:28 PM
Subject: Re: [Biojava-l] SAX, DOM, XPath and Flat files

> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> not XQuery although XPath is probably more important for this use.
>
> The DOM model is a direct implementation of the W3C standard which
> makes it a little awkward from a Java point of view but it is usable.
>
> Java 6 has StAX (the other one).
>
> There are a few Java APIs for parsing ASN.1 mostly developed for the
> telco industry, I've never really looked into which is best (anyone
> experienced with this?) but we could probably use one to work directly
> off NCBI ASN.1
>
> - Mark
>
> On Nov 28, 2007 10:29 PM, Andy Yates <[EMAIL PROTECTED]> wrote:
>> Hi Mark,
>>
>> Okay that sounds like a perfectly sensible way to deal with this. Is
>> this kind of parsing model supported in Java5? I only ask as I've not
>> done a lot of XML parsing with Java5; more with things like XOM (which I
>> think offers a DOM only representation but I'm probably wrong).
>>
>> That's good. There's not a huge point to have a format & a DTD/XSD and
>> then have your files not conform to it.
>>
>> I was thinking the exact same thing about ASN.1 (well that & it looks
>> bleeding horrible to parse but that is an un-educated look at the format
>> which I'm sure is as parsable as JSON & the like).
>>
>> When it comes to flat file parsers I would be happier to provide
>> implementations of the more common formats where a viable alternative is
>> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide
>> similar output to the above have a chance to write their own
>> parsers/formatters.
>> This is very similar to the current situation but we
>> just need to remove dependencies on statically located data structures
>> (don't get rid of them completely just give users an option to not use
>> them).
>>
>> I'm not sure how much automatically generated parsers would help us. I
>> guess it depends on the data model(s) we use if they are auto-parser
>> friendly (which normally means POJO/JavaBean conventions including the
>> no-args constructor).
>>
>> Cool I don't want to exclude flat file parsers completely (if only
>> because my group has an interest in BioJava being able to read & write
>> flat files) :)
>>
>> They decided to have HUPO-PSI Format instead :)
>>
>> Andy
>>
>>
>> Mark Schreiber wrote:
>> > Hi -
>> >
>> > I think in most cases huge XML files in bioinformatics result from a
>> > single XML containing multiple repetitive elements. E.g. a BLAST XML
>> > output with several hits or a GenBankXML with many Sequences. A nice
>> > approach I have seen for dealing with these is to use SAX to read over
>> > the file and every time it comes to an element it delegates to a DOM
>> > object. You then parse the bits of the DOM you want with XPath or
>> > convert to objects or something and then when you are finished with
>> > that entry everything gets garbage collected and the SAX parser moves
>> > to the next element and repeats the whole process. This is a hybrid
>> > of event based parsing and object-model based parsing which could let
>> > you efficiently deal with huge files.
>> >
>> > I think the BLAST XML has improved substantially, at least in terms of
>> > validating against its own DTD. The DTD itself may not be the best
>> > design but that is always a matter of taste and if you are using XPath
>> > to get the relevant bits you don't need to make a SAX parser jump
>> > through hoops to get them.
>> >
>> > I agree we will have to keep flat file parsers but we should strongly
>> > encourage the use of XML where possible.
>> > It is simply easier to deal
>> > with. Most biological flat-files were designed for Fortran and mainly
>> > for human consumption. There is no obvious validation mechanism.
>> > Notably everything in NCBI is derived from ASN.1, what you see in the
>> > flatfile is produced from there. I tend to think this means that the
>> > ASN.1 is the holy gospel and what you get in the flat file is some
>> > translation. Ideally NCBI files should be parsed from the ASN.1 where
>> > you can guarantee validation, the more practical alternative is to use
>> > the XML which you can at least validate against a DTD.
>> >
>> > With XML we (Biojava) can say if it validates we will parse it and if
>> > it doesn't we may not. With flat files there are so many dodgy
>> > variants we cannot say anything. Because XML DTDs (or XSDs) have
>> > versions it also makes it much easier to have parsers for different
>> > versions and the parsing machinery can figure out which is needed.
>> > With flat files it is anyone's guess what version you are dealing with.
>> >
>> > Finally parsers can be auto-generated for XML if you have the DTD or
>> > XSD. This often doesn't give you an ideal parser but it can be a
>> > useful starting point for rapid development.
>> >
>> > For Biojava v 3 I think we should concentrate on XML parsers first and
>> > flat files second. <sigh>if only Fasta had an XML format</sigh>
>> >
>> > - Mark
>> >
>> > On Nov 27, 2007 11:16 PM, Andy Yates <[EMAIL PROTECTED]> wrote:
>> >> I was always under the impression that blast's XML output was nearly
>> >> as hard to parse as the flat file format but I do agree that if we
>> >> can use XML whenever we can it would make writing parsers a lot
>> >> easier (especially if there are SAX based XPath libraries available).
>> >> Actually this brings up a good question about development of this
>> >> type of parser.
>> >> The majority of XPath supporting libraries are DOM based which will
>> >> mean large memory usage in some situations but overall providing an
>> >> easier coding experience (and hopefully reduce our chances of
>> >> creating bugs). Or should we code to the edge cases of someone
>> >> trying to parse a 1GB XML? Personally I'd favour the former.
>> >>
>> >> Going back to the original topic there are going to be situations
>> >> where people want the flat file parsers/writers & I think it's a
>> >> valid point to say this is where BioJava is meant to come in & help
>> >> a developer. After all XML is a computer science problem whereas
>> >> parsing an EMBL flat file or blast output is a bioinformatics
>> >> problem.
>> >>
>> >> Andy
>> >>
>> >>
>> >> Mark Schreiber wrote:
>> >>> For a long time now my feeling has been that we should *only*
>> >>> support the XML version of blast output. The other formats are too
>> >>> brittle to be easy to parse. I also feel similarly about Genbank,
>> >>> EMBL, etc that may be an extreme view but the power of generic XML
>> >>> parsers and things like XPath etc really make these formats look
>> >>> very attractive.
>> >>>
>> >>> - Mark
>> >>>
>> >>>
>> >>> On Nov 27, 2007 7:47 PM, Andy Yates <[EMAIL PROTECTED]> wrote:
>> >>>> I think Groovy have adopted a similar system recently & have
>> >>>> guidelines for how each module should behave (dependencies, build
>> >>>> system etc). This enforces the idea that a module whilst not part
>> >>>> of the core project must behave in the same manner the core does.
>> >>>> I do like the idea that we can have a core biojava & things get
>> >>>> added around it & it might encourage other users to start
>> >>>> developing their own modules for any formats/purpose they want.
>> >>>>
>> >>>> Richard Holland wrote:
>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>> Hash: SHA1
>> >>>>>
>> >>>>>> What format options are there from blast? Just thinking if it
>> >>>>>> supports CIGAR or something like that are we better providing a
>> >>>>>> parser for that format & saying that we do not support the
>> >>>>>> traditional blast output? That said it doesn't help us when
>> >>>>>> that format changes so maybe what is needed is a way to push
>> >>>>>> out parser changes without requiring a full biojava release
>> >>>>>> (v3 discussion) ...
>> >>>>> Exactly! So the modular idea would work nicely here - we could
>> >>>>> have a blast module and only update that single module (which
>> >>>>> would be its own JAR) whenever the format changes. In a way,
>> >>>>> BioJava releases as such would no longer happen, except maybe
>> >>>>> for some kind of core BioJava module. Everything would be done
>> >>>>> in terms of individual module+JAR releases instead - one for
>> >>>>> Genbank, one for BioSQL, one for NEXUS, one for Phylogenetic
>> >>>>> tools, one for translation/transcription, etc. etc.
>> >>>>>
>> >>>>> cheers,
>> >>>>> Richard
>> >>>> _______________________________________________
>> >>>> Biojava-l mailing list - [email protected]
>> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
