I think I've seen XPath hanging around in other people's code in a 1.5 code-base (in fact, in code from one of the guys I work with). I've used Java's DOM before & it really isn't very nice; it's quite verbose. I'd prefer it if there was a better alternative/wrapper around the XML parsers, just to cut down on code chatter.
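For what it's worth, the XPath API that ships with Java 5 (javax.xml.xpath) already removes most of the DOM walking. Here is a minimal sketch, assuming a made-up document: the `XPathSketch` class and the `entries`/`entry`/`name` elements are purely illustrative and not anything from BioJava.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class XPathSketch {
    /** Parses the XML into a DOM and evaluates an XPath expression to a string. */
    static String evaluate(String xml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(String, Object) returns the string value of the first match
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data; element and attribute names are assumptions.
        String xml = "<entries><entry id=\"a1\"><name>foo</name></entry>"
                   + "<entry id=\"b2\"><name>bar</name></entry></entries>";
        System.out.println(evaluate(xml, "/entries/entry[@id='b2']/name")); // prints bar
    }
}
```

The whole document still ends up in memory as a DOM, so this only cuts down the code chatter and does nothing for the memory cost on big files.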
Wow, I've just visited http://asn1.elibel.tm.fr/links/ looking for these Java tools & I think I've gone cross-eyed with the sheer number of acronyms! You've gotta love something which seems to add a letter to ER & that's a new acronym (e.g. BER, DER, PER and XER). Does anyone on the list know of a good ASN.1 parser for Java, and should we support it (considering NCBI generate their DTD & XML from the ASN.1 representation)?

Andy

Mark Schreiber wrote:
> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> not XQuery although XPath is probably more important for this use.
>
> The DOM model is a direct implementation of the W3C standard which
> makes it a little awkward from a Java point of view but it is usable.
>
> Java 6 has StAX (the other one).
>
> There are a few Java APIs for parsing ASN.1 mostly developed for the
> telco industry, I've never really looked into which is best (anyone
> experienced with this?) but we could probably use one to work directly
> off NCBI ASN.1
>
> - Mark
>
> On Nov 28, 2007 10:29 PM, Andy Yates <[EMAIL PROTECTED]> wrote:
>> Hi Mark,
>>
>> Okay that sounds like a perfectly sensible way to deal with this. Is
>> this kind of parsing model supported in Java5? I only ask as I've not
>> done a lot of XML parsing with Java5; more with things like XOM (which I
>> think offers a DOM only representation but I'm probably wrong).
>>
>> That's good. There's not a huge point to have a format & a DTD/XSD and
>> then have your files not conform to it.
>>
>> I was thinking the exact same thing about ASN.1 (well that & it looks
>> bleeding horrible to parse but that is an un-educated look at the format
>> which I'm sure is as parsable as JSON & the like).
>>
>> When it comes to flat file parsers I would be happier to provide
>> implementations of the more common formats where a viable alternative is
>> not available e.g. UniProt, EMBL, Genbank etc.
>> Then groups which provide
>> similar output to the above have a chance to write their own
>> parsers/formatters. This is very similar to the current situation but we
>> just need to remove dependencies on statically located data structures
>> (don't get rid of them completely just give users an option to not use
>> them).
>>
>> I'm not sure how much automatically generated parsers would help us. I
>> guess it depends on the data model(s) we use if they are auto-parser
>> friendly (which normally means POJO/JavaBean conventions including the
>> no-args constructor).
>>
>> Cool I don't want to exclude flat file parsers completely (if only
>> because my group has an interest in BioJava being able to read & write
>> flat files) :)
>>
>> They decided to have HUPO-PSI Format instead :)
>>
>> Andy
>>
>>
>> Mark Schreiber wrote:
>>> Hi -
>>>
>>> I think in most cases huge XML files in bioinformatics result from a
>>> single XML containing multiple repetitive elements. E.g. a BLAST XML
>>> output with several hits or a GenBankXML with many Sequences. A nice
>>> approach I have seen for dealing with these is to use SAX to read over
>>> the file and every time it comes to an element it delegates to a DOM
>>> object. You then parse the bits of the DOM you want with XPath or
>>> convert to objects or something and then when you are finished with
>>> that entry everything gets garbage collected and the SAX parser moves
>>> to the next element and repeats the whole process. This is a hybrid
>>> of event based parsing and object-model based parsing which could let
>>> you efficiently deal with huge files.
>>>
>>> I think the BLAST XML has improved substantially, at least in terms of
>>> validating against its own DTD. The DTD itself may not be the best
>>> design but that is always a matter of taste and if you are using XPath
>>> to get the relevant bits you don't need to make a SAX parser jump
>>> through hoops to get them.
>>>
>>> I agree we will have to keep flat file parsers but we should strongly
>>> encourage the use of XML where possible. It is simply easier to deal
>>> with. Most biological flat-files were designed for Fortran and mainly
>>> for human consumption. There is no obvious validation mechanism.
>>> Notably everything in NCBI is derived from ASN.1, what you see in the
>>> flatfile is produced from there. I tend to think this means that the
>>> ASN.1 is the holy gospel and what you get in the flat file is some
>>> translation. Ideally NCBI files should be parsed from the ASN.1 where
>>> you can guarantee validation, the more practical alternative is to use
>>> the XML which you can at least validate against a DTD.
>>>
>>> With XML we (Biojava) can say if it validates we will parse it and if
>>> it doesn't we may not. With flat files there are so many dodgy
>>> variants we cannot say anything. Because XML DTDs (or XSDs) have
>>> versions it also makes it much easier to have parsers for different
>>> versions and the parsing machinery can figure out which is needed.
>>> With flat files it is anyone's guess what version you are dealing with.
>>>
>>> Finally parsers can be auto-generated for XML if you have the DTD or
>>> XSD. This often doesn't give you an ideal parser but it can be a
>>> useful starting point for rapid development.
>>>
>>> For Biojava v 3 I think we should concentrate on XML parsers first and
>>> flat files second. <sigh>if only Fasta had an XML format</sigh>
>>>
>>> - Mark
>>>
>>> On Nov 27, 2007 11:16 PM, Andy Yates <[EMAIL PROTECTED]> wrote:
>>>> I was always under the impression that blast's XML output was nearly as
>>>> hard to parse as the flat file format but I do agree that if we can use
>>>> XML whenever we can it would make writing parsers a lot easier
>>>> (especially if there are SAX based XPath libraries available). Actually
>>>> this brings up a good question about development of this type of parser.
>>>> The majority of XPath supporting libraries are DOM based which will mean
>>>> large memory usage in some situations but overall providing an easier
>>>> coding experience (and hopefully reduce our chances of creating bugs).
>>>> Or should we code to the edge cases of someone trying to parse a 1GB
>>>> XML? Personally I'd favour the former.
>>>>
>>>> Going back to the original topic there are going to be situations where
>>>> people want the flat file parsers/writers & I think it's a valid point
>>>> to say this is where BioJava is meant to come in & help a developer.
>>>> After all XML is a computer science problem whereas parsing an EMBL flat
>>>> file or blast output is a bioinformatics problem.
>>>>
>>>> Andy
>>>>
>>>>
>>>> Mark Schreiber wrote:
>>>>> For a long time now my feeling has been that we should *only* support
>>>>> the XML version of blast output. The other formats are too brittle to
>>>>> be easy to parse. I also feel similarly about Genbank, EMBL, etc. That
>>>>> may be an extreme view but the power of generic XML parsers and things
>>>>> like XPath etc really make these formats look very attractive.
>>>>>
>>>>> - Mark
>>>>>
>>>>>
>>>>> On Nov 27, 2007 7:47 PM, Andy Yates <[EMAIL PROTECTED]> wrote:
>>>>>> I think Groovy have adopted a similar system recently & have guidelines
>>>>>> for how each module should behave (dependencies, build system etc). This
>>>>>> enforces the idea that a module whilst not part of the core project must
>>>>>> behave in the same manner the core does. I do like the idea that we can
>>>>>> have a core biojava & things get added around it & it might encourage
>>>>>> other users to start developing their own modules for any
>>>>>> formats/purpose they want.
>>>>>>
>>>>>> Richard Holland wrote:
>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> Hash: SHA1
>>>>>>>
>>>>>>>> What format options are there from blast?
>>>>>>>> Just thinking if it supports
>>>>>>>> CIGAR or something like that are we better providing a parser for that
>>>>>>>> format & saying that we do not support the traditional blast output?
>>>>>>>> That said it doesn't help us when that format changes so maybe what is
>>>>>>>> needed is a way to push out parser changes without requiring a full
>>>>>>>> biojava release (v3 discussion) ...
>>>>>>> Exactly! So the modular idea would work nicely here - we could have a
>>>>>>> blast module and only update that single module (which would be its own
>>>>>>> JAR) whenever the format changes. In a way, BioJava releases as such
>>>>>>> would no longer happen, except maybe for some kind of core BioJava
>>>>>>> module. Everything would be done in terms of individual module+JAR
>>>>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
>>>>>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
>>>>>>>
>>>>>>> cheers,
>>>>>>> Richard
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list - [email protected]
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
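
As an aside, the hybrid approach Mark describes in the quoted thread (SAX to walk the file, a throwaway DOM per repeated element, XPath on each DOM) could look roughly like the sketch below. It assumes a non-namespace-aware parser and a made-up document; `HybridSaxDom`, `extract`, and the `results`/`hit`/`id` element names are hypothetical and not part of BioJava or any NCBI format.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class: streams with SAX, builds one small DOM per record element,
// runs an XPath expression against it, then lets that DOM be garbage collected.
final class HybridSaxDom {
    static List<String> extract(String xml, final String recordName, final String expr)
            throws Exception {
        final List<String> results = new ArrayList<String>();
        final DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        final XPath xpath = XPathFactory.newInstance().newXPath();

        DefaultHandler handler = new DefaultHandler() {
            private Document doc;  // DOM of the record being read, null outside a record
            private Node current;  // node that new children are appended to

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (doc == null) {
                    if (!qName.equals(recordName)) {
                        return; // outside a record: keep nothing in memory
                    }
                    doc = builder.newDocument();
                    current = doc;
                }
                Element e = doc.createElement(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    e.setAttribute(atts.getQName(i), atts.getValue(i));
                }
                current.appendChild(e);
                current = e;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (doc != null) {
                    current.appendChild(doc.createTextNode(new String(ch, start, length)));
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                if (doc == null) {
                    return;
                }
                current = current.getParentNode();
                if (current == doc) { // the record element itself has just closed
                    try {
                        results.add(xpath.evaluate(expr, doc));
                    } catch (XPathExpressionException ex) {
                        throw new SAXException(ex);
                    }
                    doc = null;       // drop the DOM before the next record starts
                    current = null;
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data; element names are assumptions.
        String xml = "<results><hit><id>h1</id></hit><hit><id>h2</id></hit></results>";
        System.out.println(extract(xml, "hit", "/hit/id")); // prints [h1, h2]
    }
}
```

Only one record's DOM is alive at a time, so memory use should track the largest single record rather than the size of the file. SAX may deliver an element's text in several `characters` calls; that is fine here because an element's XPath string value concatenates all of its text nodes.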
