There seem to be lots of ways to think about contigs. One nice way is as a Markov chain
(although that is more of a consensus). An alternative is to treat the contig as a
collection of sequences, plus some associated information about where the sequences
sit in the contig and what the consensus should look like. We do this with an
XML description of the contig.
I feel that all the parts needed are in BioJava, and it would be good to have a fairly
abstract Contig object that holds the information required. When the needed sequenceDB
is available, a view could be made onto the consensus (a Sequence object), onto the
Alignment, or even onto a Markov chain. When quality info is available, a
Sequence over the Phred alphabet could be produced. In this way a Contig object is not
a Sequence, an Alignment or a Markov chain, but the information in it could be used to
produce all three.
Anyone want to code that up :)
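To make the idea a bit more concrete, here is a rough sketch of a contig holder that
is not itself a sequence but can hand out a consensus view and an alignment view. It
deliberately uses plain strings rather than BioJava types; ContigSketch, PlacedRead and
the simple majority-vote consensus are all assumptions made for illustration, and the
Markov chain and Phred views are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical contig holder: placed reads only, no BioJava types. */
final class ContigSketch {

    /** A read plus the 0-based contig column of its first base. */
    static final class PlacedRead {
        final String id;
        final String bases;
        final int offset;

        PlacedRead(String id, String bases, int offset) {
            if (offset < 0) {
                throw new IllegalArgumentException("offset must be >= 0");
            }
            this.id = id;
            this.bases = bases.toLowerCase();
            this.offset = offset;
        }
    }

    private static final String BASES = "acgt";
    private final List<PlacedRead> reads = new ArrayList<>();

    void add(String id, String bases, int offset) {
        reads.add(new PlacedRead(id, bases, offset));
    }

    /** Contig length: one past the last column covered by any read. */
    int length() {
        int max = 0;
        for (PlacedRead r : reads) {
            max = Math.max(max, r.offset + r.bases.length());
        }
        return max;
    }

    /** Consensus view: majority base per column, 'n' where nothing votes. */
    String consensus() {
        int[][] counts = new int[length()][BASES.length()];
        for (PlacedRead r : reads) {
            for (int i = 0; i < r.bases.length(); i++) {
                int b = BASES.indexOf(r.bases.charAt(i));
                if (b >= 0) {
                    counts[r.offset + i][b]++;
                }
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int[] column : counts) {
            int best = -1;
            int bestCount = 0;
            for (int b = 0; b < column.length; b++) {
                if (column[b] > bestCount) { // ties go to the earlier of a, c, g, t
                    best = b;
                    bestCount = column[b];
                }
            }
            sb.append(best < 0 ? 'n' : BASES.charAt(best));
        }
        return sb.toString();
    }

    /** Alignment view: one row per read, padded with '-' to the contig length. */
    List<String> alignmentRows() {
        int len = length();
        List<String> rows = new ArrayList<>();
        for (PlacedRead r : reads) {
            StringBuilder row = new StringBuilder();
            for (int i = 0; i < len; i++) {
                int j = i - r.offset;
                row.append(j >= 0 && j < r.bases.length() ? r.bases.charAt(j) : '-');
            }
            rows.add(row.toString());
        }
        return rows;
    }
}
```

The point of the design is that the holder stores only the reads and their placements;
each view is computed on request, so nothing forces the contig to pretend to be any one
of them.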
- Mark
-----Original Message-----
From: Greg Cox [mailto:[EMAIL PROTECTED]
Sent: Tue 8/07/2003 3:19 a.m.
To: Matthew Pocock
Cc: biojava-l
Subject: RE: [Biojava-l] Re: genbank contig stuff
We looked at this a while back, and I suspect this isn't a problem BioJava can
solve.
If we treat it as a sequence, one option is to try to assemble it. If BioJava
assembles the sequence, it has to know where to get the composing sequences. This
implies some sort of database backing to parse the contig sequences, which seems a bit
excessive. If all you want is the features, we could create a dummy sequence of
ambiguous nucleotides of the proper length, and attach the features to that. At that
point though, I think it makes more sense to create a feature holder instead of
pretending it's a real sequence. Which segues into...
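For what it's worth, the dummy-sequence idea above might look roughly like this. It is
only a sketch in plain Java: PlaceholderContig and Feature are hypothetical names, not
BioJava classes, and the only behaviour shown is filling the sequence with 'n' and
range-checking features against its length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in: an all-'n' sequence that only exists to carry features. */
final class PlaceholderContig {

    static final class Feature {
        final String type;
        final int start; // 1-based, inclusive
        final int end;   // 1-based, inclusive

        Feature(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    private final String residues;
    private final List<Feature> features = new ArrayList<>();

    PlaceholderContig(int length) {
        char[] ns = new char[length];
        Arrays.fill(ns, 'n'); // ambiguous nucleotide at every position
        this.residues = new String(ns);
    }

    int length() {
        return residues.length();
    }

    String residues() {
        return residues;
    }

    List<Feature> features() {
        return features;
    }

    /** Attach a feature, rejecting anything outside 1..length. */
    void addFeature(String type, int start, int end) {
        if (start < 1 || end > length() || start > end) {
            throw new IllegalArgumentException(
                type + " " + start + ".." + end + " is outside 1.." + length());
        }
        features.add(new Feature(type, start, end));
    }
}
```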
The other option is to treat a contig as a new kind of beast, not a sequence.
I don't know what this beast would look like; it has to be a feature holder, probably
annotatable, and then what? Aesthetically I'm not sure this makes sense either; after
all, a contig sequence is still a sequence.
The ray of light is that most (all?) contigs are available in an expanded form
also. That's been enough for us to avoid grappling with this bull so far.
Greg
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] Behalf Of Matthew Pocock
Sent: Thursday, June 26, 2003 2:58 PM
To: Matthew Pocock
Cc: biojava-l
Subject: [Biojava-l] Re: genbank contig stuff
Sorry - I fired that off without thinking much.
I just downloaded the GenBank file NT_010783 from the NCBI. Our parsers
spewed lots of errors about features not being within the range 1..0,
and after a little poking around in the code, I found that a zero-length
sequence was being generated. In desperation, I looked at the
physical GenBank file. Instead of sequences, it contains a CONTIG
section with a single big join() describing how to build it from other
entries.
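As a starting point, the join() could be broken into its pieces along these lines. This
is a sketch, not our parser: ContigJoinSketch and Segment are hypothetical names, the
accessions in the usage note are made up, and it assumes the CONTIG keyword has already
been stripped and that the only pieces are accession:start..end, complement(...) around
one of those, and gap(N) with a numeric length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a CONTIG join(...) expression into segments. */
final class ContigJoinSketch {

    static final class Segment {
        final String accession;   // null for a gap
        final int start;          // 1-based inclusive; 0 for a gap
        final int end;            // 1-based inclusive; 0 for a gap
        final boolean complement; // true if wrapped in complement(...)
        final int length;

        Segment(String accession, int start, int end, boolean complement, int length) {
            this.accession = accession;
            this.start = start;
            this.end = end;
            this.complement = complement;
            this.length = length;
        }

        boolean isGap() {
            return accession == null;
        }
    }

    private static final Pattern GAP = Pattern.compile("gap\\((\\d+)\\)");
    private static final Pattern REF =
        Pattern.compile("([A-Za-z0-9_.]+):(\\d+)\\.\\.(\\d+)");

    static List<Segment> parse(String text) {
        // Continuation lines are assumed to be concatenated; drop all whitespace.
        String s = text.replaceAll("\\s+", "");
        if (!s.startsWith("join(") || !s.endsWith(")")) {
            throw new IllegalArgumentException("not a join(...) expression");
        }
        List<Segment> segments = new ArrayList<>();
        for (String piece : s.substring(5, s.length() - 1).split(",")) {
            Matcher gap = GAP.matcher(piece);
            if (gap.matches()) {
                segments.add(new Segment(null, 0, 0, false, Integer.parseInt(gap.group(1))));
                continue;
            }
            boolean complement = piece.startsWith("complement(") && piece.endsWith(")");
            String ref = complement ? piece.substring(11, piece.length() - 1) : piece;
            Matcher m = REF.matcher(ref);
            if (!m.matches()) {
                throw new IllegalArgumentException("unrecognised piece: " + piece);
            }
            int start = Integer.parseInt(m.group(2));
            int end = Integer.parseInt(m.group(3));
            segments.add(new Segment(m.group(1), start, end, complement, end - start + 1));
        }
        return segments;
    }

    /** Length of the assembled contig: the sum of all segment and gap lengths. */
    static int totalLength(List<Segment> segments) {
        int total = 0;
        for (Segment seg : segments) {
            total += seg.length;
        }
        return total;
    }
}
```

For example, parsing "join(XX000001.1:1..1000,gap(100),complement(XX000002.2:51..250))"
gives three segments. Even without fetching the referenced entries, totalLength() would
at least give the parser a non-zero length to check feature locations against.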
Has anybody modified our GenBank parser to process entries like this? To
be honest, I'm not quite sure where to start.
Matthew
_______________________________________________
Biojava-l mailing list - [EMAIL PROTECTED]
http://biojava.org/mailman/listinfo/biojava-l