Re: [Biojava-l] differences between read in sequence and stored sequence in database]

Gabrielle Doan Mon, 03 Nov 2008 06:50:20 -0800

Hi all,

I've changed the regular expression in org.biojavax.bio.seq.io.GenbankFormat from


<code>
protected static final Pattern sectp =
Pattern.compile("^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");
<\code>

to

<code>
protected static final Pattern sectp =
Pattern.compile("^(\\s{0,8}([A-Za-z]+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");
<\code>

like in BioRuby (http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb.diff?r1=0.24&r2=0.25&cvsroot=bioruby). But then features like D-loop can't be detected, so this is not the solution to my problem.

The reason for the truncation is readSection(BufferedReader br) in org.biojavax.bio.seq.io.GenbankFormat.
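The D-loop problem with the letters-only pattern can be reproduced in isolation. The following is a minimal sketch, not BioJava code: the two patterns are copied from this message, while the class name, the helper `key` and the sample feature lines are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectpDemo {
    // Original pattern, as quoted from GenbankFormat in this message.
    static final Pattern ORIGINAL = Pattern.compile(
        "^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");

    // Letters-only variant tried in this message.
    static final Pattern LETTERS_ONLY = Pattern.compile(
        "^(\\s{0,8}([A-Za-z]+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");

    /** Returns group 2 (the key), or null when the first alternative did not match. */
    static String key(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical feature lines: five spaces, the key, then padding and a location.
        String dloop  = "     D-loop          1..576";
        String source = "     source          1..16571";

        System.out.println(key(ORIGINAL, dloop));       // D-loop
        System.out.println(key(LETTERS_ONLY, dloop));   // null: the hyphen is not in [A-Za-z]
        System.out.println(key(LETTERS_ONLY, source));  // source
    }
}
```

Any key containing a character outside A-Z and a-z fails the same way. The readSection excerpt that follows is where the truncation itself comes from.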


<snip>

if (line==null || line.length()==0 || (!line.startsWith(" ") && linecount++>0)) {

                    // dump out last part of section
                    section.add(new String[]{currKey,currVal.toString()});
                    br.reset();
                    done = true;
<\snip>

The condition in the if-clause ends the section at any line (after the section's first) that doesn't begin with whitespace, so this line will be read:


<snip>

 99999961 cccgcccaca cccctcggcc ctgccctctg gccatacagg ttctcggtgg tgttgaagag

<\snip>

and this line won't be read:
<snip>
100000021 gtcctcgggc tccggcttgg tgctcacgca cacaggaaag tcagcttctc ctgggagggc
<\snip>
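The cutoff can be checked with a little arithmetic. The sketch below assumes that the position number is right-justified in a nine-character field and that each sequence line holds 60 bases; both are inferred from the two lines quoted above, and the class and method names are hypothetical.

```java
class OriginLineDemo {
    static final int BASES_PER_LINE = 60; // assumption: six blocks of ten bases per line

    /** Formats the line prefix with the position right-justified in 9 columns (assumed layout). */
    static String prefix(long position) {
        return String.format("%9d ", position);
    }

    /** Returns the position of the last base on a line that starts at the given position. */
    static long lastBaseOnLine(long position) {
        return position + BASES_PER_LINE - 1;
    }

    public static void main(String[] args) {
        long lastPadded = 99999961L;   // eight digits, so one leading space remains
        long firstUnpadded = 100000021L; // nine digits, so no leading space

        System.out.println(prefix(lastPadded).startsWith(" "));    // true
        System.out.println(prefix(firstUnpadded).startsWith(" ")); // false
        System.out.println(lastBaseOnLine(lastPadded));            // 100000020
    }
}
```

Under these assumptions the last line that still starts with a space ends at base 100,000,020, which is the length stored for the long chromosomes.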

If you change the if-statement to this:

<snip>

String firstSecKey = section.size() == 0 ? "" : ((String[])section.get(0))[0];

if (line==null || line.length()==0 || (!line.startsWith(" ") && linecount++>0 && (!firstSecKey.equals(START_SEQUENCE_TAG) || line.startsWith(END_SEQUENCE_TAG))))

<\snip>

you can add the whole sequence to the database without truncation.
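The effect of the two conditions can be compared on a handful of lines. This is a simplified model, not the actual readSection: it takes the first key of the section from the first line read, and it assumes START_SEQUENCE_TAG is "ORIGIN" and END_SEQUENCE_TAG is "//" (the message names the constants but not their values). The class name and the shortened sequence lines are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SectionEndDemo {
    // Assumed values; only the constant names appear in the message.
    static final String START_SEQUENCE_TAG = "ORIGIN";
    static final String END_SEQUENCE_TAG = "//";

    // Shortened, made-up sequence section that crosses the nine-digit boundary.
    static final List<String> LINES = Arrays.asList(
        "ORIGIN",
        "        1 gatcacaggt",
        " 99999961 cccgcccaca",
        "100000021 gtcctcgggc",
        "100000081 acgtacgtac",
        "//");

    /** Returns the lines consumed as one section under the old or the patched stop condition. */
    static List<String> readSection(List<String> lines, boolean patched) {
        List<String> consumed = new ArrayList<>();
        String firstSecKey = "";
        int linecount = 0;
        for (String line : lines) {
            boolean stop;
            if (line == null || line.length() == 0) {
                stop = true;
            } else if (!line.startsWith(" ") && linecount++ > 0) {
                // Old condition stops here; the patched one stops only outside
                // the sequence section or at the end-of-sequence tag.
                stop = !patched
                    || !firstSecKey.equals(START_SEQUENCE_TAG)
                    || line.startsWith(END_SEQUENCE_TAG);
            } else {
                stop = false;
            }
            if (stop) {
                break;
            }
            if (consumed.isEmpty()) {
                firstSecKey = line.trim(); // simplification: first line supplies the key
            }
            consumed.add(line);
        }
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println(readSection(LINES, false).size()); // 3: stops at 100000021
        System.out.println(readSection(LINES, true).size());  // 5: stops at "//"
    }
}
```

With the old condition the model stops at the first nine-digit position; with the patched one it keeps reading until the end tag.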

I have attached GenbankFormat.java to this mail. Since I'm not a BioJava specialist, could somebody check the method for me and commit it?


Cheers,
Gabrielle






Richard Holland wrote:

Hello.

Sorry for the delayed reply - I've been away on business all week.

The similar Ruby issue (and solution) is discussed here:

http://portal.open-bio.org/pipermail/bioruby/2004-March.txt

How did you parse the files in the first place? Did you use the new
GenBank parsers (BJX), or the older ones? This will help indicate
where the problem lies - the data will have been truncated at the
point it was parsed from file, so the data in your database will
reflect this and you'll have to reload it once the appropriate parser
has been fixed.

If it was the newer BJX parser, then the problem most probably lies in
this regex from org.biojavax.bio.seq.io.GenbankFormat, which can
probably be fixed in a similar manner to the Ruby equivalent discussed
in the posting above:

    protected static final Pattern sectp =
Pattern.compile("^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");

Could someone volunteer to develop and test a fix? If you come up with
something, please commit it to the SVN trunk.

cheers,
Richard


2008/10/28 Gabrielle Doan <[EMAIL PROTECTED]>:

Hi all,
concerning the problem described below, I have found out that this problem
also occurred in BioRuby and was fixed in 2004.
See:
http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb?cvsroot=bioruby
Unfortunately I'm clueless about BioRuby. Does anybody recognize this
problem or understand how it was solved in BioRuby?

I am grateful for any hints.

Cheers,

Gabrielle


-------- Original Message --------
Subject: [Biojava-l] differences between read in sequence and stored
sequence in database
Date: Mon, 27 Oct 2008 13:57:03 +0100
From: Gabrielle Doan <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]

Hi all,

I have a BioSQL database which contains all human chromosomes. For my
current project I have to query for a part of a sequence.
As far as I know I can get the whole sequence from the entry
Biosequence.Seq in the BioSQL schema. So I've made this query:

SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs;

But this query didn't yield the desired string, because the length of
this biosequence is only 100,000,020 bp. I am very confused about why I get
such a discrepancy. I added all chromosomes to the database with the
built-in BioJava method addRichSequence(RichSequence seq).
From my raw data I know that this sequence should have a length of
140,279,252 bp. So where is the remaining part of my sequence? I have
observed these discrepancies on all chromosomes which are longer than
100,000,020 bp.
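A side note on the query, separate from the truncation: in SQL dialects such as MySQL and PostgreSQL the third argument of SUBSTRING is a length, not an end position. If the second number in the query was meant as an end coordinate, which is an assumption here, the arguments could be derived as in this sketch. The class and method names are hypothetical; the table and column names are the ones used in the query.

```java
class SubstringArgs {
    /** Converts a 1-based inclusive range into the {start, length} pair that SUBSTRING expects. */
    static long[] toStartAndLength(long start, long end) {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("expected 1 <= start <= end");
        }
        return new long[] {start, end - start + 1};
    }

    /** Builds the query text for the given 1-based inclusive range. */
    static String query(long start, long end) {
        long[] args = toStartAndLength(start, end);
        return "SELECT SUBSTRING(bs.seq, " + args[0] + ", " + args[1] + ") FROM biosequence bs";
    }

    public static void main(String[] args) {
        // Range taken from the query in the message, read as start..end.
        System.out.println(query(131615042L, 131626262L));
    }
}
```

Read that way, the region is 11,221 bp long. In practice the query would also need a WHERE clause to select a single bioentry.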

Here is an excerpt from my database:
bioentry_id     description     length
2       Homo sapiens mitochondrion, complete genome.    16571
3       Homo sapiens chromosome Y, reference assembly, complete sequence.    57772954
4       Homo sapiens chromosome X, reference assembly, complete sequence.    100000020
5       Homo sapiens chromosome 22, reference assembly, complete sequence.    49691432
6       Homo sapiens chromosome 21, reference assembly, complete sequence.    46944323
7       Homo sapiens chromosome 20, reference assembly, complete sequence.    25960004
8       Homo sapiens chromosome 9, reference assembly, complete sequence.    100000020
9       Homo sapiens chromosome 7, reference assembly, complete sequence.    100000020

Sequences smaller than 100,000,020 bp are correctly stored under
Biosequence.seq.

I am grateful for any hints that explain the behaviour of my database.

Cheers,

Gabrielle
_______________________________________________
Biojava-l mailing list  -  Biojava-l@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l
