Dear all,

Some time ago, I reported on this list the same problem with
megablast that Sarah mentioned. I eventually analysed my data another
way, but my conclusion was similar to Sarah's: most of the time there
is a shift of « -1 » between the GI number in the output and the
parameters in the other columns.
One possible explanation was a change in the GI numbers between
versions of the database used. However, I'm not so sure that this
explanation is the right one, and I'm not so sure about the
consistency of this « GI-1 » rule.

I ran a simple test using a known, well-identified sequence taken at
random from GenBank (GI 2924630 in NCBI, AB002412.1, Elephas maximus
mitochondrial DNA for cytochrome b, 1137 bp, called TEST-SEQ below).
Using this sequence as the query for megablast in Galaxy gave the
following results (first 3 lines):

TEST-SEQ 2924736 1137 100.00 1137 0 0 1 1137 1 1137 0.0 2254.0
TEST-SEQ 155573765 16831 99.91 1137 1 0 1 1137 14149 15285 0.0 2246.0
TEST-SEQ 2924608 1137 99.74 1137 3 0 1 1137 1 1137 0.0 2230.0

As you can see, the parameters for the first match seem correct:
1137 bp and 100% identity. However, the GI number is not the right
one (2924736 instead of 2924630), and there is no « GI-1 » rule
here. I used the Fetch Taxonomic Representation command available in
Galaxy to fetch the taxonomy for the 3 sequences above, and 2 out of 3
GI numbers correspond to very distant taxa…
For comparison, I ran a megablast at NCBI (so the database is more
recent); here are the results:

TEST-SEQ gi|2924630|dbj|AB002412.1| 100.00 1137 0 0 1 1137 1 1137 0.0 2100
TEST-SEQ gi|37496447|emb|AJ428946.1| 99.91 1137 1 0 1 1137 14149 15285 0.0 2095
TEST-SEQ gi|2924606|dbj|D50844.1| 99.74 1137 3 0 1 1137 1 1137 0.0 2084
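
To make the comparison between the two tables explicit, here is a minimal Python sketch. It assumes that the rows pair up by rank (first Galaxy hit with first NCBI hit, and so on) and that the Galaxy table has 13 columns (subject length inserted as column 3) while the NCBI table has the usual 12. The function and field names are my own, for illustration only.

```python
# Rows copied from the two result tables in this message.
GALAXY_ROWS = """\
TEST-SEQ 2924736 1137 100.00 1137 0 0 1 1137 1 1137 0.0 2254.0
TEST-SEQ 155573765 16831 99.91 1137 1 0 1 1137 14149 15285 0.0 2246.0
TEST-SEQ 2924608 1137 99.74 1137 3 0 1 1137 1 1137 0.0 2230.0
"""

NCBI_ROWS = """\
TEST-SEQ gi|2924630|dbj|AB002412.1| 100.00 1137 0 0 1 1137 1 1137 0.0 2100
TEST-SEQ gi|37496447|emb|AJ428946.1| 99.91 1137 1 0 1 1137 14149 15285 0.0 2095
TEST-SEQ gi|2924606|dbj|D50844.1| 99.74 1137 3 0 1 1137 1 1137 0.0 2084
"""

# Alignment columns shared by both formats (names are illustrative).
ALIGN_FIELDS = ("pident", "length", "mismatch", "gapopen",
                "qstart", "qend", "sstart", "send")


def parse_galaxy(line):
    """Parse one 13-column Galaxy row: query, GI, subject length, then the rest."""
    cols = line.split()
    row = dict(zip(ALIGN_FIELDS, cols[3:11]))
    row["gi"] = int(cols[1])
    return row


def parse_ncbi(line):
    """Parse one 12-column NCBI row; the GI is the second '|'-separated token."""
    cols = line.split()
    row = dict(zip(ALIGN_FIELDS, cols[2:10]))
    row["gi"] = int(cols[1].split("|")[1])
    return row


def compare(galaxy_text, ncbi_text):
    """Pair rows by rank (an assumption) and report the GI offset per pair."""
    report = []
    for g_line, n_line in zip(galaxy_text.splitlines(), ncbi_text.splitlines()):
        g, n = parse_galaxy(g_line), parse_ncbi(n_line)
        same_alignment = all(g[f] == n[f] for f in ALIGN_FIELDS)
        report.append((g["gi"], n["gi"], g["gi"] - n["gi"], same_alignment))
    return report


for galaxy_gi, ncbi_gi, offset, same in compare(GALAXY_ROWS, NCBI_ROWS):
    print(galaxy_gi, ncbi_gi, "offset:", offset, "alignment fields equal:", same)
```

For these three rows, the alignment columns (identity, length, mismatches, gaps, coordinates) are identical between the two tables, while the GI offsets are all different from each other and none is 1.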

At this stage, I would say that the table output by megablast in
Galaxy is correct for all parameters except the GI of the database hit
(column 2)… and I am not able to say why. I'm not sure whether this
helps… I hope that you will be able to clarify this possible problem
in the megablast output.
Best,

Sandrine


2012/4/24 Sarah Hicks <garlicsc...@gmail.com>:
> So, Jen, I'm not sure if we're talking about the same ID change... I
> am under the impression that GenBank does not change its GI numbers
> for its entries. Plus, it's now looking like the sequence length
> output for each hit through Galaxy's megablast does not match
> the GI number output given by Galaxy megablast, but the GI number
> before it. Because the "-1" rule is so consistent, it makes this seem
> less and less like it has to do with NCBI changing its GI numbers to
> make room for new entries or something. In other words, there is a
> shift, as if a 1 was added to each NCBI GI number in Galaxy before
> Galaxy produces the output file.
>
> I need someone to tell me if I can trust the output. Basically I see
> it this way. Every hit row from Galaxy megablast actually has
> information for two NCBI entries: the one that shares the GI output
> and the one before it that shares the sequence length output. Which
> one is the hit I should be using?
> Because on some occasions, the NCBI entry that shares the GI output
> from Galaxy is VERY distantly related to the NCBI entry that shares
> the subject sequence length output from Galaxy, and I don't know which
> to pick. Is this problem well understood yet?
>
>
> On Tue, Apr 24, 2012 at 10:52 AM, Peter Cock <p.j.a.c...@googlemail.com> wrote:
>> On Mon, Apr 23, 2012 at 11:41 PM, Sarah Hicks <garlicsc...@gmail.com> wrote:
>>> Peter, you requested an example, here are the first five hits for my
>>> first query sequence (OTU#0)
>>>
>>> 0  324034994  527   93.23  266  13  5  1  265  22  283  7e-102  379.0
>>> 0  56181650   513   93.26  267  10  8  1  265  25  285  7e-102  379.0
>>> 0  314913953  582   91.79  268  13  9  1  265  24  285  2e-92   347.0
>>> 0  305670062  281   92.52  254  14  5  4  256  32  281  2e-92   347.0
>>> 0  310814066  1180  91.73  266  14  7  1  265  24  282  9e-92   345.0
>>>
>>> You will notice there are 13 columns, one in addition to the 12 column
>>> titles you explained. This is because there is a column between sseqID
>>> and pident.
>>
>> I see now - the megablast_wrapper.py calls megablast (from the old legacy
>> NCBI blast suite) which does indeed produce 12 column tabular output. But
>> the wrapper script then edits the output:
>>
>> It appears to split column 2 in two at the underscore, with the
>> intention of giving the match ID and the length. This puzzles me,
>> but I haven't used the legacy BLAST tabular output for a while. On
>> BLAST+ you can ask for the query or subject length explicitly as
>> their own columns, so we don't have this problem.
>>
>> The megablast_wrapper.py also re-formats the floating point score in the
>> last column, apparently the NCBI style could cause problems with the
>> Galaxy filter tool.
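
To illustrate the post-processing Peter describes, here is a rough Python sketch. It is not the actual megablast_wrapper.py code: the function name is hypothetical, and I am assuming that the subject ID in the raw 12-column output has the form "<GI>_<length>" and that the score simply needs to be rewritten as a plain float. The example input row is reconstructed from Sarah's first hit under that assumption.

```python
def split_subject_column(line):
    """Turn one 12-column legacy row into the 13-column form seen in Galaxy.

    Hypothetical reimplementation, based only on the description above.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 12:
        raise ValueError("expected 12 tab-separated columns, got %d" % len(cols))

    # Assumption: the subject ID looks like "<GI>_<length>".
    gi, sep, subject_length = cols[1].partition("_")
    if not sep:
        raise ValueError("no underscore in subject ID: %r" % cols[1])

    # Rewrite the bit score as a plain float (e.g. "379" becomes "379.0").
    score = str(float(cols[11]))

    return "\t".join([cols[0], gi, subject_length] + cols[2:11] + [score])


# Reconstructed raw row (an assumption, not taken from real megablast output).
raw = "0\t324034994_527\t93.23\t266\t13\t5\t1\t265\t22\t283\t7e-102\t379"
print(split_subject_column(raw))
```

If the wrapper really works this way, the GI and the subject length both come from a single ID string stored in the database, so a mismatch between them would more likely originate in how that database was built than in the alignment itself. This is only a guess on my part.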
>>
>>> In the metagenomic tutorial the first 4 columns are
>>> explained, and column 3 is described as length of sequence in database
>>> (or length of the subject sequence).
>>>
>>> This is the problem column. The length of only one of the subject GI
>>> numbers above matches the subject length in NCBI. This has caused me to
>>> wonder if I can trust the hit info. In all cases that I've checked,
>>> when this happens the correct match is the listed GI value minus 1
>>> (i.e., in NCBI, gi|324034994 is not 527 nt long, but 324034993 IS 527 nt
>>> long).
>>
>> That is strange.
>>
>> Peter
>
> ___________________________________________________________
> The Galaxy User list should be used for the discussion of
> Galaxy analysis and other features on the public server
> at usegalaxy.org.  Please keep all replies on the list by
> using "reply all" in your mail client.  For discussion of
> local Galaxy instances and the Galaxy source code, please
> use the Galaxy Development list:
>
>  http://lists.bx.psu.edu/listinfo/galaxy-dev
>
> To manage your subscriptions to this and other Galaxy lists,
> please use the interface at:
>
>  http://lists.bx.psu.edu/
