Re: [galaxy-user] MegaBLAST output

2012-04-25 Thread Sandrine Hughes
Dear all,

Some time ago, I reported on this list the same problem with
megablast that Sarah mentioned. I finally used another way to analyse
my data, but my conclusion was similar to Sarah's: most of the
time there is a shift of « -1 » between the GI number in the output and
the parameters in the other columns.
One possible explanation was a change in the GI numbers
between versions of the databank. However, I’m not so
sure that this explanation is the right one, and I’m not so sure about
the consistency of this « GI-1 » rule.

To check, I ran a simple test with a known, well-identified
sequence taken at random from GenBank (GI 2924630 in NCBI, AB002412.1,
Elephas maximus mitochondrial DNA for cytochrome b, 1137 bp, called
TEST-SEQ below). Using this sequence as the query for megablast in
Galaxy gave the following results (first 3 lines):

TEST-SEQ 2924736 1137 100.00 1137 0 0 1 1137 1 1137 0.0 2254.0
TEST-SEQ 155573765 16831 99.91 1137 1 0 1 1137 14149 15285 0.0 2246.0
TEST-SEQ 2924608 1137 99.74 1137 3 0 1 1137 1 1137 0.0 2230.0

As you can see, the parameters for the first match seem correct:
1137 bp and 100% identity. However, the GI number is not the correct
one (2924736 instead of 2924630), and the « GI-1 » rule does not hold
here. I used the Fetch Taxonomic Representation command available in
Galaxy to fetch the taxonomy for the 3 sequences above, and 2 out of 3 GI
numbers correspond to very distant taxa…
I also ran a megablast at NCBI (where the database is more recent) to
compare, and here are the results:

TEST-SEQ gi|2924630|dbj|AB002412.1| 100.00 1137 0 0 1 1137 1 1137 0.0 2100
TEST-SEQ gi|37496447|emb|AJ428946.1| 99.91 1137 1 0 1 1137 14149 15285 0.0 2095
TEST-SEQ gi|2924606|dbj|D50844.1| 99.74 1137 3 0 1 1137 1 1137 0.0 2084
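
The two tables above can be compared column by column. The following is only a minimal sketch, not part of Galaxy or BLAST: the function names are hypothetical, and it assumes that the two searches list their hits in the same order, which may not hold in general because the databases are different versions. It checks whether the alignment columns (identity through subject end) agree and reports the difference between the GI numbers.

```python
# Rows copied from the two tables above (Galaxy first, NCBI second).
GALAXY_ROWS = [
    "TEST-SEQ 2924736 1137 100.00 1137 0 0 1 1137 1 1137 0.0 2254.0",
    "TEST-SEQ 155573765 16831 99.91 1137 1 0 1 1137 14149 15285 0.0 2246.0",
    "TEST-SEQ 2924608 1137 99.74 1137 3 0 1 1137 1 1137 0.0 2230.0",
]
NCBI_ROWS = [
    "TEST-SEQ gi|2924630|dbj|AB002412.1| 100.00 1137 0 0 1 1137 1 1137 0.0 2100",
    "TEST-SEQ gi|37496447|emb|AJ428946.1| 99.91 1137 1 0 1 1137 14149 15285 0.0 2095",
    "TEST-SEQ gi|2924606|dbj|D50844.1| 99.74 1137 3 0 1 1137 1 1137 0.0 2084",
]


def parse_galaxy_row(line):
    """Galaxy row: 13 columns, with the subject length in column 3."""
    fields = line.split()
    if len(fields) != 13:
        raise ValueError("expected 13 columns, got %d" % len(fields))
    # Columns 4-11: identity, length, mismatches, gaps, qstart, qend, sstart, send
    return int(fields[1]), tuple(fields[3:11])


def parse_ncbi_row(line):
    """NCBI row: 12 columns, subject ID of the form gi|<GI>|<db>|<accession>|."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError("expected 12 columns, got %d" % len(fields))
    gi = int(fields[1].split("|")[1])
    # Columns 3-10 hold the same eight alignment values as above
    return gi, tuple(fields[2:10])


def compare_hits(galaxy_rows, ncbi_rows):
    """Pair rows by rank (an assumption) and compare GI and alignment columns."""
    report = []
    for g_line, n_line in zip(galaxy_rows, ncbi_rows):
        g_gi, g_aln = parse_galaxy_row(g_line)
        n_gi, n_aln = parse_ncbi_row(n_line)
        report.append((g_gi, n_gi, g_aln == n_aln, g_gi - n_gi))
    return report


report = compare_hits(GALAXY_ROWS, NCBI_ROWS)
for galaxy_gi, ncbi_gi, same_alignment, offset in report:
    print(galaxy_gi, ncbi_gi, same_alignment, offset)
```

For these three rows the alignment columns agree in every case while the GI numbers differ, and the differences are not a constant 1.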

At this point, I would say that the output table of the
megablast in Galaxy is correct for all parameters except the GI of
the database hit (column 2)… and I cannot say why. I’m not sure
whether this helps… I hope that you will be able to clarify this
possible problem in the megablast output.
Best,

Sandrine


2012/4/24 Sarah Hicks garlicsc...@gmail.com:
 So, Jen, I'm not sure if we're talking about the same ID change... I
 am under the impression that GenBank does not change its GI numbers
 for its entries. Plus, it now looks like the sequence length
 reported for each hit by Galaxy's megablast does not match
 the GI number reported by Galaxy's megablast, but the GI number
 before it. Because the -1 rule is so consistent, this seems
 less and less likely to be caused by NCBI changing its GI numbers to
 make room for new entries or something. In other words, there is a
 shift, as if 1 were added to each NCBI GI number in Galaxy before
 Galaxy produces the output file.

 I need someone to tell me if I can trust the output. Basically, I see
 it this way: every hit row from Galaxy megablast actually has
 information for two NCBI entries, the one that matches the GI output
 and the one before it, which matches the sequence length output. Which
 one is the hit I should be using?
 On some occasions, the NCBI entry that matches the GI output
 from Galaxy is VERY distantly related to the NCBI entry that matches
 the subject sequence length output from Galaxy, and I don't know which
 to pick. Is this problem well understood yet?


 On Tue, Apr 24, 2012 at 10:52 AM, Peter Cock p.j.a.c...@googlemail.com wrote:
 On Mon, Apr 23, 2012 at 11:41 PM, Sarah Hicks garlicsc...@gmail.com wrote:
 Peter, you requested an example, here are the first five hits for my
 first query sequence (OTU#0)

 0  324034994  527   93.23  266  13  5  1  265  22  283  7e-102  379.0
 0  56181650   513   93.26  267  10  8  1  265  25  285  7e-102  379.0
 0  314913953  582   91.79  268  13  9  1  265  24  285  2e-92   347.0
 0  305670062  281   92.52  254  14  5  4  256  32  281  2e-92   347.0
 0  310814066  1180  91.73  266  14  7  1  265  24  282  9e-92   345.0

 You will notice there are 13 columns, one in addition to the 12 column
 titles you explained. This is because there is a column between sseqID
 and pident.

 I see now - the megablast_wrapper.py calls megablast (from the old legacy
 NCBI blast suite) which does indeed produce 12 column tabular output. But
 the wrapper script then edits the output:

 It appears to split column 2 in two at the underscore, which is intended to
 give the match ID and the length. This puzzles me but I haven't used
 the legacy BLAST tabular output for a while. On BLAST+ you can ask
 for the query or subject length explicitly as their own columns so we
 don't have this problem.
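
 The column split described above might look roughly like the sketch below. This is not the actual megablast_wrapper.py code: the function name is hypothetical, and it assumes that the subject ID in the 12-column legacy output has the form "<GI>_<length>", which is inferred from the description and not confirmed here. The example values are taken from the first row of Sarah's table.

```python
def split_subject_column(line):
    """Turn a 12-column tabular line into 13 columns by splitting column 2.

    Assumption: column 2 looks like "<GI>_<length>"; it is split at the
    last underscore into the match ID and the subject length.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError("expected 12 columns, got %d" % len(fields))
    subject_id, sep, subject_len = fields[1].rpartition("_")
    if not sep or not subject_len.isdigit():
        raise ValueError("column 2 is not of the form <id>_<length>: %r" % fields[1])
    return fields[:1] + [subject_id, subject_len] + fields[2:]


# Hypothetical raw line, built from the values in the first hit row above.
raw = "\t".join(["0", "324034994_527", "93.23", "266", "13", "5",
                 "1", "265", "22", "283", "7e-102", "379.0"])
row = split_subject_column(raw)
print(len(row), row[1], row[2])
```

 If the ID and the length in column 2 were paired incorrectly when the database was built, a split like this would carry the mismatch straight into the output, but that is only a guess.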

 The megablast_wrapper.py also re-formats the floating point score in the
 last column, apparently the 

Re: [galaxy-user] Editing workflows : Server error ?

2012-04-16 Thread Sandrine Hughes
Hi Jen,

Thanks a lot for your help. I will let you know if I have problems
after the update.

All the best,

Sandrine


2012/4/16 Jennifer Jackson j...@bx.psu.edu:
 Hi Sandrine,

 Thanks for sending the workflow link. A correction to this specific problem
 has been made (changeset 7038:1fdcce63a06f at
 https://bitbucket.org/galaxy/galaxy-central). This will reach the public
 main instance at the next update (likely within a week or so).

 The version of Galaxy running on the public instance can be found on the
 home page (http://usegalaxy.org) under the quickie scroll. Right now it is:
 Galaxy build: $Rev 6986:4527ed1eb175$.

 Please note that workflows can cause errors when they contain a tool/tool
 version that is not active on the Galaxy instance being used (this isn't
 new, but it seemed worth mentioning). This is an open development area for
 us. For now, use the available tool set to avoid the issue.

 We appreciate the feedback! Please let us know if you have problems after
 the update (any rev at or over 7038),


 Best,

 Jen
 Galaxy team

 On 4/13/12 10:50 AM, Sandrine Hughes wrote:

 Dear all,

 I haven't used Galaxy for some weeks. When I tried to edit some of
 my workflows today, I received a "Server error" message instead.
 Has there recently been a problem with the workflow editor, or with
 editing big workflows (40 steps)? It seems that I can only edit
 workflows with a limited number of steps.

 Thanks a lot,

 Sandrine
 ___
 The Galaxy User list should be used for the discussion of
 Galaxy analysis and other features on the public server
 at usegalaxy.org.  Please keep all replies on the list by
 using reply all in your mail client.  For discussion of
 local Galaxy instances and the Galaxy source code, please
 use the Galaxy Development list:

   http://lists.bx.psu.edu/listinfo/galaxy-dev

 To manage your subscriptions to this and other Galaxy lists,
 please use the interface at:

   http://lists.bx.psu.edu/


 --
 Jennifer Jackson
 http://galaxyproject.org



[galaxy-user] GI error/shift in the output of megablast ?

2011-12-14 Thread Sandrine Hughes
Dear all,

I am having trouble with the Megablast program available in NGS Mapping
and I hope that you can help. I think that there might be a
problem with the output table, namely a shift between
the GI numbers and the associated parameters.

Here are the details:

I. First, what I did:
   I used the program to identify the species present in a mix of
sequences, with the following options:
   Database nt 27-Jun-2011
   Word size 16
   Identity 90.0
   Cutoff 0.001
   Filter out low complexity regions Yes
   I ran the analyses twice and obtained exactly the same results (I
used the online version of Galaxy, not a local one).

II. Second, I analysed the data obtained for one of my sequences
(1-202). The following lines are the beginning of the table that I
obtained after the megablast, followed by two problematic lines:

1-202  312182292  484     99.33  150  1  0  1  150  1      150    2e-75  289.0
1-202  312182201  476     99.33  150  1  0  1  150  1      150    2e-75  289.0
1-202  308228725  928     99.33  150  1  0  1  150  19     168    2e-75  289.0
1-202  308228711  938     99.33  150  1  0  1  150  22     171    2e-75  289.0
1-202  308197083  459     99.33  150  1  0  1  150  10     159    2e-75  289.0
1-202  300392378  920     99.33  150  1  0  1  150  10     159    2e-75  289.0
1-202  300392376  918     99.33  150  1  0  1  150  9      158    2e-75  289.0
1-202  300392375  922     99.33  150  1  0  1  150  11     160    2e-75  289.0
1-202  300392374  931     99.33  150  1  0  1  150  21     170    2e-75  289.0
1-202  300392373  909     99.33  150  1  0  1  150  21     170    2e-75  289.0
1-202  300392371  1172    99.33  150  1  0  1  150  9      158    2e-75  289.0
...
1-202  179366399  151762  98.67  150  2  0  1  150  46880  47029  6e-73  281.0
1-202  58617849511        98.67  150  2  0  1  150  21     170    6e-73  281.0


III. Third, what I noticed:
   My first problem was that among all the species identified, two
were very different from the expected ones (the last 2 lines). So I
decided to check whether that was plausible for this sequence and
independently ran a megablast at NCBI with similar options.
I was not able to find these two species in the results.
   So, I decided to check the hits listed in the table above and
found a second problem. In the table, the second column gives the
GI of the database hit and the third column gives the length of the
database hit. However, when I manually checked the length of
the GI in NCBI, it did not match. For GI 312182292, the
length should be 580 and not 484.
   By checking different lines, I noticed that the length
given for a GI corresponds to the length of GI-1. As you can see
in the above table, some GIs are consecutive (300392376,
300392375,...). When I checked the length of 300392376 in NCBI, I
should have found 920. But when I checked 300392375, I found 918. And this
was true for the following lines: 300392374 normally gives 922 and
300392373 gives 931... My conclusion at that point was that there is a
shift of -1 between the GI and the other parameters of the line
(indeed, the parameters in the remaining columns agree with
the length of GI-1). However, that’s not always true. For some
GIs given in the table (for example, the last two lines), if we check
the parameters of GI-1, they are completely different...
So, I suppose that there is a problem in the GI sorting during the
megablast, but I’m not able to define the problem clearly.
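
The « GI-1 » check described above can be written down as a small sketch. It is only an illustration: the names are hypothetical, and the lengths "looked up" are limited to the values quoted in this message (no query is sent to NCBI). A reported length is classified as consistent with its own GI, as matching GI-1, or as unresolved when neither can be confirmed from the available lookups.

```python
# Subject lengths as reported in column 3 of the Galaxy table above.
REPORTED = {
    312182292: 484,
    300392376: 918,
    300392375: 922,
    300392374: 931,
}

# Lengths found by checking the GIs manually in NCBI (values quoted above).
LOOKED_UP = {
    312182292: 580,
    300392375: 918,
    300392374: 922,
    300392373: 931,
}


def classify(gi, reported_length, looked_up):
    """Say whether the reported length fits the GI itself or GI-1."""
    if looked_up.get(gi) == reported_length:
        return "consistent"
    if looked_up.get(gi - 1) == reported_length:
        return "shifted by -1"
    # Either GI-1 was not looked up or its length differs as well.
    return "unresolved"


results = {gi: classify(gi, length, LOOKED_UP) for gi, length in REPORTED.items()}
for gi, verdict in results.items():
    print(gi, verdict)
```

With the values quoted here, the three consecutive GIs come out as shifted by -1, while 312182292 stays unresolved because the length of GI 312182291 is not given.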

IV. Fourth, confirmed with another dataset
   To be sure that the problem was not linked to my data or
my process, I asked a colleague to run a megablast on independent data.
The conclusions were similar to mine: a shift between the GI given in the
table and the parameters on the same line, which most of the
time, but not always, correspond to GI-1.

Can you confirm that there is a problem with the output of the
megablast available in Galaxy? If so, do you think there is a way to
fix it?

Thanks a lot,

Sandrine

