Hi Scott,
There isn't a specific tool to do this filtering in one step, but tools
similar to those used in the Windshield analysis can be used again.
Starting with the "Parse blast XML output" results (this tool is on the
Galaxy main server), calculate percent coverage (of the query) and
percent identity using "Text Manipulation -> Compute" on that output.
Then, once you have the query, percent identity, and percent coverage,
the data can be filtered any way that you would like using tools in
"Text Manipulation", "Filter and Sort", and "Join, Subtract and Group".
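Outside of Galaxy, the two "Compute" expressions described above can be sketched in Python. This is only an illustration: the column order (query ID, hit ID, identities, alignment length, query start, query end, query length) and the function name add_percentages are assumptions, not the actual layout of the "Parse blast XML output" result, so adjust the unpacking to match your file.

```python
def add_percentages(rows):
    """Return (query_id, hit_id, pct_identity, pct_coverage) for each alignment.

    Each input row is assumed (hypothetical layout) to be:
    (query_id, hit_id, identities, align_len, q_start, q_end, q_len)
    """
    out = []
    for query_id, hit_id, identities, align_len, q_start, q_end, q_len in rows:
        # Percent identity: identical positions over the alignment length.
        pct_identity = 100.0 * identities / align_len
        # Percent coverage of the query: aligned query span over query length.
        pct_coverage = 100.0 * (abs(q_end - q_start) + 1) / q_len
        out.append((query_id, hit_id, round(pct_identity, 2), round(pct_coverage, 2)))
    return out


rows = [
    ("read1", "hitA", 99, 100, 1, 100, 100),
    ("read2", "hitB", 45, 50, 11, 60, 100),
]
result = add_percentages(rows)
```

In Galaxy itself, the same arithmetic would be entered as column expressions in the Compute tool, one run per new column.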
You will likely want to start with a "Filter and Sort -> Select" step to
subset the data to be only those alignments that you consider part of
your conserved criteria (for example: >99% identity and >90% coverage).
On that result, count up the occurrence of each query identifier using
"Join, Subtract and Group -> Group". Next, use "Select" again to isolate
only those identifiers with the frequency (4?) that you choose as part
of your conserved criteria. This result will be your list of identifiers
for conserved sequences.
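The Select, Group, Select sequence above amounts to the following logic, shown here as a rough Python sketch. The row layout (query ID, hit ID, percent identity, percent coverage), the function name conserved_queries, and the default thresholds are assumptions taken from the example criteria in this thread; "more than 4 hits" is read as a strict comparison.

```python
from collections import Counter


def conserved_queries(rows, min_identity=99.0, min_coverage=90.0, max_hits=4):
    """Return the set of query IDs with more than max_hits qualifying hits.

    Each row is assumed (hypothetical layout) to be:
    (query_id, hit_id, pct_identity, pct_coverage)
    """
    # Step 1 ("Select"): keep only alignments passing both thresholds.
    # Step 2 ("Group"): count the surviving alignments per query ID.
    counts = Counter(
        query_id
        for query_id, _hit_id, pct_identity, pct_coverage in rows
        if pct_identity > min_identity and pct_coverage > min_coverage
    )
    # Step 3 ("Select" again): keep queries whose count exceeds the cutoff.
    return {query_id for query_id, n in counts.items() if n > max_hits}


rows = (
    [("read1", "hit%d" % i, 99.5, 95.0) for i in range(5)]    # 5 strong hits
    + [("read2", "hitA", 99.8, 97.0), ("read2", "hitB", 99.9, 92.0)]  # 2 strong hits
    + [("read3", "hit%d" % i, 97.0, 95.0) for i in range(5)]  # identity too low
)
conserved = conserved_queries(rows)
```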
As a final step, remove all hits associated with these conserved
sequences from the original BLAST output. Using the tool "Join, Subtract
and Group -> Compare two Datasets", set dataset 1 to be the original
BLAST output and dataset 2 to be the list of conserved sequences (from
the above processing). The columns for both will be sequence
identifiers, and the option will be "To find:" -> "Non Matching rows of
1st dataset".
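The "Non Matching rows of 1st dataset" comparison behaves like an anti-join on the identifier column. A minimal Python sketch of that idea follows; the function name remove_conserved and the assumption that the query identifier sits in the first column are illustrative only.

```python
def remove_conserved(blast_rows, conserved_ids, id_column=0):
    """Keep only rows whose identifier is NOT in the conserved list.

    blast_rows: rows of the original BLAST output (dataset 1).
    conserved_ids: identifiers of conserved sequences (dataset 2).
    id_column: position of the identifier in each row (assumed to be 0).
    """
    conserved_ids = set(conserved_ids)  # set lookup keeps this fast on large inputs
    return [row for row in blast_rows if row[id_column] not in conserved_ids]


blast_rows = [
    ("read1", "hitA"),
    ("read1", "hitB"),
    ("read2", "hitC"),
    ("read3", "hitD"),
]
kept = remove_conserved(blast_rows, ["read1"])
```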
There are likely other ways to do this same procedure, and any process
that you work out could be put into a workflow for later use.
Hopefully this process works for you or leads you to a process that does
for your particular analysis. The tools in these groups can be combined
in many ways to produce unique manipulations.
Best wishes for your project,
Jen
Galaxy team
On 3/12/12 1:23 PM, Scott Tighe wrote:
Dear GALAXY and Jennifer
Although the windshield analysis papers were good starters, they do
not address conserved sequence purging or how to get at this
information. If anyone has an automated approach I'd be interested.
[Discard sequences from blast that have more than 4 hits >99%]
Scott
Scott Tighe
Advanced Genome Technology Lab
Vermont Cancer Center at the University of Vermont
149 Beaumont Avenue
Health Science Research Bd RM 305
Burlington Vermont USA 05405
lab 802-656-AGTC (2482)
cell 802-999-6666
On 3/12/2012 2:28 PM, John Major wrote:
A small warning re: the current cloud-Blast+ config.
To properly use the metagenomic tools, if you use the blast+ galaxy
tool, make sure to export in blast.XML, then you'll need a script to
parse out the readID and the Hit_def (as the hit ID). It appears
that the 'Hit_def' field contains the correct key to the taxonomy
database. Specifically, the Hit_def field is in the format #_#,
where the 'gi' id is the first #. The tabular (normal and extended)
data does not contain this info.
I noticed this after attempting to use the tabular data, and using a
trimmed col[1] (supposed to be hit seqID), but my results always came
back as a ranked list of the most sequenced genomes in nt....
basically keying in randomly.
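A script of the kind described above might look like the sketch below. It is not the script used in this thread; it relies only on the standard BLAST XML element names (Iteration, Iteration_query-def, Hit, Hit_def) and on the "#_#" Hit_def layout reported above for this particular configuration. The function name read_hit_pairs and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_hit_pairs(xml_text):
    """Return (read_id, gi) pairs from BLAST XML text.

    Assumes Hit_def has the form '<gi>_<something>', as reported for this
    cloud BLAST+ setup; other databases may format Hit_def differently.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for iteration in root.iter("Iteration"):
        read_id = iteration.findtext("Iteration_query-def", default="")
        for hit in iteration.iter("Hit"):
            hit_def = hit.findtext("Hit_def", default="")
            # The gi number is taken as everything before the first underscore.
            gi = hit_def.split("_", 1)[0]
            pairs.append((read_id, gi))
    return pairs


# Tiny hand-written sample, trimmed to the elements used above.
sample = """<BlastOutput><BlastOutput_iterations>
<Iteration>
  <Iteration_query-def>read1</Iteration_query-def>
  <Iteration_hits>
    <Hit><Hit_def>12345_678</Hit_def></Hit>
    <Hit><Hit_def>24680_13</Hit_def></Hit>
  </Iteration_hits>
</Iteration>
</BlastOutput_iterations></BlastOutput>"""

pairs = read_hit_pairs(sample)
```

For very large BLAST XML files, a streaming parser such as ET.iterparse would use less memory than loading the whole document at once.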
j
On Wed, Mar 7, 2012 at 4:16 PM, Jennifer Jackson <j...@bx.psu.edu
<mailto:j...@bx.psu.edu>> wrote:
Hi Vincent, Scott,
Filtering raw hits is an important part of a metagenomics
analysis pipeline. Please see the methods described in the
published metagenomics analysis paper associated with this tool set:
Kosakovsky Pond S, Wadhawan S, Chiaromonte F, Ananda G, Chung W,
Taylor J, and Nekrutenko A. "Windshield splatter analysis with
the Galaxy metagenomic pipeline". Genome Research. 2009 Nov;
19(11):2144-53.
http://www.ncbi.nlm.nih.gov/pubmed/19819906
Live supplemental data that can be imported and experimented with
is available on the public instance, including raw data, working
histories, and a tutorial that demonstrates step-by-step the
exact methods used in the publication:
http://main.g2.bx.psu.edu/u/aun1/p/windshield-splatter
http://main.g2.bx.psu.edu/library -> see "Windshield splatter"
Not all tools are available on the public main server, but a
local or cloud instance could be used with wrapped tools from the
Distribution or Tool Shed, as necessary. For example, BLAST is
not available on the public instance, but is included in the
distribution for use in local or cloud instances.
http://getgalaxy.org
Hopefully you will both find this helpful,
Jen
Galaxy project
On 2/29/12 5:32 PM, Montoya, Vincent wrote:
Hello
I am a relatively new user on Galaxy and I had a question
regarding "Fetching Taxonomic Information". It is great that
I can retrieve all of the hits for each sequence, but I
cannot seem to find an option to also provide how accurate of
a match it is to the given taxon. For instance, a percentage
match. I can access this information in the original file
and programmatically retrieve it, but it would be nice if it
came in one package so that I can avoid those false hits
that have a low percentage match. Can you please provide me
with instructions on how best to retrieve this information
(hopefully in a single file)?
Thank you
Vincent
___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org <http://usegalaxy.org>. Please keep all
replies on the list by
using "reply all" in your mail client. For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists,
please use the interface at:
http://lists.bx.psu.edu/
--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org/wiki/Support