Re: [galaxy-user] Metagenomic filtering

Scott Tighe Wed, 16 Oct 2013 09:24:07 -0700

Jing

If I have a Galaxy dataset, do you think it is possible to develop a pipeline that can


Megablast from Shotgun data for:

DNA gyrase only
Ribosomal ITS (internal transcribed spacer) only
Cytochrome
RecA
Pol

As well as filter all model organisms?

Have you worked with the Galaxy Toolshed?

Scott


Scott Tighe
Senior Core Laboratory Research Staff
Advanced Genome Technologies Core
University of Vermont
Vermont Cancer Center
149 Beaumont ave
Health Science Research Facility 303/305
Burlington Vermont 05405
802-656-2557

On 10/6/2013 9:59 AM, Jing Yu wrote:

Dear Scott,

I think what you propose is doable.

You may
1. use a 16s or gyrase DNA sequence as a query to blast against your data to get the related sequences,
2. and then use those sequences as queries to blast against your nucleotide database with appropriate filters.
There are several ways to carry out these steps. For example, you may already have the 16s sequence from assembly against a reference genome.
And for Step 2, if you are not blasting thousands of times a day, and believe in the recent stability of NCBI, then a simple web_blast code will do the trick. Otherwise, since the local blast+ toolkit doesn't provide the equivalent organism filters, you'll have to work a bit on it:
Make a nucleotide database for Prokaryotes.
Search txid561[ORGN] on http://www.ncbi.nlm.nih.gov/nuccore (this is for Escherichia as an example),
Send to 'File' -> Format -> GI List
When running Blast, use this GI list as the value of this argument: -negative_gilist
Then parse the Blast result.
Most of these can be automated with some code, but I don't know how to incorporate it into Galaxy.
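As a rough illustration of the last two steps, the blastn call with the GI exclusion list could be assembled as below. This is only a sketch: the file names and the database name are hypothetical placeholders, and it assumes a local BLAST+ install whose blastn accepts -task, -negative_gilist and -outfmt 6.

```python
from typing import List


def build_megablast_command(query_fasta: str, database: str,
                            negative_gi_file: str, output_tsv: str) -> List[str]:
    """Return a blastn argument list that excludes every GI in negative_gi_file."""
    return [
        "blastn",
        "-task", "megablast",
        "-query", query_fasta,
        "-db", database,
        # GI list downloaded from the nuccore search (e.g. txid561[ORGN])
        "-negative_gilist", negative_gi_file,
        # tabular output, easy to parse afterwards
        "-outfmt", "6",
        "-out", output_tsv,
    ]


if __name__ == "__main__":
    # All four names below are made up for the example.
    cmd = build_megablast_command("gyrase_hits.fasta", "prokaryote_nt",
                                  "escherichia.gi", "filtered_hits.tsv")
    print(" ".join(cmd))
    # To actually run it (needs BLAST+ and the database in place):
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.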
Regards,
Jing
On 4 Oct 2013, at 23:52, Scott Tighe wrote:
Dear Jing

What you have outlined below is perfect.
I wonder how hard it would be to design a few filters that only look at certain genes and/or filter model organisms out of the dataset.
For example, say you want only data for 16s or only gyrase, but no /E. coli/ and no /Pseudomonas aeruginosa/
Scott
On 9/25/2013 12:06 AM, Jing Yu wrote:
Hi Scott,

My first thought is:
1. Remove rDNA sequences (and/or other well-known highly conserved sequences, to reduce the workload in step 2).
2. Blast, then remove sequences with > (say 99%) match to > (say 5) genera. (Optional if step 1 is already good enough.)
For step 1:
    Build a fasta file of the chosen highly conserved sequences, and use it as a feed to blast against your MiSeq result.
    Remove positive hits.
For step 2:
    Blast remaining MiSeq sequences against NCBI (or whatever) database.
    Remove if it hits more than n genera.
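
A minimal sketch of the "remove positive hits" part of step 1, assuming the conserved sequences were the BLAST query and the MiSeq reads were the subject database, with results in the default tabular format (-outfmt 6), so the read ID sits in the second column. The function names are made up for illustration.

```python
from typing import Iterable, List, Set


def hit_read_ids(blast_lines: Iterable[str]) -> Set[str]:
    """Collect subject IDs (column 2) from BLAST tabular output."""
    ids = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) > 1:
            ids.add(fields[1])
    return ids


def drop_hit_reads(fasta_lines: Iterable[str], hit_ids: Set[str]) -> List[str]:
    """Return the FASTA lines of every read whose ID is not in hit_ids."""
    kept = []
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            read_id = parts[0] if parts else ""
            keep = read_id not in hit_ids
        if keep:
            kept.append(line)
    return kept
```

The reads that survive are what would go forward to step 2.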

Jing
On 24 Sep 2013, at 22:17, Scott Tighe wrote:
Jing et al
Thank you for the offer to write some code to help advance the metagenomics arena. It is certainly needed.
The problem with megablast and shotgun metagenomics is well known: without proper understanding and correct software, it will yield very misleading and in many cases incorrect data. For those of us who wish NOT to move to a protein level of comparison for specific reasons, we are stuck.
*The Problem:*
If I megablast 50 million sequences from a HiSeq run, millions of rRNA sequences will have a 99% match to all microbes' rRNA GenBank deposits. Not surprising, since rRNA is highly conserved. The difference between E. coli and Shigella is 1 to 2 bases for the full 1540 bp 16s. So 16s is not useful for genus level, and certainly not species.
*So what happens:*
The returned matches will have many hits to whatever model organism is in GenBank. For example, E. coli has 13000 entries for rRNA and Sphaerotilus has 3 entries for rRNA. If the blasted sequence matches both, the results will mislead the investigator to think they have 13000 hits to E. coli, EVEN if the microbe is Sphaerotilus.
*The cure?:*
Is there a way to filter/remove all such hits? Let's say, for example, that a result has a first match (say E. coli) at >99%, a second match (say Pseudomonas) at >99%, and a third, fourth and fifth match at >99% for three other organisms. This sequence _must_ be discarded because it is a conserved sequence.
Basically, conserved sequence is the enemy and invalidates the entire result.
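
The discard rule described here could be sketched roughly as follows. It assumes the BLAST results were written as tab-separated lines with four columns (read ID, subject ID, percent identity, scientific name), for example from a custom tabular format such as "6 qseqid sseqid pident sscinames", and it takes the first word of the scientific name as the genus. The function name and the default thresholds (>99%, five genera) are illustrative only.

```python
from collections import defaultdict
from typing import Iterable, Set


def conserved_read_ids(blast_lines: Iterable[str],
                       min_identity: float = 99.0,
                       genus_limit: int = 5) -> Set[str]:
    """Return reads with hits above min_identity to genus_limit or more genera."""
    genera = defaultdict(set)
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not in the assumed four-column layout
        read_id, _subject, pident, sciname = fields[:4]
        if float(pident) > min_identity and sciname.strip():
            # Assumption: genus is the first word of the scientific name.
            genera[read_id].add(sciname.split()[0])
    return {read for read, seen in genera.items() if len(seen) >= genus_limit}
```

Reads returned by this function would be thrown out before counting hits per organism, so a read that matches E. coli, Pseudomonas and three other genera equally well never contributes to any of them.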
*Another problem:*
If you have a reference sample with 19 non-model microbes, and you run that by HiSeq shotgun for metagenomics and then megablast, what do you think you get? If E. coli is not in the reference sample, how many hits do you think you get? Yes, tens of thousands. So without removing conserved sequences, your data is wrong and you are much better served by culturing and running a Biolog metabolic panel and comparing to the sequence result.
So where do we start? I have some shotgun metagenomics data from the reference sample which included the 19 microbes. That was data from a MiSeq.
Scott
On 9/20/2013 9:17 PM, Jing Yu wrote:
Hi Scott,
I can do some perl programming, such as local/remote blasting. Can you specify your problem a little more clearly, so that maybe I can write a program to do just that?
Regards,
Jing




Gerald
16s is basically useless for identification to genus. Since I started sequencing 16s in 1992, I have come to realize that without sequencing the full 1540 bases, it is generally misleading, and even then, it is not accurate enough to nail genus in more than 1/2 the cases. However, what is your feeling on ITS and gyrase? They seem to be far more discriminating, but those databases were decommissioned some time ago.
The desirable thing would be that Galaxy or NCBI add a "filter conserved genes" option [i.e. any hit with a second choice greater than 3% distance]. Something such as that.
If you (or others) are aware of such a thing, I'd love to hear about it.
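
One possible reading of that "filter conserved genes" idea, sketched under stated assumptions: keep a read only when its best hit beats the best hit to any other organism by more than 3 percentage points of identity, using the identity gap as a stand-in for distance. The input layout (tab-separated read ID, subject ID, percent identity, scientific name) and the function name are assumptions, not an existing Galaxy or NCBI feature.

```python
from typing import Dict, Iterable, Set


def discriminating_read_ids(blast_lines: Iterable[str],
                            min_gap: float = 3.0) -> Set[str]:
    """Return reads whose top organism leads the runner-up by more than min_gap."""
    best: Dict[str, Dict[str, float]] = {}
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not in the assumed four-column layout
        read_id, _subject, pident, organism = fields[:4]
        per_org = best.setdefault(read_id, {})
        # Keep only the best identity seen for each organism.
        per_org[organism] = max(per_org.get(organism, 0.0), float(pident))
    kept = set()
    for read_id, per_org in best.items():
        ranked = sorted(per_org.values(), reverse=True)
        if len(ranked) == 1 or ranked[0] - ranked[1] > min_gap:
            kept.add(read_id)
    return kept
```

Under this reading, a 16s read that hits E. coli at 99.8% and Shigella at 99.5% is dropped, while a gyrase read that hits one organism at 99% and the next at 94% is kept.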
Sincerely
Scott

___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:

  http://galaxyproject.org/search/mailinglists/
