Re: [galaxy-user] Metagenomic filtering

Jing Yu Sun, 06 Oct 2013 22:15:00 -0700

Dear Scott,

I think what you propose is doable.


You may 
1. use a 16s or gyrase DNA sequence as feeds to blast against your data to get 
the relative sequences, 
2. and then use the sequences as feeds to blast against your nucleotide 
database with appropriate filters.

There are several ways to make the steps. For example, you may already have the 
16s sequence from assembly against a reference genome.
And for Step 2, if you are not blasting thousands of times a day, and believe 
in the recent stability of NCBI, then a simple web_blast code will do the 
trick. Otherwise, since the local blast+ toolkit doesn't provide the equivalent 
organism filters, you'll have to work a wit bit on it:

Make a nucleotide database for Prokaryotes.
Search txid561[ORGN] on http://www.ncbi.nlm.nih.gov/nuccore (this is for 
Escherichia as an example),
Send to 'File' -> Format ->GI List
When Blast, use this GI list as the value of this argument: -negative_gilist
Then parse the Blast result.

Most of these can be automated with some code, but I don't know how to 
incorporate it into Galaxy. 

Regards,
Jing
On 4 Oct 2013, at 23:52, Scott Tighe <scott.ti...@uvm.edu> wrote:

> Dear Jing
> 
> What you have outlined below is perfect. 
> 
> I wonder how hard it would be to design a few filters that only look a 
> certain genes and or filter model organisms out of the dataset. 
> 
> For example, say you want only data for 16s or only gyrase, but no E.coli and 
> no Pseudomanas aeroginosa
> 
> Scott
> Scott Tighe
> Senior Core Laboratory Research Staff
> Advanced Genome Technologies Core
> University of Vermont
> Vermont Cancer Center
> 149 Beaumont ave
> Health Science Research Facility 303/305
> Burlington Vermont 05405
> 802-656-2557
> On 9/25/2013 12:06 AM, Jing Yu wrote:
>> Hi Scott,
>> 
>> My first thought is:
>> 
>> 1. Remove rDNA sequences (and/or other well known highly-conserved sequences 
>> to reduce the workload in step 2).
>> 2. Blast, then remove sequences with > (say 99%) match to > (say 5) genus. 
>> (Optional if step 1 is already good enough)
>> 
>> 
>> For step 1:
>>     Build a fasta file of the chosen highly conserved sequences, and use it 
>> as a feed to blast against your MiSeq result.
>>     Remove positive hits.
>> For step 2:
>>     Blast remaining MiSeq sequences against NCBI (or whatever) database.
>>     Remove if it hits more than n genus.
>> 
>> Jing
>> On 24 Sep 2013, at 22:17, Scott Tighe <scott.ti...@uvm.edu> wrote:
>> 
>>> Jing et al 
>>> 
>>> Thank you for the offer to write some code to help advance the metagenomics 
>>> arena. It is certainly needed.
>>> 
>>> So the problem is well known with megablast and shotgun metagenomics and 
>>> without proper understanding and correct software will yield very 
>>> misleading and in many cases incorrect data. For those of us who wish NOT 
>>> to move to a protein level of comparison for specific reasons, we are 
>>> stuck. 
>>> 
>>> The Problem:
>>> 
>>> If I megablast 50 million sequences from a HiSeq run, millions of rRNA 
>>> sequences will have a 99% match to all microbes rRNA genbank deposits. Not 
>>> surprizing since the rRNA is highly conserved. The difference between 
>>> E.coli and Shigella is 1 to 2 bases for the full 1540 bp 16s.  So 16s is 
>>> not useful for Genus level, and certainly not Species
>>> 
>>> So what happens:
>>> 
>>> The returned matches will have many hits to whatever model organism is in 
>>> Genbank. For example E coli has 13000 entries for rRNA and Sphearotilus has 
>>> 3 entries for rRNA. If the blasted sequence matches both, the results will 
>>> mislead the investigator to think they have 13000 hits to E coli, EVEN if 
>>> the microbe is Sphearotilus. 
>>> 
>>> The cure?:
>>> 
>>> If there was a way to filter/ remove all hits ? Let say, for example, that 
>>> a result has a first match (say E. coli) at >99% a second match (say 
>>> Pseudomanas) at >99% and a third , forth and fifth match >99 for three 
>>> other organisms. This sequence must be discarded because it is a conserve 
>>> sequence.
>>> 
>>> Basically conserved sequence is the enemy and invalidates the entire 
>>> result. 
>>> 
>>> Another problem:
>>> 
>>> If you have a reference sample with 19 non-model  microbes, and you run 
>>> that by HiSeq Shotgun for metagenomics and then megablast, what do you 
>>> think you get?  If E coli is not in the reference sample, how many hits do 
>>> you think you get? Yes, 10,000 of thousands. So without removing conserved 
>>> sequences, your data is wrong and you are much better served by culturing 
>>> and running a Biolog metabolic panel and comparing to the sequence result.  
>>> 
>>> So where do we start? I have some shotgun metagenomics data from the 
>>> reference sample which included the 19 microbes. That was data from a MiSeq.
>>> 
>>> Scott
>>> Scott Tighe
>>> Senior Core Laboratory Research Staff
>>> Advanced Genome Technologies Core
>>> University of Vermont
>>> Vermont Cancer Center
>>> 149 Beaumont ave
>>> Health Science Research Facility 303/305
>>> Burlington Vermont 05405
>>> 802-656-2557
>>> On 9/20/2013 9:17 PM, Jing Yu wrote:
>>>> Hi Scott,
>>>> 
>>>> I can do some perl programming, such as local/remote blasting. Can you 
>>>> specify your problem a little bit clearer, so that maybe I can write a 
>>>> program to do just that?
>>>> 
>>>> Regards,
>>>> Jing
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Gerald
>>>> 
>>>>  16s is basically useless for identification to genus. Since I started 
>>>> sequencing 16s in 1992, I have come to realize that without sequencing the 
>>>>  full 1540 bases, it is generally  misleading, and even than, it is not 
>>>> accurate enough to nail genus on more than 1/2 the cases.   However, what 
>>>> is your feeling on ITS  and gyrase, They seem to be far more 
>>>> discriminating but those databases have been decommissioned sometime ago.
>>>> 
>>>> The desirable thing would be that Galaxy or NCBI  add a "filter conserved 
>>>> genes" [ ie any hit with a second choice greater than 3% distance]. 
>>>> Something such as that.
>>>> 
>>>> If you (or others)  are aware of such a thing, I'd love the here about it.
>>>> 
>>>> Sincerely 
>>>> Scott
>>> 
>> 
>

___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:

  http://galaxyproject.org/search/mailinglists/

Re: [galaxy-user] Metagenomic filtering

Reply via email to