Re: [galaxy-user] FASTQ collapse?

Ben Bimber Thu, 03 Feb 2011 11:35:30 -0800

i'm not terribly familiar with what Galaxy offers along these lines,
but google 'FastQC' or Picard tools for a simple way to find the sort
of quality distributions you just mentioned.  it would take a little
work on your end, but you could answer that question.  the former
actually uses Picard tools behind the scenes, but is a little more
graphically oriented.
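
For a quick look without installing anything, a per-position quality summary can be approximated in a few lines of Python. This is only a rough sketch: it assumes plain 4-line FASTQ records with Phred+33 quality encoding, and the names `read_fastq` and `per_position_quality` are made up for illustration (they are not part of Galaxy, FastQC, or Picard).

```python
import io
import statistics
from collections import defaultdict

def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the reader
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def per_position_quality(records, offset=33):
    """Return {position: (min, median, max)} of Phred scores (0-based positions).

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    by_pos = defaultdict(list)
    for _, _, qual in records:
        for i, ch in enumerate(qual):
            by_pos[i].append(ord(ch) - offset)
    return {i: (min(v), statistics.median(v), max(v)) for i, v in sorted(by_pos.items())}

# Tiny in-memory example; a real file would be opened with open(path)
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nI5I#\n")
summary = per_position_quality(read_fastq(demo))
for pos, (lo, med, hi) in summary.items():
    print(pos, lo, med, hi)
```

The three numbers per position are roughly what a box plot would be drawn from; running this before and after filtering would show how much the distributions change.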


-ben



On Thu, Feb 3, 2011 at 1:23 PM, Johnson, Kory (NIH/NINDS) [C]
<[email protected]> wrote:
> Hi Ben,
>
> Thanks for your reply!
>
> Yes, perhaps the way to go is taking the average, median, or max quality 
> value by position across duplicate reads, using a sliding-window function, 
> or simply selecting at random one quality line from among those with the 
> highest summed quality.
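
To make those options concrete, here is a minimal sketch of a sequence-level collapse that either keeps the quality string with the highest summed score or builds a new string position by position. It is not how the FASTX-toolkit or Galaxy does it; `collapse_fastq` and its `mode` argument are hypothetical names, Phred+33 encoding is assumed, and ties on the summed score go to the first read seen rather than a random one so the output is reproducible.

```python
import statistics
from collections import defaultdict

def collapse_fastq(records, mode="best_sum", offset=33):
    """Collapse reads with identical sequence.

    records: iterable of (header, sequence, quality) tuples.
    mode: 'best_sum' keeps the quality string with the highest summed score;
          'mean', 'median' or 'max' build a new string position by position.
    Returns a list of (sequence, quality, count) in first-seen order.
    """
    groups = defaultdict(list)  # insertion order is preserved (Python 3.7+)
    for _, seq, qual in records:
        groups[seq].append(qual)

    reducers = {"mean": statistics.mean, "median": statistics.median, "max": max}
    out = []
    for seq, quals in groups.items():
        if mode == "best_sum":
            # max() returns the first maximal item, so ties resolve to the first read seen
            rep = max(quals, key=lambda q: sum(ord(c) - offset for c in q))
        else:
            reduce_fn = reducers[mode]
            # Identical sequences have identical lengths, so zip() lines up positions
            rep = "".join(
                chr(int(round(reduce_fn([ord(c) - offset for c in column]))) + offset)
                for column in zip(*quals)
            )
        out.append((seq, rep, len(quals)))
    return out

reads = [
    ("@r1", "ACGT", "II5I"),
    ("@r2", "ACGT", "I5II"),
    ("@r3", "ACGT", "IIII"),
    ("@r4", "TTTT", "5555"),
]
print(collapse_fastq(reads, mode="best_sum"))
print(collapse_fastq(reads, mode="mean"))
```

Note that the 'mean' and 'median' modes produce a quality string that no sequencer ever emitted, which is the transparency trade-off raised earlier in the thread.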
>
> I would think the complexity of the problem can be greatly reduced if "FASTQ 
> collapse" is done after FASTQ filtering by quality and/or length.
>
> It would also be interesting to ask, and answer, what the numbers look like 
> after filtering, as far as duplicate sequences having the same quality 
> string vs. differences by position vs. base composition.
>
> Perhaps a FASTQ collapse tool could be developed in such a way to not only 
> remove duplicates and replace with a representative quality line, but also be 
> used as a way to perform a FASTQ filtering polish step based on discordance 
> rates by position across duplicates.
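
One possible way to put a number on that: the thread does not define "discordance", so the sketch below assumes it means the spread (max minus min Phred score) at each position within a group of duplicate reads. The function name `quality_spread_by_position` is hypothetical and Phred+33 encoding is assumed.

```python
from collections import defaultdict

def quality_spread_by_position(records, offset=33):
    """For each position, list the (max - min) Phred spread within every
    group of duplicate sequences. Groups with a single read are skipped."""
    groups = defaultdict(list)
    for _, seq, qual in records:
        groups[seq].append(qual)

    spreads = defaultdict(list)
    for quals in groups.values():
        if len(quals) < 2:
            continue
        for pos, column in enumerate(zip(*quals)):
            scores = [ord(c) - offset for c in column]
            spreads[pos].append(max(scores) - min(scores))
    return dict(spreads)

reads = [
    ("@r1", "ACGT", "II5I"),
    ("@r2", "ACGT", "IIII"),
    ("@r3", "GGCC", "I#II"),
    ("@r4", "GGCC", "IIII"),
    ("@r5", "TTTT", "5555"),  # unique read, contributes nothing
]
spread = quality_spread_by_position(reads)
for pos in sorted(spread):
    print(pos, spread[pos])
```

Each per-position list could be fed to any box-plot routine, and a cutoff on the spread would then act as the "polish" filter described above.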
>
> That way, a user could look at the discordance rates via box plot, as you 
> can for quality scores observed across all reads, and pick refined criteria 
> for filtering.
>
> Just an idea.
>
> Thanks again for taking the time to respond Ben.
>
> Best,
>
> Kory
>
> --------------------------------------------
>
> Kory R. Johnson, MS, PhD
> Sr. Bioinformatics Scientist
>
>
>
> www.kellygovernmentsolutions.com
>
> Providing Contract Services For:
>
> Bioinformatics Section,
> Information Technology & Bioinformatics Program,
> Division of Intramural Research (DIR),
> National Institute of Neurological Disorders & Stroke (NINDS),
> National Institutes of Health (NIH),
> Bethesda, Maryland
>
> Mailing Address:
>
> NINDS/NIH
> Clinical Center (Building 10)
> Office 5S223
> 9000 Rockville Pike
> Bethesda, MD 20892
>
> Contact Information:
>
> Phone:    301-402-1956
> Fax:           301-480-3563
> email:       [email protected]
>
>  Green Message:
>
> Please consider the environment before printing this e-mail.  Thank you.
>
> Important Message:
>
> This electronic message transmission contains information intended for the 
> recipient only.  Such that, the information contained herein may be 
> confidential, privileged, or proprietary.  If you are not the intended 
> recipient, be aware that any disclosure, copying, distribution, or use of 
> this information is strictly prohibited.  If you have received this 
> electronic information in error, please notify the sender immediately by 
> telephone.  Thank you.
>
> --------------------------------------------
>
>
> -----Original Message-----
> From: Ben Bimber [mailto:[email protected]]
> Sent: Thursday, February 03, 2011 1:37 PM
> To: Johnson, Kory (NIH/NINDS) [C]; [email protected]
> Subject: Re: [galaxy-user] FASTQ collapse?
>
> i don't have any intimate knowledge here, but my guess is that it comes
> down to defining what quality scores to keep.  collapsing sequences is
> easy.  the sequences either match or they don't.
>
> handling sequence+quality is harder.  would you keep the quality
> string with the highest total sum of qualities?  what if 2 have an
> identical sum, but different strings (which is probably not uncommon)?
>  in theory you could create a completely new quality string that
> attempts to gather average quality based on the quality score at each
> position.  all of these are possible, but it starts becoming less
> transparent and more complex.  collapsing FASTQ to FASTA is simply a
> way to remove that problem.
>
> a similar scenario I could imagine sometimes being useful would be
> 'collapse sequences, but don't count Ns as mismatches'.  certainly
> possible, but more complicated than simply collapsing reads that are
> 100% sequence-identical.
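
A small sketch of why the N-tolerant case is more complicated: matching that ignores Ns is not transitive, so which cluster a read lands in depends on the order reads are visited. The greedy approach and both function names below are assumptions made for illustration, not an existing Galaxy or FASTX-toolkit feature, and the pairwise comparison would be slow on a real data set.

```python
def matches_ignoring_n(a, b):
    """True if a and b have equal length and agree wherever neither has an N."""
    return len(a) == len(b) and all(x == y or "N" in (x, y) for x, y in zip(a, b))

def collapse_ignoring_n(sequences):
    """Greedy collapse: returns a list of [representative, count] pairs."""
    clusters = []
    # Visit reads with the fewest Ns first so representatives are as resolved as possible
    for seq in sorted(sequences, key=lambda s: s.count("N")):
        for cluster in clusters:
            if matches_ignoring_n(cluster[0], seq):
                cluster[1] += 1  # joins the first compatible cluster
                break
        else:
            clusters.append([seq, 1])  # no compatible cluster found
    return clusters

seqs = ["ACGT", "ACNT", "ACTT", "NCGT", "ACGT"]
print(collapse_ignoring_n(seqs))
```

In this example ACNT is compatible with both ACGT and ACTT, and it is counted under ACGT only because that cluster was created first.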
>
> once again, just my own thoughts here.  Assaf or someone from galaxy
> could perhaps answer better.
>
> -ben
>
>
>
> On Thu, Feb 3, 2011 at 12:22 PM, Johnson, Kory (NIH/NINDS) [C]
> <[email protected]> wrote:
>> Hello Galaxy users,
>>
>> Just to follow-up on my user group question described in the list-serv 
>> e-mail just sent out.
>>
>> I put forth the question about FASTQ collapse, as the FASTX-toolkit by Assaf 
>> Gordon describes the supported collapse tool as follows:
>>
>> "FASTQ/A Collapser, Collapsing identical sequences in a FASTQ/A file into a 
>> single sequence (while maintaining reads counts)"
>>
>> Yet the collapse tool in Galaxy appears to support FASTA only?
>>
>> Why am I asking?
>>
>> I would like to remove duplicate reads in a FASTQ file by sequence, leaving 
>> one representative unique read with the best quality line among the 
>> duplicates it was identified from.
>>
>> Can certainly convert FASTQ to FASTA, then collapse, but if you do not have 
>> the qual file, you cannot reconstitute a FASTQ file with actual qual scores.
>>
>> Any argument for or against?  Or can Galaxy already do this, and I am 
>> just missing the tool to use?
>>
>> Thanks ... best,
>>
>> Kory
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]]
>> Sent: Thursday, February 03, 2011 12:52 PM
>> To: [email protected]
>> Subject: galaxy-user Digest, Vol 56, Issue 4
>>
>> Message: 7
>> Date: Thu, 3 Feb 2011 12:51:34 -0500
>> From: "Johnson, Kory (NIH/NINDS) [C]" <[email protected]>
>> To: "'[email protected]'" <[email protected]>
>> Subject: [galaxy-user] FASTQ collapse?
>> Message-ID:
>>        <[email protected]>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> Hello,
>>
>> Is there an option to collapse duplicate sequences in FASTQ format?
>>
>> I see collapse for FASTA, but where is it for FASTQ?
>>
>> Thank you,
>>
>> Kory
>>
>

_______________________________________________
galaxy-user mailing list
[email protected]
http://lists.bx.psu.edu/listinfo/galaxy-user
