Re: [galaxy-user] FASTQ collapse?

Johnson, Kory (NIH/NINDS) [C] Thu, 03 Feb 2011 11:29:04 -0800

Hi Ben,

Thanks for your reply!


Yes, perhaps the way to go is to take the average, median, or maximum quality
value by position across duplicate reads, to use a sliding-window function, or
simply to select at random one quality line from those having the highest
summed quality value.
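
To make the per-position idea concrete, here is a minimal Python sketch. It is
not an existing Galaxy or FASTX-toolkit tool. It groups reads by sequence and
builds one quality line by combining the Phred scores at each position. The
function names (read_fastq, collapse), the Phred+33 offset, and the shape of
the output are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def read_fastq(lines):
    """Yield (header, sequence, quality) from FASTQ lines, 4 lines per record."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual


def collapse(records, combine=median):
    """Collapse identical sequences and combine qualities position by position.

    Yields (sequence, read_count, merged_quality). `combine` may be median,
    statistics.mean, max, or any function taking a list of integer scores.
    """
    groups = defaultdict(list)
    for _, seq, qual in records:
        groups[seq].append(qual)
    for seq, quals in groups.items():
        merged = "".join(
            chr(int(round(combine([ord(c) - PHRED_OFFSET for c in column])))
                + PHRED_OFFSET)
            for column in zip(*quals)  # one column per read position
        )
        yield seq, len(quals), merged
```

Passing combine=max keeps the best score seen at each position, while the
default median is less sensitive to a single outlying read.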

I would think the complexity of the problem can be greatly reduced if "FASTQ
collapse" is done after FASTQ filtering by quality and/or length.

It would also be interesting to ask, and answer, what the numbers look like
after filtering: how many duplicate sequences have the same quality string,
and how the rest differ by position and by base composition.

Perhaps a FASTQ collapse tool could be developed in such a way that it not
only removes duplicates and replaces them with a representative quality line,
but also serves as a FASTQ filtering polish step based on discordance rates by
position across duplicates.

A user could then look at the discordance rates in a box plot, as you can for
the quality scores observed across all reads, and pick refined criteria for
filtering.
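
As a rough illustration of what "discordance rates by position" could mean,
the Python sketch below uses one possible definition, which is my assumption
and not an established metric: within each group of duplicate reads, the
discordance at a position is the fraction of reads whose quality character
differs from the most common one at that position. The function name
discordance_rates is hypothetical.

```python
from collections import Counter, defaultdict


def discordance_rates(records):
    """Return {position: [rate for each duplicate group]}.

    `records` is an iterable of (sequence, quality) pairs. Sequences seen
    only once are skipped, since they carry no duplicate information.
    """
    groups = defaultdict(list)
    for seq, qual in records:
        groups[seq].append(qual)

    rates = defaultdict(list)
    for quals in groups.values():
        if len(quals) < 2:
            continue
        for pos, column in enumerate(zip(*quals)):
            # Count of the most common quality character at this position.
            modal_count = Counter(column).most_common(1)[0][1]
            rates[pos].append((len(column) - modal_count) / len(column))
    return dict(rates)
```

Each list in the result holds one value per duplicate group, so the lists can
be handed to any plotting library to draw one box per read position.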

Just an idea.

Thanks again for taking the time to respond Ben.

Best,

Kory

--------------------------------------------

Kory R. Johnson, MS, PhD
Sr. Bioinformatics Scientist



www.kellygovernmentsolutions.com

Providing Contract Services For:

Bioinformatics Section,
Information Technology & Bioinformatics Program,
Division of Intramural Research (DIR),
National Institute of Neurological Disorders & Stroke (NINDS),
National Institutes of Health (NIH),
Bethesda, Maryland

Mailing Address:

NINDS/NIH
Clinical Center (Building 10)
Office 5S223
9000 Rockville Pike
Bethesda, MD 20892

Contact Information:

Phone:    301-402-1956
Fax:           301-480-3563
email:       [email protected]

 Green Message:

Please consider the environment before printing this e-mail.  Thank you.

Important Message:

This electronic message transmission contains information intended for the
recipient only.  The information contained herein may be confidential,
privileged, or proprietary.  If you are not the intended recipient, be aware
that any disclosure, copying, distribution, or use of this information is
strictly prohibited.  If you have received this electronic information in
error, please notify the sender immediately by telephone.  Thank you.

--------------------------------------------


-----Original Message-----
From: Ben Bimber [mailto:[email protected]]
Sent: Thursday, February 03, 2011 1:37 PM
To: Johnson, Kory (NIH/NINDS) [C]; [email protected]
Subject: Re: [galaxy-user] FASTQ collapse?

i don't have any intimate knowledge here, but my guess is that it comes
down to defining what quality scores to keep.  collapsing sequences is
easy.  the sequences either match or they don't.

handling sequence+quality is harder.  would you keep the quality
string with the highest total sum of qualities?  what if 2 have an
identical sum, but different strings (which is probably not uncommon)?
 in theory you could create a completely new quality string that
attempts to gather average quality based on the quality score at each
position.  all of these are possible, but it starts becoming less
transparent and more complex.  collapsing FASTQ to FASTA is simply a
way to remove that problem.
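
The "highest total sum" rule and its tie problem can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
function name best_quality is hypothetical, Phred+33 encoding is assumed, and
breaking ties by taking the lexicographically smallest string is an arbitrary
choice made so that the result is reproducible.

```python
def best_quality(quals, offset=33):
    """Pick the quality string with the highest summed Phred score.

    Returns (chosen, tied), where `tied` lists every distinct string that
    shares the top sum, so the caller can see when the choice was arbitrary.
    """
    def total(q):
        return sum(ord(c) - offset for c in q)

    top = max(total(q) for q in quals)
    tied = sorted({q for q in quals if total(q) == top})
    return tied[0], tied
```

For example, "II5I" and "I5II" have the same sum but are different strings, so
both come back in the tied list and the first one is returned as the choice.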

a similar scenario I could imagine sometimes being useful would be
'collapse sequences, but don't count Ns as mismatches'.  certainly
possible, but more complicated than simply collapsing reads that are
100% sequence-identical.
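
A short Python sketch shows where the extra complication comes from. The
helper name same_ignoring_n is hypothetical, and the rule that an N matches
any base is my reading of the idea above.

```python
def same_ignoring_n(a, b):
    """True if two reads have equal length and agree wherever neither has an N."""
    return len(a) == len(b) and all(
        x == y or x == "N" or y == "N" for x, y in zip(a, b)
    )


# The relation is not transitive: ANGT matches both ACGT and ATGT,
# but ACGT and ATGT do not match each other.
print(same_ignoring_n("ACGT", "ANGT"))
print(same_ignoring_n("ANGT", "ATGT"))
print(same_ignoring_n("ACGT", "ATGT"))
```

Because of this, reads can no longer be grouped by a simple exact-match lookup
on the sequence; a collapse tool would have to decide which group an
N-containing read joins.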

once again, just my own thoughts here.  Assaf or someone from galaxy
could perhaps answer better.

-ben



On Thu, Feb 3, 2011 at 12:22 PM, Johnson, Kory (NIH/NINDS) [C]
<[email protected]> wrote:
> Hello Galaxy users,
>
> Just to follow-up on my user group question described in the list-serv e-mail 
> just sent out.
>
> I put forth the question about FASTQ collapse, as the FASTX-toolkit by Assaf 
> Gordon describes the supported collapse tool as follows:
>
> "FASTQ/A Collapser, Collapsing identical sequences in a FASTQ/A file into a 
> single sequence (while maintaining reads counts)"
>
> Yet the collapse tool in Galaxy appears to support FASTA only?
>
> Why am I asking?
>
> I would like to remove duplicate reads in a FASTQ file by sequence, leaving
> one representative unique read having the best quality line among the
> duplicates it was identified from.
>
> Can certainly convert FASTQ to FASTA, then collapse, but if you do not have 
> the qual file, you cannot reconstitute a FASTQ file with actual qual scores.
>
> Any argument for or against?  Or can Galaxy already do this, and I am just
> missing the tool to use?
>
> Thanks ... best,
>
> Kory
>
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]]
> Sent: Thursday, February 03, 2011 12:52 PM
> To: [email protected]
> Subject: galaxy-user Digest, Vol 56, Issue 4
>
>
> Today's Topics:
>
>   1. CuffDiff gene fpkm tracking file. (Samuele Gherardi)
>   2. CuffDiff gene fpkm tracking file- Sorry! I sent only a part
>      of my email (Samuele Gherardi)
>   3. Re: listing attributes of data input (Peter)
>   4. Re: CuffDiff gene fpkm tracking file. (Jeremy Goecks)
>   5. Re: Downloadable Galaxy Virtual Machine in VMware
>      (Haarst, Jan van)
>   6. Re: Downloadable Galaxy Virtual Machine in VMware (Nate Coraor)
>   7. FASTQ collapse? (Johnson, Kory (NIH/NINDS) [C])
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 3 Feb 2011 09:53:44 +0000
> From: Samuele Gherardi <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: [galaxy-user] CuffDiff gene fpkm tracking file.
> Message-ID:
>        
> <025db19130de0b43bbd868bdb9244a82c...@e10-mbx3-dr.personale.dir.unibo.it>
>
> Content-Type: text/plain; charset="iso-8859-1"
>
>
>
> this is an example of my CuffDiff gene fpkm tracking file.
>
> tracking_id  class_code  nearest_ref_id  gene_short_name  tss_id  locus  q1_FPKM  q1_conf_lo  q1_conf_hi  q2_FPKM  q2_conf_lo  q2_conf_hi
> XLOC_000001  -  -  MT-ND5  -  chrM:0-16571  12484.2  12260.8  12707.7  11447  11233.1  11661
> XLOC_000002  -  -  USP14  TSS1,TSS2,TSS3  chr18:148586-236453  16.7235  9.41244  24.0346  19.437  11.7368  27.1371
> XLOC_000003  -  -  SMCHD1  TSS10,TSS11,TSS12,TSS4,TSS5,TSS6,TSS7,TSS8,TSS9  chr18:2719322-2728540  28.2493  17.5093  38.9892  27.2263  16.6263  37.8262
> XLOC_000004  -  -  EMILIN2  TSS13,TSS14  chr18:2880607-2882469  3.98118  0  7.99721  4.62875  0.278519  8.97899
>
> If this is normal, how can I find the class code of the transcripts listed
> in the CuffDiff gene expression file?
>
>
> thank you in advance
>
> Samuele.
>
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 3 Feb 2011 10:58:47 +0000
> From: Samuele Gherardi <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: [galaxy-user] CuffDiff gene fpkm tracking file- Sorry! I sent
>        only a part of my email
> Message-ID:
>        
> <025db19130de0b43bbd868bdb9244a82c...@e10-mbx3-dr.personale.dir.unibo.it>
>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hello everybody,
> I'm quite new to the NGS world, and I'm trying to analyze some RNA-seq data.
> I followed the workflow through tophat, cufflinks, cuffcompare and cuffdiff.
> I suppose everything worked fine, but in the Cuffdiff gene fpkm file the
> column Class_Code is empty and I don't know why.
>
> this is an example of my CuffDiff gene fpkm tracking file.
>
> tracking_id  class_code  nearest_ref_id  gene_short_name  tss_id  locus  q1_FPKM  q1_conf_lo  q1_conf_hi  q2_FPKM  q2_conf_lo  q2_conf_hi
> XLOC_000001  -  -  MT-ND5  -  chrM:0-16571  12484.2  12260.8  12707.7  11447  11233.1  11661
> XLOC_000002  -  -  USP14  TSS1,TSS2,TSS3  chr18:148586-236453  16.7235  9.41244  24.0346  19.437  11.7368  27.1371
> XLOC_000003  -  -  SMCHD1  TSS10,TSS11,TSS12,TSS4,TSS5,TSS6,TSS7,TSS8,TSS9  chr18:2719322-2728540  28.2493  17.5093  38.9892  27.2263  16.6263  37.8262
> XLOC_000004  -  -  EMILIN2  TSS13,TSS14  chr18:2880607-2882469  3.98118  0  7.99721  4.62875  0.278519  8.97899
>
> If this is normal, how can I find the class code of the transcripts listed
> in the CuffDiff gene expression file?
>
>
> thank you in advance
>
> Samuele.
>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 3 Feb 2011 11:05:07 +0000
> From: Peter <[email protected]>
> To: Freddy de Bree <[email protected]>
> Cc: [email protected]
> Subject: Re: [galaxy-user] listing attributes of data input
> Message-ID:
>        <[email protected]>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Thu, Feb 3, 2011 at 8:40 AM, Freddy de Bree <[email protected]> wrote:
>> Dear all,
>>
>> I was wondering if there is a place within the Galaxy content (a config, an
>> xml file)
>> that gives me some handles on how to address attributes of data input.
>>
>> For example, I can get the 'file_name' of data input, by addressing this
>> attr
>> as 'input.file_name', and for example, the original name of the data file
>> can be addressed with 'input.name'.
>>
>> Anyone any clues where I can find this info?
>>
>> Freddy de Bree
>
> Some are given in examples on the main tool XML doc page,
> https://bitbucket.org/galaxy/galaxy-central/wiki/ToolConfigSyntax
>
> Others I've noticed by looking at the provided XML wrappers,
> and/or email list questions. For example, .ext or .extension gives
> the Galaxy file type (e.g. fasta).
>
> Other than that, I guess you can always read the code - but I
> agree that a document describing this would be nice to have.
>
> Peter
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 3 Feb 2011 09:14:19 -0500
> From: Jeremy Goecks <[email protected]>
> To: Samuele Gherardi <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Subject: Re: [galaxy-user] CuffDiff gene fpkm tracking file.
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=us-ascii
>
>>
>> this is an example of my CuffDiff gene fpkm tracking file.
>>
>> tracking_id  class_code  nearest_ref_id  gene_short_name  tss_id  locus  q1_FPKM  q1_conf_lo  q1_conf_hi  q2_FPKM  q2_conf_lo  q2_conf_hi
>> XLOC_000001  -  -  MT-ND5  -  chrM:0-16571  12484.2  12260.8  12707.7  11447  11233.1  11661
>> XLOC_000002  -  -  USP14  TSS1,TSS2,TSS3  chr18:148586-236453  16.7235  9.41244  24.0346  19.437  11.7368  27.1371
>> XLOC_000003  -  -  SMCHD1  TSS10,TSS11,TSS12,TSS4,TSS5,TSS6,TSS7,TSS8,TSS9  chr18:2719322-2728540  28.2493  17.5093  38.9892  27.2263  16.6263  37.8262
>> XLOC_000004  -  -  EMILIN2  TSS13,TSS14  chr18:2880607-2882469  3.98118  0  7.99721  4.62875  0.278519  8.97899
>>
>> If this is normal, how can I find the class code of the transcripts listed
>> in the CuffDiff gene expression file?
>>
>
>
> Hi Samuele,
>
> Without seeing your history, it's difficult to say for certain what your 
> problem is. However, I'd guess that the GTF file that you're providing to 
> Cuffdiff does not have the p_id attribute. You can produce a GTF file with 
> both tss_id and p_id attributes by running Cuffcompare and using sequence 
> data.
>
> Thanks,
> J.
>
>
> ------------------------------
>
> Message: 5
> Date: Thu, 3 Feb 2011 16:54:14 +0100
> From: "Haarst, Jan van" <[email protected]>
> To: "'Leon Mei'" <[email protected]>,
>        "'[email protected]'"        <[email protected]>
> Cc: 'David van Enckevort' <[email protected]>,        'Rob Hooft'
>        <[email protected]>
> Subject: Re: [galaxy-user] Downloadable Galaxy Virtual Machine in
>        VMware
> Message-ID:
>        <[email protected]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> The download can also be done using bittorrent, torrent is available at 
> http://www.biotorrents.net/details.php?id=136 .
> This might be faster, as one of the peers is in Canada.
>
> With kind regards,
> Jan
>
>
>> -----Original Message-----
>> From: [email protected] [mailto:galaxy-user-
>> [email protected]] On Behalf Of Leon Mei
>> Sent: Tuesday, February 01, 2011 1:08 PM
>> To: [email protected]
>> Cc: David van Enckevort; Rob Hooft
>> Subject: [galaxy-user] Downloadable Galaxy Virtual Machine in VMware
>>
>> Dear list,
>>
>> Within the Netherlands Bioinformatics Centre, we have implemented a Galaxy 
>> VM by
>> wrapping up the distributed version from PennState. We are also adding more 
>> easy-to-
>> use pipelines for Genomics and Proteomics data analysis into this server at 
>> the
>> moment.
>>
>> You can download the current version at: http://bet1.nbiceng.net/galaxy/
>>
>> More documents can be found at: https://wiki.nbic.nl/index.php/Galaxy_VM
>>
>> By sharing this VM, we hope to relieve you from the normal installation and
>> configuration steps (sometimes can be difficult) if you want to run a local 
>> Galaxy
>> instance.
>>
>> If you are interested in this procedure or have questions on that, we are 
>> happy to
>> share our experience and scripts.
>>
>> Best regards,
>> Leon
>>
>> --
>> Hailiang (Leon) Mei
>> Netherlands Bioinformatics Center (http://www.nbic.nl/)
>> Skype: leon_mei    Mobile: +31 6 41709231
>>
>> _______________________________________________
>> galaxy-user mailing list
>> [email protected]
>> http://lists.bx.psu.edu/listinfo/galaxy-user
>
>
>
>
>
> ------------------------------
>
> Message: 6
> Date: Thu, 3 Feb 2011 11:37:01 -0500
> From: Nate Coraor <[email protected]>
> To: "Haarst, Jan van" <[email protected]>
> Cc: "'[email protected]'" <[email protected]>,
>        'Leon Mei' <[email protected]>,      'David van Enckevort'
>        <[email protected]>,  'Rob Hooft' <[email protected]>
> Subject: Re: [galaxy-user] Downloadable Galaxy Virtual Machine in
>        VMware
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=iso-8859-1
>
> Haarst, Jan van wrote:
>> The download can also be done using bittorrent, torrent is available at 
>> http://www.biotorrents.net/details.php?id=136 .
>> This might be faster, as one of the peers is in Canada.
>
> This is great!  I haven't checked the image out, but I'm fetching the
> torrent now and will leave it seeding here from PSU to help out.
>
> Thanks,
> --nate
>
>
>
> ------------------------------
>
> Message: 7
> Date: Thu, 3 Feb 2011 12:51:34 -0500
> From: "Johnson, Kory (NIH/NINDS) [C]" <[email protected]>
> To: "'[email protected]'" <[email protected]>
> Subject: [galaxy-user] FASTQ collapse?
> Message-ID:
>        <[email protected]>
> Content-Type: text/plain; charset="us-ascii"
>
> Hello,
>
> Is there an option to collapse duplicate sequences in FASTQ format?
>
> I see collapse for FASTA, but where is it for FASTQ?
>
> Thank you,
>
> Kory
>
> ------------------------------
>
> End of galaxy-user Digest, Vol 56, Issue 4
> ******************************************
>

_______________________________________________
galaxy-user mailing list
[email protected]
http://lists.bx.psu.edu/listinfo/galaxy-user
