I'm not terribly familiar with what Galaxy offers along these lines, but google 'FastQC' or Picard tools for a simple way to get the sort of quality distributions you just mentioned. It would take a little work on your end, but you could answer that question yourself. FastQC actually uses Picard tools behind the scenes, but is a little more graphically oriented.
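For what it's worth, here is a rough sketch of the sort of "little work" I mean for the duplicate question. It is not how FASTX-toolkit or Galaxy does anything, just one way to group reads by sequence, count how many duplicated sequences carry differing quality strings, and keep the read with the highest summed quality as the representative. The function names are made up for illustration, and it assumes plain 4-line FASTQ records with Phred+33 quality encoding.

```python
import io
from collections import defaultdict

# Assumption: Phred+33 (Sanger / newer Illumina). Older Illumina files use an offset of 64.
PHRED_OFFSET = 33


def read_fastq(handle):
    """Yield (header, sequence, quality) from a plain 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (a blank line also stops the sketch)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def summed_quality(qual):
    """Sum of the per-base Phred scores encoded in a quality string."""
    return sum(ord(c) - PHRED_OFFSET for c in qual)


def collapse_fastq(handle):
    """Collapse reads that are 100% sequence-identical.

    Returns (collapsed, stats). Each collapsed entry is
    (header, sequence, quality, read_count), where header and quality come
    from the duplicate with the highest summed quality. Ties go to the read
    seen first, which is an arbitrary choice.
    """
    groups = defaultdict(list)
    for header, seq, qual in read_fastq(handle):
        groups[seq].append((header, qual))

    collapsed = []
    stats = {"unique_sequences": 0, "duplicated_sequences": 0, "discordant_quality": 0}
    for seq, members in groups.items():
        stats["unique_sequences"] += 1
        if len(members) > 1:
            stats["duplicated_sequences"] += 1
            # Duplicates whose quality strings are not all identical
            if len({qual for _, qual in members}) > 1:
                stats["discordant_quality"] += 1
        header, qual = max(members, key=lambda m: summed_quality(m[1]))
        collapsed.append((header, seq, qual, len(members)))
    return collapsed, stats


# Tiny made-up example: r1 and r2 share a sequence but differ in quality.
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nII5I\n"
    "@r3\nTTGA\n+\n5555\n"
)
collapsed, stats = collapse_fastq(demo)
for header, seq, qual, count in collapsed:
    print(header, seq, qual, count)
print(stats)
```

Everything is held in memory, so this is only sensible for modest files. The tie-breaking problem is still there: two quality strings with the same sum are resolved by whichever read came first.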
-ben

On Thu, Feb 3, 2011 at 1:23 PM, Johnson, Kory (NIH/NINDS) [C] <[email protected]> wrote:
> Hi Ben,
>
> Thanks for your reply!
>
> Yes, perhaps taking average, median, max quality value by position across
> duplicate reads, using sliding window function, or simply randomly selecting
> one quality line from those having the highest quality summed value is the
> way to go.
>
> The complexity of problem I would think can be greatly reduced if "FASTQ
> collapse" is done post FASTQ filtering by quality and/or length.
>
> It would also be an interesting question to ask::answer what the numbers look
> like post filtering as far as duplicate sequences having the same quality
> string vs differences by position vs base composition.
>
> Perhaps a FASTQ collapse tool could be developed in such a way to not only
> remove duplicates and replace with a representative quality line, but also be
> used as a way to perform a FASTQ filtering polish step based on discordance
> rates by position across duplicates.
>
> Such that, a user can look at the discordance rates via box plot as you
> would/can for quality scores observed across all reads, and pick a refined
> criteria for filtering.
>
> Just an idea.
>
> Thanks again for taking the time to respond Ben.
>
> Best,
>
> Kory
>
> --------------------------------------------
>
> Kory R. Johnson, MS, PhD
> Sr. Bioinformatics Scientist
>
> www.kellygovernmentsolutions.com
>
> Providing Contract Services For:
>
> Bioinformatics Section,
> Information Technology & Bioinformatics Program,
> Division of Intramural Research (DIR),
> National Institute of Neurological Disorders & Stroke (NINDS),
> National Institutes of Health (NIH),
> Bethesda, Maryland
>
> Mailing Address:
>
> NINDS/NIH
> Clinical Center (Building 10)
> Office 5S223
> 9000 Rockville Pike
> Bethesda, MD 20892
>
> Contact Information:
>
> Phone: 301-402-1956
> Fax: 301-480-3563
> email: [email protected]
>
> Green Message:
>
> Please consider the environment before printing this e-mail. Thank you.
>
> Important Message:
>
> This electronic message transmission contains information intended for the
> recipient only. Such that, the information contained herein may be
> confidential, privaledged, or proprietary. If you are not the intended
> recipient, be aware that any disclosure, copying, distribution, or use of
> this information is strictly prohibited. If you have received this
> electronic information in error, please notify the sender immediately by
> telephone. Thank you.
>
> --------------------------------------------
>
> -----Original Message-----
> From: Ben Bimber [mailto:[email protected]]
> Sent: Thursday, February 03, 2011 1:37 PM
> To: Johnson, Kory (NIH/NINDS) [C]; [email protected]
> Subject: Re: [galaxy-user] FASTQ collapse?
>
> i dont have any intimate knowledge here, but my guess is that it comes
> down to defining what quality scores to keep. collapsing sequences is
> easy. the sequences either they match or they dont.
>
> handling sequence+quality is harder. would you keep the quality
> string with the highest total sum of qualities? what if 2 have an
> identical sum, but different strings (which is probably not uncommon)?
> in theory you could create a completely new quality string that
> attempts to gather average quality based on the quality score at each
> position. all of these are possible, but it starts becoming less
> transparent and more complex. collapsing FASTQ to FASTA is simply a
> way to remove that problem.
>
> a similar scenario I could imagine sometimes being useful would be
> 'collapse sequences, but don't count Ns as mismatches'. certainly
> possible, but more complicated than simply collapsing reads that are
> 100% sequence-identical.
>
> once again, just my own thoughts here. Assaf or someone from galaxy
> could perhaps answer better.
>
> -ben
>
> On Thu, Feb 3, 2011 at 12:22 PM, Johnson, Kory (NIH/NINDS) [C]
> <[email protected]> wrote:
>> Hello Galaxy users,
>>
>> Just to follow-up on my user group question described in the list-serv
>> e-mail just sent out.
>>
>> I put forth the question about FASTQ collapse, as the FASTX-toolkit by Assaf
>> Gordon describes the supported collapse tool as follows:
>>
>> "FASTQ/A Collapser, Collapsing identical sequences in a FASTQ/A file into a
>> single sequence (while maintaining reads counts)"
>>
>> Yet, the collapse tool in Galaxy appears to be FASTA supported only?
>>
>> Why am I asking?
>>
>> Would like to remove duplicate reads in a FASTQ file by sequence, leaving
>> one representative unique read having the best quality line among the
>> duplicates it was identified from.
>>
>> Can certainly convert FASTQ to FASTA, then collapse, but if you do not have
>> the qual file, you cannot reconstitute a FASTQ file with actual qual scores.
>>
>> Any argument for or against? Or can Galaxy already do and I am missing the
>> tool to actually use?
>>
>> Thanks ... best,
>>
>> Kory
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]]
>> Sent: Thursday, February 03, 2011 12:52 PM
>> To: [email protected]
>> Subject: galaxy-user Digest, Vol 56, Issue 4
>>
>> Message: 7
>> Date: Thu, 3 Feb 2011 12:51:34 -0500
>> From: "Johnson, Kory (NIH/NINDS) [C]" <[email protected]>
>> To: "'[email protected]'" <[email protected]>
>> Subject: [galaxy-user] FASTQ collapse?
>>
>> Hello,
>>
>> Is there an option to collapse duplicate sequences in FASTQ format.
>>
>> I see collapse for FASTA, but where is it for FASTQ?
>>
>> Thank you,
>>
>> Kory
>
_______________________________________________
galaxy-user mailing list
[email protected]
http://lists.bx.psu.edu/listinfo/galaxy-user

