UCSC workflow?

Robert Curtis Hendrickson Tue, 07 Jun 2011 09:45:57 -0700

Jen, 

Thanks for all your help. 
Here's the final Galaxy workflow for running FUZZNUC on a BED file from the UCSC 
Table Browser, then producing a BED file that you can view in UCSC.


http://main.g2.bx.psu.edu/u/curtish-uab/w/fuzznucucscbed

I do not include the "Get Flank" operation in this base workflow, but include a 
note in the description. 
I have not (yet) had time to make the score in the final BED dependent on the 
quality of the match, when mis-matches are allowed, but I hope to come back and 
add that later. 
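
One way the match-quality score could be computed is sketched below in Python. It is an illustration only, not part of the published workflow: the function name `bed_score` and the linear penalty are assumptions. The one fixed constraint is that the BED score field is an integer from 0 to 1000.

```python
def bed_score(mismatches: int, max_mismatches: int) -> int:
    """Map a mismatch count onto the BED score range (0-1000).

    Assumed scheme (hypothetical): a perfect match scores 1000 and a hit
    with the maximum allowed number of mismatches scores 500, with a
    linear scale in between.
    """
    if max_mismatches <= 0:
        # No mismatches allowed, so every reported hit is a perfect match.
        return 1000
    if not 0 <= mismatches <= max_mismatches:
        raise ValueError("mismatches must be between 0 and max_mismatches")
    return round(1000 - 500 * mismatches / max_mismatches)
```

With two mismatches allowed, a hit with one mismatch would score 750 under this scheme. Any monotone mapping into 0-1000 would serve equally well.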

How does one handle versioning of published workflows? Do I update the existing 
one, or create another with a .v2 name? 

Also, I used several "Text Manipulation> Compute" steps - is there any way to 
compute more than 1 new column at a time? 


Regards, 
Curtis



> -----Original Message-----
> From: Jennifer Jackson [mailto:[email protected]]
> Sent: Wednesday, May 18, 2011 11:45 AM
> To: Robert Curtis Hendrickson
> Cc: galaxy-user
> Subject: Re: [galaxy-user] UCSC->EMBOSS/fuzznuc->UCSC workflow?
> 
> Hello Curtis,
> 
> The BED extraction data can be resolved in Galaxy. Pull out the whole
> gene and then modify the coordinates in Galaxy to be 10k upstream.
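
A minimal sketch of that coordinate adjustment, in Python rather than as Galaxy Compute steps. The function name `extend_upstream` and the 10,000 bp default are illustrative assumptions. Coordinates are taken to be BED-style (zero-based start, exclusive end), and which side counts as "upstream" depends on the strand.

```python
def extend_upstream(start: int, end: int, strand: str, flank: int = 10_000):
    """Extend a BED interval by `flank` bases on its upstream side.

    On the + strand, upstream is toward smaller coordinates, so the
    start moves left (never below 0). On the - strand, upstream is toward
    larger coordinates, so the end moves right.
    """
    if strand == "+":
        return max(0, start - flank), end
    if strand == "-":
        # Assumption: no chromosome-length clipping is done here.
        return start, end + flank
    raise ValueError(f"unexpected strand: {strand!r}")
```

For a gene at 50000-60000 this gives 40000-60000 on the + strand and 50000-70000 on the - strand. A real workflow would also need to clip the extended end at the chromosome length.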
> 
> To be clear - this coordinate data is going to be used to transform the
> coordinates in your current fuzznuc output that is transcript-based to
> be genome-based. The coordinates are not input for fuzznuc - they are
> used after fuzznuc is run on the fasta file, in order to convert the
> result coordinates only.
> 
> This page in the UCSC wiki has a good description of how the UCSC
> coordinates are organized.
> http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
> 
> The output format for fuzznuc is documented in the tool's help - the
> last line on the tool form has a link.
> 
> Hopefully this helps to clear up the suggested processing,
> 
> Thanks,
> 
> Jen
> Galaxy team
> 
> 
> 
> On 5/17/11 2:08 PM, Robert Curtis Hendrickson wrote:
> > Jennifer,
> >
> > I tried getting data from UCSC as .BED - two issues:
> >
> > 1. Unlike "get sequence", I can no longer specify how far upstream I
> > want - it's EITHER "whole gene" (what's the definition of that!!!) OR
> > #bp_upstream OR exons OR introns -- with get seq those are not mutually
> > exclusive - I happen to want the genomic region (5'UTR, exons, introns
> > 3'UTR + 10kbp upstream of 5'UTR)
> >
> > 2. fuzznuc does not recognize BED as a valid input format. So, I can't
> > run fuzznuc because my BED file doesn't show up in the pulldown.
> >
> > Indeed, BED files are just annotation, they don't carry any sequence.
> >
> > Have I mis-understood your directions?
> >
> > Regards,
> >
> > Curtis
> >
> > -----Original Message-----
> > From: Jennifer Jackson [mailto:[email protected]]
> > Sent: Tuesday, May 17, 2011 11:23 AM
> > To: Robert Curtis Hendrickson
> > Cc: '[email protected]'
> > Subject: Re: [galaxy-user] UCSC->EMBOSS/fuzznuc->UCSC workflow?
> >
> > Hello Curtis,
> >
> > No need to use the fasta headers from your original fasta file.
> >
> > To obtain the coordinates in BED format: use "Get Data -> UCSC main"
> > again to link to the UCSC Table Browser, set the same selection criteria
> > as for the original fasta sequence, and change only the output type to
> > "BED" (instead of "sequence"). Once in your Galaxy history, this format
> > will be easier to work with.
> >
> > Best,
> >
> > Jen
> > Galaxy team
> >
> > On 5/16/11 9:04 PM, Robert Curtis Hendrickson wrote:
> > > Jennifer,
> > >
> > > Thanks for the outline. I'll try that approach.
> > >
> > > However, it seems rather painful to have to join the fuzznuc output
> > > back to the original fasta to get at the header information that really
> > > should have been passed along. It would seem that there must be a way to
> > > get the data out of UCSC without that space in the fasta header, so that
> > > the chromosome & genomic location get correctly preserved in the fuzznuc
> > > output. Failing that, is there an easy text manipulation that would
> > > convert that fasta header space to a "|"?
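
Outside Galaxy, the header rewrite asked about here could look like the Python sketch below. The function name `join_header_fields` and the sample header in the note are hypothetical. Whether fuzznuc then carries the whole joined identifier through to its output is an assumption that would need checking, so the separator is left configurable.

```python
def join_header_fields(fasta_text: str, sep: str = "|") -> str:
    """Replace whitespace in FASTA header lines with `sep`.

    Only lines starting with '>' are changed; sequence lines pass
    through untouched. Runs of whitespace collapse to one separator.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = ">" + sep.join(line[1:].split())
        out.append(line)
    return "\n".join(out) + "\n"
```

A header such as `>seq1 range=chr1:100-200 strand=+` would become `>seq1|range=chr1:100-200|strand=+`. If "|" turns out to be treated specially by a downstream tool, passing `sep="_"` is an easy alternative.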
> > >
> > > Regards,
> > > Curtis
> > >
> > > -----Original Message-----
> > > From: Jennifer Jackson [mailto:[email protected]]
> > > Sent: Monday, May 16, 2011 6:50 PM
> > > To: Robert Curtis Hendrickson
> > > Cc: '[email protected]'
> > > Subject: Re: [galaxy-user] UCSC->EMBOSS/fuzznuc->UCSC workflow?
> > >
> > > Hello Curtis,
> > >
> > > The coordinates of your match are with respect to the fasta sequence,
> > > not with respect to the reference genome. Only data mapped to the
> > > reference genome can be viewed in the UCSC Browser.
> > >
> > > You will need to calculate from the position of the match in the fasta
> > > sequence back through to the reference genome.
> > >
> > > One suggested way to do this:
> > >
> > > a) Merge together the original genomic coordinates of the 2kb regions
> > > with each line of output from fuzznuc. Use the original source fasta
> > > sequence name as the common key for the merge. If both datasets are in
> > > BED format, that would be ideal and make the following steps possible.
> > > You may need to split the file based on whether the original fasta
> > > sequence came from the positive or negative strand to run "b" and "c"
> > > below separately.
> > > b) Use "Text Manipulation -> Compute an expression on every row" to
> > > create new coordinates. For example, if your data is on the positive
> > > strand, and base 1 in your fasta file was genomic coordinate 100, and
> > > the alignment from fuzznuc started at base 5 (local coordinate == "4" if
> > > in BED format with a zero-based start), then the new genomic start
> > > coordinate would be (100 + 4) = 104. Do this for both start and stop.
> > > c) If any of your original fasta sequences are from the negative
> > > strand, adjust the logic from "b" and run it on just the negative strand
> > > portion of your data ("b" itself would be run on just the positive
> > > strand portion).
> > > d) Arrange/cut the resulting file down into a standard BED format to
> > > remove the local coordinates and keep the genomic coordinates, using the
> > > original chromosome names.
> > > e) Once the logic for the calculations is worked out, save the process
> > > into a workflow for use again.
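
Steps b and c above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the logic of the published workflow: the function name `local_to_genomic` is made up, all intervals are BED-style (zero-based start, exclusive end), and a negative-strand fasta sequence is assumed to be the reverse complement of its genomic region, so local offsets count back from the region's end.

```python
def local_to_genomic(region_start: int, region_end: int, strand: str,
                     local_start: int, local_end: int):
    """Convert a hit's offsets within a fasta sequence to genomic coordinates.

    region_start/region_end: genomic interval the fasta sequence came from.
    local_start/local_end: hit interval within that sequence.
    All intervals are zero-based, end-exclusive (BED convention).
    """
    if strand == "+":
        # Step b: offsets count forward from the region start.
        return region_start + local_start, region_start + local_end
    if strand == "-":
        # Step c: the sequence is reverse-complemented, so offsets count
        # backward from the region end, and start and end swap roles.
        return region_end - local_end, region_end - local_start
    raise ValueError(f"unexpected strand: {strand!r}")
```

With the numbers from step b (region start 100, local start 4) the positive-strand start comes out as 104. For a region 100-2100 on the negative strand, a hit at local 4-12 maps to 2088-2096.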
> > >
> > > Hopefully this helps,
> > >
> > > Best,
> > >
> > > Jen
> > > Galaxy team
> > >
> > > On 5/13/11 9:32 AM, Robert Curtis Hendrickson wrote:
> > >> Folks,
> > >>
> > >> I wanted to scan the 2kb upstream of a list of human gene isoforms
> > >> for TFBS using fuzznuc. I was able to
> > >> "Get Data"> "UCSC Main"> "As sequence" and get my sequences
> > >> "EMBOSS"> fuzznuc ran fine, and output the hits
> > >>
> > >> HOWEVER, fuzznuc lost the genomic position information that UCSC has
> > >> put after a space in the sequence headers of the FASTA file. It only
> > >> provided offsets within the fasta.
> > >>
> > >> http://main.g2.bx.psu.edu/u/curtish-uab/h/ucsc-fuzznuc-ucsc-broken
> > >>
> > >> Thus, when I converted the fuzznuc output back to a BED file and
> > >> tried to visualize the hits in the UCSC browser, it failed with
> > >> "invalid BED File".
> > >> I tried fuzznuc with output: seqtable, feattable and gff3, but in
> > >> all cases the genomic position was missing, and being a bit of a Galaxy
> > >> novice, I couldn't figure out how to get the output back to UCSC to
> > >> visualize the hits.
> > >>
> > >> Can anyone tell me how to link up these tools correctly, or share a
> > >> history with some other tool set that accomplishes this goal?
> > >>
> > >> Regards,
> > >> Curtis
> > >>
> > >> Research Associate
> > >> Center for Clinical and Translational Science
> > >> University of Alabama at Birmingham
> > >>
> > >> ___________________________________________________________
> > >> The Galaxy User list should be used for the discussion of
> > >> Galaxy analysis and other features on the public server
> > >> at usegalaxy.org. Please keep all replies on the list by
> > >> using "reply all" in your mail client. For discussion of
> > >> local Galaxy instances and the Galaxy source code, please
> > >> use the Galaxy Development list:
> > >>
> > >> http://lists.bx.psu.edu/listinfo/galaxy-dev
> > >>
> > >> To manage your subscriptions to this and other Galaxy lists,
> > >> please use the interface at:
> > >>
> > >> http://lists.bx.psu.edu/
> > >
> > --
> > Jennifer Jackson
> > http://usegalaxy.org
> > http://galaxyproject.org
> >
> 
> --
> Jennifer Jackson
> http://usegalaxy.org
> http://galaxyproject.org
