Re: [galaxy-user] UCSC->EMBOSS/fuzznuc->UCSC workflow?

Jennifer Jackson Tue, 17 May 2011 09:25:12 -0700

Hello Curtis,

No need to use the fasta headers from your original fasta file.

To obtain the coordinates in BED format: use "Get Data -> UCSC main" again to link to the UCSC Table Browser, set the same selection criteria as for the original fasta sequence, and change only the output type to "BED" (instead of "sequence"). Once in your Galaxy history, this format will be easier to work with.


Best,

Jen
Galaxy team

On 5/16/11 9:04 PM, Robert Curtis Hendrickson wrote:

Jennifer,

Thanks for the outline. I'll try that approach.

However, it seems rather painful to have to join the fuzznuc output back to the original 
fasta to get at the header information that really should have been passed along. It would 
seem that there must be a way to get the data out of UCSC without that space in the fasta 
header, so that the chromosome & genomic location get correctly preserved in the fuzznuc 
output. Failing that, is there an easy text manipulation that would convert that fasta header 
space to a "|"?

Regards,
Curtis
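
[Editorial sketch] The text manipulation asked about above could look roughly like the following, for anyone working outside Galaxy. It is only a sketch: the function name join_header_fields is hypothetical, and it assumes that replacing every space in a header line (not just the first) is acceptable. Whether fuzznuc then carries the whole joined identifier through to its output has not been verified here.

```python
def join_header_fields(lines, sep="|"):
    """Yield FASTA lines with spaces in header lines replaced by `sep`.

    Hypothetical helper for illustration. Only lines starting with ">"
    are changed; sequence lines are passed through untouched.
    """
    for line in lines:
        if line.startswith(">"):
            # Join the name and the fields after it into one token,
            # so nothing sits after whitespace in the header.
            yield line.replace(" ", sep)
        else:
            yield line


if __name__ == "__main__":
    # Made-up header for demonstration; real UCSC headers may differ.
    example = [">seq1 range=chr1:1-10 strand=+", "ACGTACGTAC"]
    for out in join_header_fields(example):
        print(out)
```

The same generator can be applied to an open file handle, writing each yielded line to a new fasta file before it is passed to fuzznuc.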


-----Original Message-----
From: Jennifer Jackson [mailto:[email protected]]
Sent: Monday, May 16, 2011 6:50 PM
To: Robert Curtis Hendrickson
Cc: '[email protected]'
Subject: Re: [galaxy-user] UCSC->EMBOSS/fuzznuc->UCSC workflow?

Hello Curtis,

The coordinates of your match are with respect to the fasta sequence,
not with respect to the reference genome. Only data mapped to the
reference genome can be viewed in the UCSC Browser.

You will need to calculate from the position of the match in the fasta
sequence back through to the reference genome.

One suggested way to do this:

a) Merge together the original genomic coordinates of the 2kb regions
with each line of output from fuzznuc. Use the original source fasta
sequence name as the common key for the merge. If both data are in BED
format, that would be ideal and make the following steps possible. You
may need to split the file based on whether the original fasta sequence
came from the positive or negative strand to run "b" and "c" below
separately.
b) Use "Text Manipulation -> Compute an expression on every row" to
create new coordinates. For example, if your data is on the positive
strand, and base 1 in your fasta file was genomic coordinate 100, and
the alignment from fuzznuc started at base 5 (local coordinate == "4" if
in BED format with a zero-based start), then the new genomic start
coordinate would be (100 + 4) = 104. Do this for both start and stop.
c) If any of your original fasta sequences are from the negative
strand, adjust the logic from "b" and apply it to just the negative
strand portion of your data ("b" would be run on just the positive
strand portion of your data).
d) Arrange/cut the resulting file down into a standard BED format to
remove the local coordinates and keep the genomic coordinates, using the
original chromosome names.
e) Once the logic for the calculations is worked out, save the process
into a workflow for use again.
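
The arithmetic in steps "b" and "c" can be sketched in Python as below, for checking the logic before building it into Galaxy expressions. This is a sketch under stated assumptions, not a Galaxy tool: the function name local_to_genomic is hypothetical, all coordinates are assumed to be BED-style (zero-based start, end exclusive), and the negative-strand case assumes the fasta sequence was returned reverse-complemented, so that local offset 0 sits at the genomic end of the region.

```python
def local_to_genomic(region_start, region_end, strand, local_start, local_end):
    """Map a match inside an extracted region back to genomic coordinates.

    Hypothetical helper for illustration. All values are assumed to be
    BED-style: zero-based start, end exclusive.
    region_start/region_end: genomic coordinates of the extracted region.
    local_start/local_end:   match position within the fasta sequence.
    """
    if strand == "+":
        # Step "b": offsets count up from the region start.
        return region_start + local_start, region_start + local_end
    if strand == "-":
        # Step "c": assumes the sequence is reverse-complemented, so
        # offsets count down from the region end.
        return region_end - local_end, region_end - local_start
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    # The example from step "b": region starts at 100, local start is 4.
    print(local_to_genomic(100, 2100, "+", 4, 10))
    # The same local match if the region were on the negative strand.
    print(local_to_genomic(100, 2100, "-", 4, 10))
```

If the match positions are 1-based and inclusive rather than BED-style, subtract 1 from the start and leave the end unchanged before calling the function.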

Hopefully this helps,

Best,

Jen
Galaxy team

On 5/13/11 9:32 AM, Robert Curtis Hendrickson wrote:

Folks,

I wanted to scan the 2kb upstream of a list of human gene isoforms for TFBS 
using fuzznuc. I was able to
"Get Data" > "UCSC Main" > "As sequence" and get my sequences
"EMBOSS" > fuzznuc ran fine, and output the hits

HOWEVER, fuzznuc lost the genomic position information that UCSC has put after 
a space in the sequence headers of the FASTA file. It only provided offsets 
within the fasta.

http://main.g2.bx.psu.edu/u/curtish-uab/h/ucsc-fuzznuc-ucsc-broken

Thus, when I converted the fuzznuc output back to a BED file and tried to visualize the 
hits in UCSC browser, it failed with "invalid BED File".
I tried fuzznuc with output: seqtable, feattable and gff3, but in all cases the 
genomic position was missing, and being a bit of a Galaxy novice, I couldn't 
figure out how to get the output back to UCSC to visualize the hits.

Can anyone tell me how to link up these tools correctly, or share a history 
with some other tool set that accomplishes this goal?

Regards,
Curtis

Research Associate
Center for Clinical and Translational Science
University of Alabama at Birmingham

___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

    http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

    http://lists.bx.psu.edu/


--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org