Re: [Genome] Retrieve multiz alignment bases for many regions using script?

Jennifer Jackson Fri, 02 Apr 2010 10:41:18 -0700

Hello John,

Sorry to hear about the problems transferring data over to Galaxy. It is 
a long process. It would be much better to download the files locally 
and use some of the utilities from the kent source tree.


The usage messages for some popular MAF utilities are below.

You could also look at the maf* programs from the utility set and see 
which will perform the exact function(s) that you need:
http://genomewiki.cse.ucsc.edu/index.php/Kent_source_utilities

Hopefully this helps!
Jennifer
-------

$ mafFrags
mafFrags - Collect MAFs from regions specified in a 6 column bed file
usage:
    mafFrags database track in.bed out.maf
options:
    -orgs=org.txt - File with list of databases/organisms in order
    -bed12 - If set, in.bed is a bed 12 file, including exons
    -thickOnly - Only extract subset between thickStart/thickEnd
    -meFirst - Put native sequence first in maf
    -txStarts - Add MAF txstart region definitions ('r' lines) using BED name
     and output actual reference genome coordinates in MAF.
    -refCoords - output actual reference genome coordinates in MAF.
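
As a rough illustration of how the mafFrags usage above could be scripted, the
Python sketch below formats 6-column BED records and assembles the command
line. The database (mm9), the track name (multiz30way), the file names, and
the coordinates are assumptions for illustration and do not come from this
thread; the helper names are hypothetical. The sketch only prints the command,
since mafFrags has to be installed and, as far as I know, needs access to a
UCSC-style database.

```python
def bed6_line(chrom, start, end, name, score=0, strand="+"):
    """Format one 6-column BED record (0-based start, end exclusive)."""
    if start < 0 or end <= start:
        raise ValueError("need 0 <= start < end")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return "\t".join(str(x) for x in (chrom, start, end, name, score, strand))


def maf_frags_cmd(database, track, in_bed, out_maf, orgs=None, ref_coords=False):
    """Build an argument list following 'mafFrags database track in.bed out.maf'."""
    cmd = ["mafFrags"]
    if orgs is not None:
        cmd.append("-orgs=" + orgs)      # file listing databases/organisms in order
    if ref_coords:
        cmd.append("-refCoords")         # keep reference genome coordinates
    cmd += [database, track, in_bed, out_maf]
    return cmd


if __name__ == "__main__":
    # Hypothetical regions; replace with the real annotation coordinates.
    regions = [
        ("chr10", 1000, 1200, "region_1"),
        ("chr10", 5000, 5300, "region_2", 0, "-"),
    ]
    for region in regions:
        print(bed6_line(*region))
    # Assumed database and track names, shown only as an example.
    print(" ".join(maf_frags_cmd("mm9", "multiz30way", "in.bed", "out.maf",
                                 ref_coords=True)))
```

Once the tool is available, the returned list can be handed to
subprocess.run(cmd, check=True) instead of being printed.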

$ mafsInRegion
mafsInRegion - Extract MAFS in a genomic region
usage:
     mafsInRegion regions.bed out.maf|outDir in.maf(s)
options:
     -outDir - output separate files named by bed name field to outDir
     -keepInitialGaps - keep alignment columns at the beginning and end of a
       block that are gapped in all species
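
For locally downloaded MAF files, the mafsInRegion usage above could be
wrapped along these lines. This is a minimal sketch: the helper name and the
file names (regions.bed, chr10.maf, slices/) are hypothetical, and the command
is only assembled, not run.

```python
def mafs_in_region_cmd(regions_bed, out, in_mafs, out_dir=False,
                       keep_initial_gaps=False):
    """Build an argument list following
    'mafsInRegion regions.bed out.maf|outDir in.maf(s)'.

    With out_dir=True, 'out' is treated as a directory and the tool writes
    one file per BED name field, as described for -outDir.
    """
    in_mafs = list(in_mafs)
    if not in_mafs:
        raise ValueError("at least one input MAF file is required")
    cmd = ["mafsInRegion"]
    if out_dir:
        cmd.append("-outDir")
    if keep_initial_gaps:
        cmd.append("-keepInitialGaps")
    cmd += [regions_bed, out] + in_mafs
    return cmd


if __name__ == "__main__":
    # One output MAF for all regions.
    print(" ".join(mafs_in_region_cmd("regions.bed", "out.maf", ["chr10.maf"])))
    # One output file per region name, written into a directory.
    print(" ".join(mafs_in_region_cmd("regions.bed", "slices/", ["chr10.maf"],
                                      out_dir=True)))
```

Giving each BED record a unique name is advisable with -outDir, because the
output files are named by the BED name field.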

$ mafSplit
mafSplit - Split multiple alignment files
usage:
    mafSplit splits.bed outRoot file(s).maf
options:
    -byTarget       Make one file per target sequence.  (splits.bed input
                    is ignored).
    -outDirDepth=N  For use only with -byTarget.
                    Create N levels of output directory under current dir.
                    This helps prevent NFS problems with a large number of
                    files in a directory.  Using -outDirDepth=3 would
                    produce ./1/2/3/outRoot123.maf.
    -useSequenceName  For use only with -byTarget.
                      Instead of auto-incrementing an integer to determine
                      output filename, expect each target sequence name to
                      end with a unique number and use that number as the
                      integer to tack onto outRoot.
    -useHashedName=N  For use only with -byTarget.
                      Instead of auto-incrementing an integer or requiring
                      a unique number in the sequence name, use a hash
                      function on the sequence name to compute an N-bit
                      number.  This limits the max #filenames to 2^N and
                      ensures that even if different subsets of sequences
                      appear in different pairwise mafs, the split file
                      names will be consistent (due to hash function).
                      This option is useful when a "scaffold-based"
                      assembly has more than one sequence name pattern,
                      e.g. both chroms and scaffolds.
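
To make the -useHashedName idea concrete, here is a small Python sketch. It
uses CRC32 as a stand-in hash, which is an assumption for illustration and is
not the hash function mafSplit itself uses, so the numbers will differ from
the real tool. It only shows why an N-bit hash caps the number of file names
at 2^N and gives the same name for the same sequence in every run. The helper
names and file names are hypothetical.

```python
import zlib


def hashed_bucket(seq_name, n_bits):
    """Map a sequence name to an integer in [0, 2**n_bits).

    CRC32 is a stand-in here (NOT mafSplit's own hash). It is deterministic
    across runs, unlike Python's built-in hash() for strings.
    """
    if not 1 <= n_bits <= 32:
        raise ValueError("n_bits must be between 1 and 32")
    return zlib.crc32(seq_name.encode("utf-8")) & ((1 << n_bits) - 1)


def maf_split_cmd(splits_bed, out_root, in_mafs, by_target=False,
                  use_hashed_name=None):
    """Build an argument list following
    'mafSplit splits.bed outRoot file(s).maf'."""
    if use_hashed_name is not None and not by_target:
        raise ValueError("-useHashedName is for use only with -byTarget")
    cmd = ["mafSplit"]
    if by_target:
        cmd.append("-byTarget")          # splits.bed is then ignored by the tool
    if use_hashed_name is not None:
        cmd.append("-useHashedName=%d" % use_hashed_name)
    cmd += [splits_bed, out_root] + list(in_mafs)
    return cmd


if __name__ == "__main__":
    for name in ("chr10", "chrX", "scaffold_1234"):
        # Same name always lands in the same bucket; at most 2**8 buckets.
        print(name, hashed_bucket(name, 8))
    print(" ".join(maf_split_cmd("dummy.bed", "split/", ["all.maf"],
                                 by_target=True, use_hashed_name=8)))
```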


---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 4/2/10 2:52 AM, John Reid wrote:
> Hi,
>
> I'm trying to retrieve regions that are aligned to oRegAnno annotations
> in several species. Initially I've started with mouse. I can use the
> UCSC DAS capability to retrieve the oRegAnno features and retrieve the
> DNA sequences for them. I can't work out how to get to bases in other
> species using DAS or any other method that I can script.
>
> I found this advice from Jennifer Jackson in an old post on this newsgroup:
>> Using the tools at UCSC, the Table Browser will return blocks of 
>> Conservation MAF results, but not specific bases. However, by sending the 
>> data over to Galaxy, "slices" of the Conservation track's MAF alignment can 
>> be retrieved in batch using a custom track of intervals (down to a single 
>> base).
>>
>> To do this:
>>
>> 1) Create and load a custom track in BED format of the genome positions of 
>> interest
>> 2) Send the custom track to Galaxy by extracting it from the Table browser 
>> and checking Galaxy as the output choice
>> 3) Send the Conservation track's MAF alignment data to Galaxy using the same 
>> method (you may need to subset this by chromosome to improve 
>> speed/performance)
>> 4) Use the Galaxy tools: Fetch Alignments ->  Extract MAF blocks given a set 
>> of genomic intervals
>>
>> UCSC help is as follows:
>>
>> http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#CustomTracks
>> http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#TableBrowser
>> http://genome.ucsc.edu/FAQ/FAQformat#format1
>> http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
>>
>> Galaxy help is available at their web site if you have questions about the 
>> tools.
>>
>
> I tried this but it took hours just to load the conservation track for
> mouse chromosome 10 into galaxy. I have several organisms, each with
> many chromosomes. Is there a better way to do it? It's frustrating
> because with a few clicks on the Genome Browser web interface I can
> retrieve the information I need for one particular region but I can't
> work out how to write a script to retrieve it.
>
> Thanks in advance,
> John.
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome