Re: [galaxy-user] operate on genomic intervals

Jennifer Jackson Mon, 07 May 2012 09:03:34 -0700

Hi Jose,

Very glad to know that you have this working.

Your question is difficult to address with specificity, as these are completely different algorithms. But in general, any alignment algorithm (Bowtie included) has some sort of indexing strategy (some are better than others) to minimize what is held in memory while bulk data is processed through it. See the Bowtie documentation for how this is achieved.

The interval operations tools also have an indexing strategy: specifically, the second input file is the portion loaded into memory, and the first file is processed against it. So, if you want to use an extremely large dataset (or just want the job to run quicker), try to use it as the first input file if possible.
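
As a rough illustration of that pattern (this is not the actual Galaxy tool code, just a minimal Python sketch), the second input is read fully into memory and the first is streamed past it one line at a time. The function names are made up for this example, and it assumes whitespace-delimited "chrom start end" lines with half-open coordinates. It uses a plain linear scan where a real tool would use a proper interval index.

```python
from collections import defaultdict


def load_intervals(handle):
    """Read 'chrom start end' lines fully into memory, grouped by chromosome."""
    by_chrom = defaultdict(list)
    for line in handle:
        fields = line.split()
        if len(fields) < 3 or line.startswith("#"):
            continue  # skip blank, short, or comment lines
        by_chrom[fields[0]].append((int(fields[1]), int(fields[2])))
    return by_chrom


def stream_join(first_handle, second_handle):
    """Yield (first, second) interval pairs that overlap.

    Only the second input is held in memory; the first is read line by
    line, so its size does not drive memory use.
    """
    in_memory = load_intervals(second_handle)
    for line in first_handle:
        fields = line.split()
        if len(fields) < 3 or line.startswith("#"):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Linear scan for clarity only; assumes half-open [start, end).
        for s, e in in_memory.get(chrom, ()):
            if start < e and s < end:
                yield (chrom, start, end), (chrom, s, e)
```

With this layout, memory grows with the size of the second input only, which is why the larger dataset belongs in the first slot.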

These tools are designed to be used together, and with other tools, to create workflows, so there should pretty much always be some way to break jobs up (as you did) to get them through the tools, even on modest systems: http://wiki.g2.bx.psu.edu/Learn/Interval%20Operations
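
Outside of Galaxy, the same "one chromosome at a time, then combine" idea might look something like the sketch below. It is only an illustration under assumptions: the function names and file paths are hypothetical, the inputs are whitespace-delimited "chrom start end" files with half-open coordinates, and the overlap test is a simple nested loop rather than an indexed join.

```python
from collections import Counter


def intervals_for(path, chrom):
    """Yield (start, end) for the lines of `path` on chromosome `chrom`."""
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if line.startswith("#") or len(fields) < 3:
                continue
            if fields[0] == chrom:
                yield int(fields[1]), int(fields[2])


def chromosomes(path):
    """Return the chromosome names in `path`, in order of first appearance."""
    seen = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if line.startswith("#") or len(fields) < 3:
                continue
            if fields[0] not in seen:
                seen.append(fields[0])
    return seen


def count_reads_per_bin(bins_path, reads_path):
    """Count overlapping reads for every bin, one chromosome at a time."""
    totals = {}
    for chrom in chromosomes(bins_path):
        # Only this chromosome's bins are held in memory at once.
        bins = list(intervals_for(bins_path, chrom))
        counts = Counter()
        for r_start, r_end in intervals_for(reads_path, chrom):
            for b_start, b_end in bins:
                if r_start < b_end and b_start < r_end:
                    counts[(b_start, b_end)] += 1
        # Merge this chromosome's results; bins with no reads get 0.
        for b_start, b_end in bins:
            totals[(chrom, b_start, b_end)] = counts[(b_start, b_end)]
    return totals
```

Each pass re-reads the input files, trading extra I/O for a smaller memory footprint, which is the same trade made by splitting a Galaxy job by chromosome and merging the outputs.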


Take care,

Jen
Galaxy team


On 5/7/12 8:30 AM, Xianrong Wong wrote:

Hello Jennifer, thanks for the advice. It worked when I did it a
chromosome at a time. Is there a reason why this is so much more
computationally heavy compared to Bowtie? (Mapping 190 million reads
took only 3-4 hours for me.)
Jose

On Fri, May 4, 2012 at 7:55 PM, Jennifer Jackson <[email protected]
<mailto:[email protected]>> wrote:

    Hello Jose,

    It sounds as if the job is running out of memory. Since you are
    already working on a cloud, I am going to make the assumption that
    you have explored the server options with high-capacity memory
    there. But if not, that is one place to start, in particular your
    EC2 Instance type, as described on this wiki:
    http://wiki.g2.bx.psu.edu/Admin/Cloud/CapacityPlanning

    However, even if that is an option, you may want to consider
    running your data through in another way: by running smaller
    jobs, then merging results, to avoid the large jobs. For example, in
    the last step where you join to the "full BamHI delimited bin file",
    instead join to groups of bins in that file (perhaps grouped by
    chromosome), then combine the results to produce the full output.

    Hopefully this helps provide some options,

    Jen
    Galaxy team


    On 5/4/12 2:18 PM, Xianrong Wong wrote:

        Hello,

        I have binned the mouse genome into fragments based on
        restriction enzyme cut sites, so each bin is a fragment flanked
        by, say, BamHI. The output file is in interval format: chr#,
        start, and end coordinates of each bin. I want to count how many
        reads align to each bin. I mapped my reads using Bowtie and
        generated a dataset (interval format) for the aligned reads. I
        then used Join in "Operate on Genomic Intervals" and asked it to
        return only the intervals that join to the "bin file" (inner
        join). The subsequent steps involve grouping and counting, and
        then joining back to the first dataset (the BamHI-delimited
        bins). I have tried this workflow on small datasets and it
        worked. However, when I submit my full alignment file and full
        BamHI-delimited bin file, the tool fails. I am doing this on the
        cloud. Any advice would be appreciated!

        Jose


        _____________________________________________________________
        The Galaxy User list should be used for the discussion of
        Galaxy analysis and other features on the public server
        at usegalaxy.org <http://usegalaxy.org>.  Please keep all
        replies on the list by
        using "reply all" in your mail client.  For discussion of
        local Galaxy instances and the Galaxy source code, please
        use the Galaxy Development list:

        http://lists.bx.psu.edu/listinfo/galaxy-dev

        To manage your subscriptions to this and other Galaxy lists,
        please use the interface at:

        http://lists.bx.psu.edu/


    --
    Jennifer Jackson
    http://galaxyproject.org


--
Jennifer Jackson
http://galaxyproject.org
