Hi again

Oh, I see... Not the best way to deal with the resources, indeed, but still better than nothing.

I do run Java tools via Galaxy, but I haven't paid attention to this issue, so I can't really help. It's a homemade tool that only has one class, so I guess it's not worth the effort in my case. But if you find the answer, I'd be interested too.

Good luck,
L-A


On 05/19/2011 04:28 PM, Leandro Hermida wrote:
Hi Louise,

I see, thank you for the response. Maybe there was some confusion: the feature I was trying to explain with LSF is that you *don't* need to tell it the required resources for a job, and it will still be able to run all the submitted jobs on a node without crashing, even if the jobs need e.g. 10 more cores than are available (that is, 10 more cores than LSF thought they needed). LSF will just temporarily suspend jobs in mid-run on a node to keep the load down, so nothing ever crashes, even if you are running jobs that require 20 CPUs and you only have 2. I thought maybe there was a way to do this with TORQUE. If LSF or TORQUE is explicitly passed the resources needed, then it never needs to temporarily suspend anything, because it will pick a node with those resources free. That being said, your method is more efficient for this very reason: it picks a node with the cores actually available, instead of picking a node with maybe just one core free and then running the multithreaded job more slowly because it has to be suspended periodically while it runs.

Also, I wonder, do you run any Java command-line tools via Galaxy? I can't seem to find out exactly how many cores the JVM needs during execution, or how to limit it to a certain maximum; it just jumps around in CPU usage from 50% to over 400%.
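
One workaround that might do the trick on Linux would be to pin the JVM to a fixed set of cores from the outside rather than asking it to limit itself. A minimal sketch in Python (the jar name is made up, and -XX:ParallelGCThreads only caps the garbage collector's threads, not the tool's own thread pool):

    # Pin the java process to cores 0 and 1 so its total CPU usage stays
    # around 200%, whatever number of threads it decides to start
    # (requires util-linux's taskset).
    import subprocess

    cmd = [ 'taskset', '-c', '0,1',
            'java', '-XX:ParallelGCThreads=2', '-jar', 'mytool.jar' ]
    subprocess.check_call( cmd )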

regards,
Leandro

2011/5/19 Louise-Amélie Schmitt <louise-amelie.schm...@embl.de>

    Hi again Leandro

    Well I might not have been really clear, perhaps I should have
    re-read the mail before posting it :)

    The thing is, it was not an issue of Torque starting jobs when
    there were not enough resources available, but rather of it believing
    the needed resources for each job were fewer than they actually were
    (e.g. always assuming the jobs were single-threaded even if the actual
    tools needed more than one core). If Torque is properly notified of
    the needed resources, it will dispatch the jobs or make them wait
    accordingly (since it knows the nodes' limits and load), like your
    LSF does.

    This hack is not very sexy but it just notifies Torque of the
    cores needed by every multithreaded tool, so it doesn't run a
    multithreaded job when there's only one core available in the
    chosen node.

    Hope that helps :)

    Regards,
    L-A



    On 05/19/2011 03:05 PM, Leandro Hermida wrote:
    Hi Louise-Amelie,

    Thank you for the post reference, this is exactly what I was
    looking for.  For us, for example, when I want to execute a tool
    that is a Java command, the JVM will typically use multiple cores
    as it's running.  You said that with TORQUE it will crash when
    there aren't enough resources when the job is submitted.  I wonder
    if you can do the same thing we have done here with LSF?  With LSF
    you can configure a maximum server load for each node, and if the
    submitted jobs push the node load above this threshold (e.g. more
    cores requested than available), LSF will temporarily suspend jobs
    (using some kind of heuristics) so that the load stays below the
    threshold, and unsuspend them as resources become available.  So
    for us, things just run more slowly when we cannot pass the
    requested number of cores to LSF.

    I would think maybe there is a way to have TORQUE achieve the
    same thing, so jobs don't crash when the resources requested exceed
    what is available?

    regards,
    Leandro

    2011/5/19 Louise-Amélie Schmitt <louise-amelie.schm...@embl.de>

        Hi,

        In a previous message, I explained how I got certain jobs to run
        multithreaded; perhaps you can modify the corresponding files
        for drmaa in a similar way:

        On 04/26/2011 11:26 AM, Louise-Amélie Schmitt wrote:
        Just one little fix on line 261:
        261                 if ( len(l) > 1 and l[0] == job_wrapper.tool.id ):

        Otherwise it pathetically crashes when non-multithreaded jobs are
        submitted. Sorry about that.

        Regards,
        L-A

        Le mardi 19 avril 2011 à 14:33 +0200, Louise-Amélie Schmitt a écrit :
        Hello everyone,

        I'm using TORQUE with Galaxy, and we noticed that if a tool is
        multithreaded, the number of needed cores is not communicated to pbs,
        leading to job crashes if the required resources are not available when
        the job is submitted.

        Therefore I modified the code a little, as follows, in
        lib/galaxy/jobs/runners/pbs.py:

        256         # define PBS job options
        257         attrs.append( dict( name = pbs.ATTR_N, value = str( "%s_%s_%s" % ( job_wrapper.job_id, job_wrapper.tool.id, job_wrapper.user ) ) ) )
        258         mt_file = open('tool-data/multithreading.csv', 'r')
        259         for l in mt_file:
        260                 l = string.split(l)
        261                 if ( l[0] == job_wrapper.tool.id ):
        262                         attrs.append( dict( name = pbs.ATTR_l, resource = 'nodes', value = '1:ppn='+str(l[1]) ) )
        263                         attrs.append( dict( name = pbs.ATTR_l, resource = 'mem', value = str(l[2]) ) )
        264                         break
        265         mt_file.close()
        266         job_attrs = pbs.new_attropl( len( attrs ) + len( pbs_options ) )
        (sorry it didn't come out very well due to line breaking)

        The csv file contains a list of the multithreaded tools, each line
        containing:
        <tool id>\t<number of threads>\t<memory needed>\n
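
        For example, a line for a hypothetical tool (the id and values
        here are made up, and the fields are tab-separated) might look like:

        bwa_wrapper    4    8gb

        which would make the snippet above request nodes=1:ppn=4 and
        mem=8gb for that tool's jobs.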

        And it works fine, the jobs wait for their turn properly, but the
        information is duplicated. Perhaps there would be a way to include
        something similar in Galaxy's original code (if it is not already the
        case, I may not be up to date) without duplicating data.

        I hope that helps :)

        Best regards,
        L-A

        On 05/19/2011 12:03 PM, Leandro Hermida wrote:
        Hi,

        When Galaxy is configured to use the DRMAA job runner, is
        there a way for a tool to tell DRMAA the number of cores it
        would like to request? The equivalent of bsub -n X in LSF,
        where X is the minimum number of cores to have available on
        a node.
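
        For illustration, outside of Galaxy the plain Python drmaa
        bindings let you pass such scheduler-specific flags through a
        job template's nativeSpecification attribute. A minimal sketch
        (the '-n 4' value assumes an LSF-backed DRMAA library, and the
        script path is made up):

        import drmaa

        s = drmaa.Session()
        s.initialize()
        jt = s.createJobTemplate()
        jt.remoteCommand = '/path/to/tool.sh'   # made-up tool script
        # nativeSpecification is passed verbatim to the underlying
        # scheduler, so '-n 4' means "request 4 cores" only if the DRMAA
        # library sits on top of LSF; TORQUE/SGE need their own flags.
        jt.nativeSpecification = '-n 4'
        jobid = s.runJob(jt)
        s.deleteJobTemplate(jt)
        s.exit()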

        best,
        leandro

