On May 20, 2013, at 3:08 PM, Nate Coraor <n...@bx.psu.edu> wrote:

> On May 20, 2013, at 3:31 PM, Fields, Christopher J wrote:
>> On May 20, 2013, at 9:45 AM, Nate Coraor <n...@bx.psu.edu>
>> wrote:
>>> On May 19, 2013, at 11:41 PM, Fields, Christopher J wrote:
>>>> I've been seeing this error popping up quite a bit recently (we're using 
>>>> Torque 3.0.5), which is giving a general 'cluster failure' Galaxy error on 
>>>> jobs:
>>>> galaxy.jobs.runners.pbs WARNING 2013-05-19 18:56:44,073 (10588) pbs_submit 
>>>> failed (try 1/5), PBS error 15033: No free connections
>>>> This just recently started springing up after we had been using the 
>>>> cluster for over a year; not sure why it would start acting up now.  This 
>>>> appears to be related to a bug/defect with pbs_python, which doesn't seem 
>>>> to have been fixed yet (I posted a query whether this has been addressed):
>>>> https://oss.trac.surfsara.nl/pbs_python/ticket/29
>>>> Restarting helps, but are there any other recommended workarounds?  Is the 
>>>> only solution recompiling Torque with NCONNECTS?
>>> Hi Chris,
>>> I've always increased NCONNECTS to avoid this.  You may also want to 
>>> decrease the number of workers for the runners as this should decrease the 
>>> number of connections that Galaxy makes.
>>> You could also try the DRMAA runner.
>>> --nate
>> Hi Nate,
>> I reduced the number of handlers and that seems to have taken care of it for 
>> now.  One of the authors of pbs_python replied in the ticket I commented on 
>> (https://oss.trac.surfsara.nl/pbs_python/ticket/29#comment:5):
>> "Which framework do you use, PBSQuery? Then there will be an open/close with 
>> every query. If you use pbs_python you have to close the connection and 
>> open it again. The number of connections is handled by the pbs_server and 
>> also the time that a connection can be open. This has nothing to do with pbs 
>> python code, maybe i can improve the error codes."
>> Not sure how to answer that one as I don't know exactly what Galaxy is doing 
>> internally.
> Hi Chris,
> We're disconnecting under all normal conditions and most error conditions - 
> it looks like only a few conditions would not properly disconnect:
> - If pbs_submit() fails 5 times in a row
> - If an exception is raised anywhere in the queue_job() method after 
> pbs_connect() is called
> - If the call to pbs_statjob() in check_all_jobs() or check_single_job() 
> raises an exception (if that's even possible)
> For the exceptions, you would see that an exception was caught in the log 
> file, so you should be able to determine if this is happening.
> For the pbs_submit() case, you'd see the message "All attempts to submit job 
> failed".
> You may want to move the call to pbs_connect() in queue_job() so that it 
> occurs immediately prior to the call to pbs_submit() and see if that makes a 
> difference.  The reason we connect so early on is to avoid writing out the 
> job's files if the PBS server doesn't exist anyway.

Yes, we're seeing the pbs_submit() case.  Lowering the number of handlers does 
seem to help, but we still seem to run into this eventually.
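The failure mode Nate describes comes down to a handle opened by pbs_connect() 
that is never released when pbs_submit() fails or an exception escapes.  A 
minimal sketch of the connect-late, always-disconnect pattern (the stub 
functions below are hypothetical stand-ins for pbs_python's 
pbs_connect()/pbs_submit()/pbs_disconnect(), so the example is self-contained 
and not Galaxy's actual code):

```python
# Hypothetical stand-ins for pbs_python's API; real code would call
# pbs.pbs_connect(server), pbs.pbs_submit(...), pbs.pbs_disconnect(conn).
open_connections = 0  # tracks leaked handles in this simulation

def pbs_connect():
    global open_connections
    open_connections += 1
    return object()  # stand-in for a connection handle

def pbs_disconnect(conn):
    global open_connections
    open_connections -= 1

def pbs_submit(conn, job):
    # Worst case: every attempt fails, as with PBS error 15033.
    raise RuntimeError("No free connections")

def queue_job(job, tries=5):
    """Connect immediately before submitting; disconnect on every path."""
    for attempt in range(1, tries + 1):
        conn = pbs_connect()
        try:
            return pbs_submit(conn, job)
        except RuntimeError:
            pass  # would log "pbs_submit failed (try %d/%d)" and retry
        finally:
            pbs_disconnect(conn)  # runs on success, failure, and exceptions
    raise RuntimeError("All attempts to submit job failed")

try:
    queue_job({"name": "test"})
except RuntimeError as e:
    print(e)
print(open_connections)  # 0 -- no handles leaked despite every submit failing
```

Because the disconnect sits in a finally block inside the retry loop, even five 
consecutive pbs_submit() failures leave no connection open, which is the case 
the list above identifies as a leak.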

One interesting note: we're using Torque 3.0.5.  I checked up on NCONNECTS; in 
v3.0.5 it is set to the same value as PBS_NET_MAX_CONNECTIONS, which is defined 
(in src/include/server_limits.h) as 10240 by default.  So recompiling doesn't 
seem necessary.  It might be easier to try the DRMAA route.

>> BTW, we're still using a pre-April Galaxy release, so we may test this again 
>> once we have updated to the latest (probably in the next month or so).  We 
>> may switch over to DRMAA if you find that more stable, would just need to 
>> schedule down time to recompile with DRMAA support and higher NCONNECTS.
> I haven't used DRMAA + TORQUE, but I know that there are sites that are using 
> it.  However, you'll want to use pbs-drmaa rather than TORQUE's libdrmaa as 
> per the docs:
>    http://wiki.galaxyproject.org/Admin/Config/Performance/Cluster#DRMAA
> As an added bonus, I don't believe you need to recompile TORQUE to build 
> pbs-drmaa.
> --nate

Working on getting that set up now, as it seems to be the most obvious 
solution.  I might delve into this a bit more in the meantime.

