Hey Carrie,

  Thanks for starting this conversation. We talked a while yesterday
on IRC, but I wanted to summarize our conversation and give everyone
an update.

  I am not saying no and certainly not saying no to all of it, but I
would be very hesitant to include the job staging / rewriting stuff.
The LWR has a lot of code to do this in a very tailored, flexible, and
powerful way.

   The approach you outline only works with scp and rewrites the
commands Galaxy generates after the fact - which may be error-prone.
The LWR has many different ways it can potentially stage files (send
to LWR via HTTP, pull from Galaxy via HTTP, file system copy from
Galaxy, file system copy from remote host, disable staging) and it can
be configured to do this on a per path basis - which can be very
powerful given the complexity of how different file systems may be
shared where and mounted as what paths. The LWR can be configured to
affect the tool evaluation process so commands do not need to be
rewritten - they use the right paths from the get go.
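To make the per-path idea concrete, here is a rough sketch of how a staging action might be resolved from a longest-prefix match on a path. The table entries, action names, and function are mine for illustration - they are not LWR's actual configuration format or code.

```python
# Illustrative only: map path prefixes to staging actions, in the
# spirit of LWR's per-path configuration. Action names are invented.
PATH_ACTIONS = [
    ("/galaxy/files", "transfer"),  # send to LWR via HTTP
    ("/shared/data", "none"),       # mounted on both sides; no staging
    ("/scratch", "copy"),           # plain file system copy
]

def resolve_action(path, default="transfer"):
    """Return the staging action for the longest matching path prefix."""
    best, best_len = default, -1
    for prefix, action in PATH_ACTIONS:
        if path.startswith(prefix) and len(prefix) > best_len:
            best, best_len = action, len(prefix)
    return best
```

The point is just that staging decisions can be made per path, which matters when different file systems are mounted in different places on each side.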

  The LWR properly handles paths not just in the command - but in
configfiles, param files, metadata commands, from_work_dir outputs,
Galaxy configured with the outputs_to_working_directory option, etc.
The LWR can be configured to generate metadata remotely or locally,
the LWR can be configured to resolve dependencies (tool shed, Galaxy
packages, modules, etc...) remotely, locally, or just disable them.
The LWR works with newer features like job metrics and per-destination
env tweaking.

  I understand that the need to run a permanent service on the remote
compute login node can be problematic. So what I would like to see
(and where I am sure Galaxy will evolve to over the next year) is a
separation of "staging jobs" from "running jobs" - if we could do LWR
or LWR-like staging without requiring the use of the LWR job runner
(stage it like the LWR and submit directly via DRMAA or the CLI runner
for instance). (Aside: LWR staging combined with running jobs in
Docker would be wonderful from a security/isolation perspective.)

  So if one wanted to do something like what you are doing - I would
be more eager to merge it if it were to somehow:

 - Leverage LWR staging with the CLI runner.
 - Add an LWR staging action type for scp-ing files.

  That is a lot of work however :(. Given your use case - I think
something I added to Galaxy today takes a pretty different approach
to this but may work equally well (perhaps better in some ways).

  For some background - thanks to great work by Nate,
https://test.galaxyproject.org/ is now running LWR on TACC's Stampede
supercomputer. We wanted to do it without opening the firewall on the
TACC login node, so the LWR can now be driven by a message queue
instead.
Galaxy sends the LWR a submit message, LWR stages the job, submits it,
sends status updates via message queue back to Galaxy, and then the
LWR sends results back to Galaxy.

  While that MQ-driven process still requires a remote daemon running
on the compute's login node - your use case really doesn't need one.
So I added some new options to the LWR client to allow that initial
submission message to be encoded in base64 and just passed to a
simple command-line version of the LWR configured on the remote host.
From there the rest of the process works pretty much identically to
the MQ-driven LWR approach we are using with Stampede - the remote LWR
script will pull the needed files down from Galaxy, submit the job (in
your case your application submits itself to Condor in chunks, so you
would want to just use the LWR equivalent of the local job runner -
the queued_python manager - it's the default), send updates back to
Galaxy via a message queue, and then exit once the job is finished.
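The general shape of that handoff - serialize a submit message, base64-encode it, hand it to a command-line entry point on the remote host - can be sketched like this. The message fields and manager name below are made up for the example; they are not LWR's actual wire format.

```python
# Illustrative sketch of encoding a job submit message for a
# command-line remote entry point. Field names are hypothetical.
import base64
import json

submit_message = {
    "job_id": "42",
    "command_line": "blastn -query input.fa -db nt",
    "manager": "queued_python",  # hypothetical manager name
}

# Galaxy side: serialize and base64-encode the message so it can be
# passed safely as a single command-line argument.
encoded = base64.b64encode(json.dumps(submit_message).encode("utf-8"))

# Remote side: the CLI script decodes the argument and proceeds with
# staging and submission from there.
decoded = json.loads(base64.b64decode(encoded).decode("utf-8"))
```

Base64 just makes the structured message shell-safe; everything after the decode is the same code path as the MQ-driven case.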

  If this version of the compute flow doesn't sit well with you -
there are two changes I would definitely be eager to incorporate (feel
free to request or contribute them).

 - If you don't like requiring the message queue infrastructure - I
would love to see a variant of this that extended the jobs API to
allow status updates that way. (The file transfers for jobs use a
single-purpose key scheme to secure job-related files - similar keys
could be used for status updates.)
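By a single-purpose key scheme I mean something like the following sketch: derive a per-job, per-purpose token from a server secret, so a key that authorizes file transfer for one job cannot be replayed for another job or another purpose. This is entirely illustrative - the function names and derivation are mine, not Galaxy's or LWR's actual implementation.

```python
# Hypothetical single-purpose key scheme: HMAC over (job_id, purpose)
# with a server-side secret. Names and scheme are illustrative only.
import hashlib
import hmac

SERVER_SECRET = b"change-me"  # placeholder; a real deployment would use a strong random secret

def job_key(job_id, purpose):
    """Derive a token valid only for this job and this purpose."""
    msg = ("%s:%s" % (job_id, purpose)).encode("utf-8")
    return hmac.new(SERVER_SECRET, msg, hashlib.sha256).hexdigest()

def verify(job_id, purpose, presented):
    """Constant-time check of a presented token."""
    return hmac.compare_digest(job_key(job_id, purpose), presented)
```

The same derivation could mint separate tokens for file transfer and for status updates, so neither grants the other capability.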

 - If instead you don't like the HTTP transfer and would prefer
scp/rcp - I would love to see more action types added to LWR's staging
setup to allow scp-ing files between Galaxy and the remote login
node (either initiated on the Galaxy side or the remote host - the
LWR contains example actions similar to either).
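For a rough idea of what such an action type could look like, here is a sketch that shells out to scp in both directions. The class and method names are invented for the example and do not reflect LWR's real action interface.

```python
# Hypothetical scp-based staging action; not LWR's actual action API.
import subprocess

class ScpTransferAction:
    def __init__(self, remote_host, remote_user):
        self.remote = "%s@%s" % (remote_user, remote_host)

    def write_to_path(self, local_path, remote_path):
        # Push a file from Galaxy to the remote login node.
        subprocess.check_call(
            ["scp", local_path, "%s:%s" % (self.remote, remote_path)])

    def read_from_path(self, remote_path, local_path):
        # Pull a result file back from the remote login node.
        subprocess.check_call(
            ["scp", "%s:%s" % (self.remote, remote_path), local_path])
```

A mirror-image action initiated from the remote host would look much the same with the copy directions reversed.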

Hope this helps.


On Tue, Jun 3, 2014 at 1:15 PM, Ganote, Carrie L <cgan...@iu.edu> wrote:
> Hi Devs,
> I'd like to open up some discussion about incorporating some code bits into
> the Galaxy distribution.
> My code is here:
> https://bitbucket.org/cganote/osg-blast-galaxy
> First off, I'd like to say that these changes were made initially as hacks
> to get Galaxy working with a grid interface for our nefarious purposes. For
> us, the results have been spiffy, in that we can offload a bunch of Blast
> work off of our own clusters and onto the grid, which processes them fast on
> a distributed set of computers.
> In order to do this, I wanted to be able to take as much control over the
> process as I could. The destination uses Condor, but it used condor_dag to
> submit jobs - that means I would have to modify the condor job runner.
> The destination needed to have the files shipped over to it first - so I had
> to be able to stage. This made lwr attractive, but then I would need to
> guarantee that the server at the other end was running lwr, and since I
> don't have control of that server, this seemed less likely to be a good
> option.
> The easiest thing for me to understand was the cli runner. I could do ssh, I
> could do scp, so this seemed the best place to start. So I started by trying
> to figure out which files needed to be sent to the server, and then
> implementing a way to send them. I start with stdout, stderr and exit code
> files. I also want to stage any datasets that are in the param_dict, and
> anything that is in extra_files_path. Then we alter the command line that is
> run such that all the paths make sense on the remote server, and to make
> sure that the right things are run remotely vs. locally (i.e., metadata.sh
> is run locally after job is done). Right now, this is done by splitting the
> command line on a specific string, which is not robust for future changes to
> the command_factory, but I'm open to suggestions.
> So, here's one hack. The hidden data tool parameter is something I hijacked
> - as far as I can tell, hidden data is only used for Cufflinks, so it seemed
> safe. I use it to send the shell script that will be run on the server (but
> NOT sent to the worker nodes). It needed to be a DATA type so that my stager
> would pick it up and send it over. I wanted it to be hidden because it was
> only used by the tool and it should not need to be an HDA. I made changes to
> allow the value of the hidden data to be set in the tool - this would become
> the false_path of the data, which would then become its actual path.
> Please have a look, and ask questions, and if there are improvements needed
> before anything is considered for pulling, let me know. I'd like to present
> this at the Galaxy conference without having vegetables thrown at me.
> Thanks!
> -Carrie Ganote
> National Center for Genome Analysis Support
> Indiana University
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   http://lists.bx.psu.edu/
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/