Re: [galaxy-dev] Using Mesos to Enable distributed computing under Galaxy?
Hey Kyle, all,

If anyone wants to play with running Galaxy jobs within an Apache Mesos environment, I have added a prototype of this feature to the LWR.

https://bitbucket.org/jmchilton/lwr/commits/555438d2fe266899338474b25c540fef42bcece7
https://bitbucket.org/jmchilton/lwr/commits/9748b3035dbe3802d4136a6a1028df8395a9aeb3

This work distributes jobs across a Mesos cluster and injects a MESOS_URL environment variable into the job runtime environment in case the jobs themselves want to take advantage of Mesos.

The advantage of the LWR versus a traditional Galaxy runner is that the job can be staged to remote resources without shared disk. Prior to this I was imagining the LWR to be useful in cases where Galaxy and the remote cluster don't share a common disk but where there is in fact a shared scratch directory or something similar across the remote cluster, as well as a resource manager. The LWR Mesos framework, however, has the actual compute servers themselves stage the job up and down - so you could imagine distributing Galaxy across large clusters without any shared disk whatsoever - that could be very cool and help scale, say, cloud applications.

The downside of an LWR-based approach versus a Galaxy approach is that it is less mature and there is more stuff to configure - you need to configure a Galaxy job_conf plugin and destination, the LWR itself, and a message queue (for this variant of LWR operation anyway - it should be possible to drive this via the LWR in web server mode but I haven't added it yet).

I would be more than happy to continue to see progress toward Mesos support in Galaxy proper.

The prototype is strictly a prototype so far - a sort of playground if anyone wants to play with these ideas and build something cool. Mesos really is a framework, right - not so much a job scheduler, so I am not sure it is very immediately useful - but I imagine one could build cool stuff on top of it.
Next, I think I would like to add Apache Aurora (http://aurora.incubator.apache.org/) support - because it seems like a much more traditional resource manager, but built on top of Mesos, so it would be more practical for traditional Galaxy-style jobs. It doesn't buy you anything in terms of parallelization, but it would fit better with Galaxy.

-John

On Sat, Oct 26, 2013 at 2:43 PM, Kyle Ellrott kellr...@soe.ucsc.edu wrote:

I think one of the aspects where Galaxy is a bit soft is the ability to do distributed tasks. The current system of split/replicate/merge tasks based on file type is a bit limited and hard for tool developers to expand upon. Distributed computing is a non-trivial thing to implement, and I think it would be a better use of our time to use an already existing framework. It would also mean one less API for tool writers to have to develop for.

I was wondering if anybody has looked at Mesos ( http://mesos.apache.org/ ). You can see an overview of the Mesos architecture at https://github.com/apache/mesos/blob/master/docs/Mesos-Architecture.md

The important thing about Mesos is that it provides an API for C/C++, Java/Scala and Python to write distributed frameworks. There are already implementations of frameworks for common parallel programming systems such as:
- Hadoop (https://github.com/mesos/hadoop)
- MPI (https://github.com/apache/mesos/blob/master/docs/Running-torque-or-mpi-on-mesos.md)
- Spark (http://spark-project.org)

And you can find an example Python framework at https://github.com/apache/mesos/tree/master/src/examples/python

Integration with Galaxy would have three parts:
1) Add a system config variable to Galaxy called 'MESOS_URL' that is then passed to tool wrappers and allows them to contact the local Mesos infrastructure (assuming the system has been configured), or pass a null if the system isn't available.
2) Write a tool runner that works as a Mesos framework to execute single-CPU jobs on the distributed system.
3) For instances where Mesos is not available at a system-wide level (say they only have access to an SGE-based cluster), but the user wants to run distributed jobs, write a wrapper that can create a Mesos cluster using the existing queueing system. For example, right now I run a Mesos system under the SGE queue system.

I'm curious to see what other people think.

Kyle
___
Please keep all replies on the list by using reply all in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
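
Editorial sketch of the MESOS_URL idea discussed in this thread: a tool wrapper could look for the variable in its job environment and fall back to ordinary local execution when it is absent or empty (the "null" case in Kyle's first point). This is only an illustration, not code from the LWR or Galaxy. The function names and the example URL are hypothetical, and nothing here talks to Mesos itself.

```python
import os


def resolve_mesos_url(environ=None):
    """Return the Mesos URL from the environment, or None if unset or blank.

    `environ` defaults to os.environ; passing a plain dict makes this testable.
    """
    env = os.environ if environ is None else environ
    url = env.get("MESOS_URL", "").strip()
    return url or None


def choose_execution_mode(environ=None):
    """Decide how a (hypothetical) tool wrapper should run.

    Returns ("mesos", url) when a URL was injected, else ("local", None).
    """
    url = resolve_mesos_url(environ)
    if url is None:
        return ("local", None)
    return ("mesos", url)


if __name__ == "__main__":
    mode, url = choose_execution_mode()
    if mode == "mesos":
        print("Mesos available at %s; distributed execution is possible" % url)
    else:
        print("MESOS_URL not set; running as a plain single-node job")
```

Treating a blank value the same as a missing one means a deployment without Mesos can simply leave the variable unset, or set it to an empty string, and every tool keeps working unchanged.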
Re: [galaxy-dev] Per-tool configuration
Hello Jan,

Thanks for the clarification. Not quite what I was expecting, so I am glad I asked - I don't have great answers for either case, so hopefully other people will have some ideas.

For the first use case - I would just specify some default value to supply to the wrapper - let's call this N - add a parameter to the tool wrapper --limit-size=N - test that, and then allow it to be overridden via an environment variable - so in your command block use --limit-size=\${BLAST_QUERY_LIMIT:-N}. This will use N if no limit is set, but deployers can set limits. There are a number of ways to set such variables - DRM-specific environment files, login rc files, etc. Just this last release I added the ability to define environment variables right in job_conf.xml (https://bitbucket.org/galaxy/galaxy-central/pull-request/378/allow-specification-of-environment/diff). I thought the tool shed might have a way to collect such definitions as well and insert them into package files - but Google failed to find this for me.

Not sure about how to proceed with the second use case - extending the .loc file should work locally - I am not sure it is feasible within the context of the existing tool shed tools, data manager, etc. You could certainly duplicate this stuff with your modifications - this has downsides in terms of interoperability though.

Sorry I don't have great answers for either question,

-John

On Sat, Jun 14, 2014 at 5:12 AM, Jan Kanis jan.c...@jankanis.nl wrote:

I have two use cases: the first is for a modification of the NCBI BLAST wrapper to limit the query input size (for a publicly accessible Galaxy instance), so this needs a configuration option for the query size limit. I was thinking about a separate config file in tool-data for this.

The second is for a tool I have written to convert a BLAST XML output into an HTML report. The report contains links for each match to a gene bank (e.g. the NCBI database).
These links should be configurable per database that was searched, and preferably have an option of linking to the location of the match within the gene if the gene bank supports such links. One option is to add an extra column to the blast .loc files (if that doesn't break blast), where the databases are already configured.

Jan

On 13 Jun 2014 18:02, John Chilton jmchil...@gmail.com wrote:

I would have different answers for you depending on what options are available to the server admin. What exactly about the tool is configurable - can you be more specific?

-John

On Fri, Jun 13, 2014 at 10:59 AM, Jan Kanis jan.c...@jankanis.nl wrote:

I am writing a tool that should be configurable by the server admin. I am considering adding a configuration file, but where should such a file be placed? Is the tool-data directory the right place? Is there another standard way for per-tool configuration?

Jan
___
Please keep all replies on the list by using reply all in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
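
Editorial sketch of the environment-variable override John describes in this thread, done inside a Python wrapper script rather than in the shell command block. It is a rough illustration under stated assumptions, not code from the BLAST wrappers. The thread does not say how "query size" is measured, so counting FASTA records is an assumption here, and the default of 100 plus all function names are hypothetical. Only the variable name BLAST_QUERY_LIMIT comes from the thread.

```python
import os

# Hypothetical built-in default; the thread only calls this value "N".
DEFAULT_QUERY_LIMIT = 100


def query_limit(environ=None, default=DEFAULT_QUERY_LIMIT):
    """Return the deployer's limit from BLAST_QUERY_LIMIT, else the default.

    An unset or empty variable falls back to the default, mirroring the
    shell's ${VAR:-default} behaviour.
    """
    env = os.environ if environ is None else environ
    raw = env.get("BLAST_QUERY_LIMIT", "")
    if not raw:
        return default
    limit = int(raw)  # raises ValueError on non-numeric settings
    if limit < 0:
        raise ValueError("BLAST_QUERY_LIMIT must not be negative")
    return limit


def count_fasta_records(lines):
    """Count sequences in FASTA text by counting '>' header lines."""
    return sum(1 for line in lines if line.startswith(">"))


def check_query(lines, limit):
    """Return the record count, or raise if the query exceeds the limit."""
    n = count_fasta_records(lines)
    if n > limit:
        raise ValueError("query has %d sequences, limit is %d" % (n, limit))
    return n
```

A wrapper would call check_query(open(path), query_limit()) before invoking BLAST. A server admin can then tighten or relax the limit through any of the mechanisms John lists for setting environment variables, without editing the tool.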