Dannon Baker Wed, 13 Nov 2013 12:03:40 -0800

A short-term option that just occurred to me would be to run a sort of
post-job action for output datasets, deleting any intermediate parent
datasets that are not workflow outputs and are no longer needed.
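
A minimal sketch of the post-job cleanup idea above, in plain Python. It
assumes each job knows its input and output dataset names and that the
workflow's marked outputs are known up front. The `Job` class and
`on_job_finished` function are hypothetical illustrations, not Galaxy's
model or API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for a workflow step (not Galaxy's model)."""
    name: str
    inputs: list[str]
    outputs: list[str]
    done: bool = False


def on_job_finished(job: Job, jobs: list[Job], workflow_outputs: set[str]) -> set[str]:
    """Mark `job` as done and return the parent datasets that can be purged.

    A parent is purgeable when it was produced inside the workflow
    (so it is intermediate, not a user-supplied input), it is not a
    marked workflow output, and no unfinished job still reads it.
    """
    job.done = True
    produced = {d for j in jobs for d in j.outputs}
    still_needed = {d for j in jobs if not j.done for d in j.inputs}
    return {
        d
        for d in job.inputs
        if d in produced
        and d not in workflow_outputs
        and d not in still_needed
    }
```

With a workflow where `sort` and `stats` both read `aligned`, the dataset
only becomes purgeable once the second of the two has finished, and the
user-supplied input is never returned.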



On Wed, Nov 13, 2013 at 11:59 AM, John Chilton <chil...@msi.umn.edu> wrote:

> On Wed, Nov 13, 2013 at 10:34 AM, Peter Cock <p.j.a.c...@googlemail.com>
> wrote:
> > On Tue, Nov 12, 2013 at 7:13 PM, Ben Gift <corn8b...@gmail.com> wrote:
> >> I'm working with a lot of data on a cluster (condor). If I save all the
> >> workflow intermediate data, as Galaxy does by default (and rightfully so),
> >> it fills the drives.
> >>
> >> How can I tell Galaxy to use /tmp/ to store all intermediate data in a
> >> workflow, and keep the result?
> >
> > You can't - for a start /tmp is usually machine specific so the /tmp
> > used by one cluster node is probably not going to be available
> > on the /tmp of the other cluster nodes, and different stages of
> > the workflow are likely to be run on different cluster nodes.
> >
> >> I imagine I'll have to work on how Galaxy handles jobs, but I'm
> >> hoping there is something built in for this.
> >
> > Workflows can mark the output datasets, and the rest are
> > automatically hidden/deleted on successful completion
> > (but kept and visible on request via the history menu).
> >
> > It might be nice if we could make that more aggressive and
> > actually purge the intermediate files from disk as well?
>
> The ability to have these deleted is not available, but it should be an
> option. Here is the most relevant Trello card.
>
> https://trello.com/c/YfLGkJKe
>
> Even this small step will probably require tracking some concept of a
> running workflow in the database or a message queue. I don't think
> this is being done currently, but I think Dannon is working on the
> queue piece.
>
> Once that is in place, there are still many things that could be done
> better in this arena. Nate has mentioned building functionality into object
> stores and job planning so that data could be pre-staged where it
> needs to be ahead of time in a workflow.
>
> Along similar lines, one could also imagine implementing/configuring
> an object store that simply wrote files that are pre-marked for
> deletion (once implemented) to faster staging/scratch disk on the
> cluster. Having this advanced planning logic built in is probably a
> prerequisite to allowing the use of named pipes or in-memory data
> files some day.
>
> A lot of things to work on and there is a long way to go. I have
> created a Trello card for this and will link to this thread. But it
> should probably be spelled out more concretely and broken into
> multiple cards.
>
> https://trello.com/c/dUMOHHmM
>
> -John
>
> >
> > Peter
> > ___________________________________________________________
> > Please keep all replies on the list by using "reply all"
> > in your mail client.  To manage your subscriptions to this
> > and other Galaxy lists, please use the interface at:
> >   http://lists.bx.psu.edu/
> >
> > To search Galaxy mailing lists use the unified search at:
> >   http://galaxyproject.org/search/mailinglists/
>


Re: [galaxy-dev] Write all intermediate workflow data to /tmp/ only