Jim Johnson and I have been discussing that approach to handling
fractionated proteomics samples as well (composite datatypes, not the
specifics of the interface for parallelizing).
My perspective has been that Galaxy should be augmented with better
native mechanisms for grouping objects in histories, operating over
those groups, building workflows that involve arbitrary numbers of
inputs, etc. Composite data types are kind of a kludge; I think they
are more useful for grouping HTML files together when you don't care
about operating on the constituent parts and just want to view the pages
as a report or something. With the proteomic data we are working
with, the individual pieces are really interesting, right? You want to
operate on the individual pieces with the full array of tools (not
just these special tools that have the logic for dealing with the
composite datatypes), you want to visualize the files, etc. Putting
these component pieces in the composite data type extra_files path
really limits what you can do with the pieces in Galaxy.
I have a vague idea of something that I think could bridge some of the
gaps between the approaches (though I have no clue on the
feasibility). Looking through your implementation on Bitbucket, it
looks like you are defining your core datatypes (MS2, CruxSequest) as
subclasses of this composite data type (CompositeMultifile). My
recommendation would be to try to define plain datatypes for these
core datatypes (MS2, CruxSequest) and then have the separate composite
datatype sort of delegate to the plain datatypes.
You could then continue to explicitly declare subclasses of the
composite datatype (maybe MS2Set, CruxSequestSet), but also maybe
augment the tool XML so you can do implicit data type instances the
way you can with tabular data for instance (instead of defining
columns you would define the datatype to delegate to).
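
A rough sketch of the delegation idea, just to make it concrete. Nothing here is Galaxy's actual datatype API: PlainDatatype, DatasetSet, and set_of are hypothetical names, and the sniff check is reduced to a file extension test for illustration.

```python
class PlainDatatype:
    """Hypothetical stand-in for a simple, single-file datatype."""
    file_ext = "data"

    def sniff(self, path):
        # Extension check only; a real sniffer would inspect file content.
        return path.endswith("." + self.file_ext)


class MS2(PlainDatatype):
    file_ext = "ms2"


class DatasetSet:
    """Hypothetical homogeneous composite that delegates to a plain datatype."""

    def __init__(self, element_type, paths):
        self.element_type = element_type
        self.paths = list(paths)

    def sniff(self):
        # Valid only if non-empty and every member passes the delegate's check.
        return bool(self.paths) and all(
            self.element_type.sniff(p) for p in self.paths
        )


def set_of(element_cls):
    """Build an implicit set type for any plain datatype, e.g. set_of(MS2)."""
    def factory(paths):
        return DatasetSet(element_cls(), paths)
    factory.file_ext = element_cls.file_ext + "_set"
    return factory


# An explicitly named set type and an instance holding two fractions.
MS2Set = set_of(MS2)
fractions = MS2Set(["f01.ms2", "f02.ms2"])
```

The point of set_of is that the set type never needs to know anything about MS2 in particular, so the same wrapper could serve CruxSequest or any other plain datatype.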
The next step would be to make the parallelism implicit (i.e., pull it
out of the tool wrapper). Your tool wrappers wouldn't reference the
composite datatypes, they would reference the simple datatypes, but
you could add a little icon next to any input that let you replace a
single input with a composite input for that type. It would be kind of
like the run workflow page where you can replace an input with
multiple inputs. If a composite input (or inputs) is selected, the
tool would then produce composite outputs.
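
The implicit parallelism described above might look something like the following minimal sketch. It assumes a plain Python list stands in for a composite input; run_tool, search, and merge_results are hypothetical helpers for illustration, not Galaxy functions.

```python
def run_tool(tool, **inputs):
    """Run `tool` once per element when any input is a set (hypothetical).

    A list stands in for a composite input; any other value is treated as
    a single dataset and reused for every element.
    """
    set_lengths = {len(v) for v in inputs.values() if isinstance(v, list)}
    if not set_lengths:
        return tool(**inputs)  # ordinary single-input run
    if len(set_lengths) > 1:
        raise ValueError("composite inputs must all have the same size")
    n = set_lengths.pop()
    outputs = []
    for i in range(n):
        element_inputs = {
            k: (v[i] if isinstance(v, list) else v) for k, v in inputs.items()
        }
        outputs.append(tool(**element_inputs))
    return outputs  # composite in, composite out


def search(spectra, database):
    # Toy tool wrapper that only knows about single datasets.
    return f"{spectra}.vs.{database}.results"


def merge_results(results):
    # Toy multi-input step that combines per-element outputs into one.
    return "+".join(results)


single = run_tool(search, spectra="f01.ms2", database="human.fasta")
per_fraction = run_tool(
    search, spectra=["f01.ms2", "f02.ms2"], database="human.fasta"
)
combined = merge_results(per_fraction)
```

Note that search itself never mentions sets; the mapping over elements lives entirely outside the tool wrapper, which is what would let other people's wrappers work unchanged.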
For the steps that actually combine multiple inputs, I think in your
case this is Percolator maybe (a tool like interprophet or Scaffold
that merges peptide probabilities across runs and groups proteins),
then you could have the same sort of implicit replacement but instead
of for single inputs it could do that for multi-inputs (assuming the
Galaxy powers that be accept my fixes for multi-input tool parameters).
The upshot of all of that would be that then even if these composite
datatypes aren't used widely, other people could still use your
proteomics tools (my users are definitely interested in Crux for
instance) and you could then use other developers' proteomic tools
with your composite datatypes even though they weren't designed with
that use case in mind (I have msconvert, myrimatch, idpicker,
proteinpilot, Ira Cooke has X! Tandem, OMSSA, TPP, and NBIC has an
entire suite of label free quant tools). A third benefit would be that
people working in other -omics fields could make use of the homogeneous
composite datatype implementation without needing to rewrite their
wrappers and datatypes.
There is probably something that I am missing that makes this very
difficult. Let me know if you think this is a good idea and what its
feasibility might be. I forked your repo and set off to try to
implement some of this stuff last week, but I ended up with my Galaxy
pull requests to improve batching workflows and multi-input tool
parameters instead. I hope to eventually get around to it.
Senior Software Developer
University of Minnesota Supercomputing Institute
On Mon, Oct 1, 2012 at 8:24 AM, Jorrit Boekel
I thought I was working with fairly large datasets, but they have recently
started to include ~2 GB files in sets of >50. I have run this sort of
thing before as merged data by using tar to roll them up in one set, but
when dealing with >100 GB tarfiles, Galaxy on EC2 seems to get very slow,
although that's probably because of my implementation of dataset type
detection (untar and read through files).
Since tarring/untarring isn't very clean, I want to switch from tarring to
creating composite files on merge by putting a tool's results into the
dataset.extra_files_path. This doesn't seem to be supported yet, because
do_merge currently passes only the output dataset.filename to the respective
datatype's merge method.
I would like to pass more data to the merge method (let's say the whole
dataset object) to be able to get the composite files directory and 'merge'
the files in there. Good idea, bad idea? If anyone has views on this, I'd
love to hear them.
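
A minimal sketch of what such a merge could look like, assuming the merge step receives an object exposing extra_files_path. The function name merge_into_extra_files and the SimpleNamespace stand-in are hypothetical; this is not Galaxy's actual merge signature.

```python
import os
import shutil
import tempfile
from types import SimpleNamespace


def merge_into_extra_files(split_files, output_dataset):
    """Copy each part into the output dataset's composite directory.

    `output_dataset` is assumed to expose `extra_files_path`; passing it
    instead of only a filename is the change proposed above.
    """
    target_dir = output_dataset.extra_files_path
    os.makedirs(target_dir, exist_ok=True)
    merged = []
    for index, path in enumerate(split_files):
        # Prefix with the part index so same-named parts do not collide.
        name = f"{index:04d}_{os.path.basename(path)}"
        shutil.copy(path, os.path.join(target_dir, name))
        merged.append(name)
    return merged


# Toy run in a temporary directory, with a stand-in for the dataset object.
workdir = tempfile.mkdtemp()
parts = []
for i in range(3):
    part_dir = os.path.join(workdir, f"task_{i}")
    os.makedirs(part_dir)
    part = os.path.join(part_dir, "output.txt")
    with open(part, "w") as handle:
        handle.write(f"result {i}\n")
    parts.append(part)

dataset = SimpleNamespace(
    extra_files_path=os.path.join(workdir, "dataset_1_files")
)
merged_names = merge_into_extra_files(parts, dataset)
```

Compared with tarring, nothing has to be unpacked to find out what is inside: listing the directory is enough, which would sidestep the slow untar-and-read type detection.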