Hey Alex,

Until I have bullied this stuff into galaxy-central, you should
probably e-mail me directly and not the dev list. That said, thanks for
the heads up; there was definitely a bug. I pushed out this
changeset to the bitbucket repository:


I should mention that I have sort of abandoned the bitbucket
repository for this work in favor of github, so that I can rebase as
Galaxy changes and keep clean changesets.


Since I am posting this on the mailing list I might as well post a
little summary of what has been done:

- For each datatype, an implicit multiple file version of that
datatype is created. A new multiple upload tool/ftp directory tool has
been implemented to create these.
- For any simple tool input you can choose a multiple file version of
that input instead and then all outputs will become multiple file
versions of the outputs. Uses task splitting stuff to distribute jobs
across files.
- For multiple input tools, you can choose either multiple inputs
individually (no change there) or a single composite version.
Consistent interface for file path, display name, extension, etc... in
tool wrapper.
- It should work with most existing tools and datatypes without change.
- Everything enabled with a single option in universe.ini

  - Makes workflows with arbitrary merging (and to a lesser extent
branching) and arbitrary number of input files possible.
  - Original base name is saved throughout analysis (when possible),
so sample/replicate/fraction/lane/etc tracking is easier.
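
As a rough illustration of the second bullet and the base name point
(this is not the actual Galaxy code): a tool written for one file is run
once per file of a multiple file input, and each output keeps the base
name of its input. The names MultiFileDataset, run_per_file and
count_lines are made up for this sketch, and the real implementation
goes through the task splitting code rather than a plain loop.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MultiFileDataset:
    """Hypothetical stand-in for an implicit multiple file dataset."""
    extension: str
    paths: List[str] = field(default_factory=list)

    def base_names(self) -> List[str]:
        # Base name = file name without directory or extension.
        return [os.path.splitext(os.path.basename(p))[0] for p in self.paths]


def run_per_file(tool: Callable[[str, str], None],
                 inputs: MultiFileDataset,
                 out_dir: str,
                 out_ext: str) -> MultiFileDataset:
    """Run a single-file tool once per input file.

    Returns a multiple file output whose members reuse the input base
    names, so sample/fraction tracking survives the step.
    """
    outputs = MultiFileDataset(extension=out_ext)
    for path, base in zip(inputs.paths, inputs.base_names()):
        out_path = os.path.join(out_dir, base + "." + out_ext)
        tool(path, out_path)
        outputs.paths.append(out_path)
    return outputs


def count_lines(in_path: str, out_path: str) -> None:
    """Toy single-file 'tool': writes the input's line count."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.write(str(sum(1 for _ in src)) + "\n")
```

Calling run_per_file(count_lines, dataset, out_dir, "txt") on a dataset
holding fraction1.ms2 and fraction2.ms2 would write fraction1.txt and
fraction2.txt into out_dir.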

I started working on the metadata piece last night. Once that is done,
I plan to make a little demo video to post to this list to
try to sell the 3 outstanding small pull requests related to this work
and the massive one that would follow those up :).
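
On the AttributeError quoted below: one defensive option, along the
lines Alex suggests, is to read the attribute with a default rather
than assume every upload path has set it. This is only a sketch and not
necessarily what the pushed changeset does. Bunch here is a minimal
stand-in for Galaxy's class of that name, and paramfile_entry is a
made-up helper, not the real create_paramfile.

```python
class Bunch:
    """Minimal stand-in: stores keyword arguments as attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def paramfile_entry(uploaded_dataset):
    """Build a (hypothetical) parameter dict for one uploaded dataset."""
    return {
        "name": uploaded_dataset.name,
        # getattr with a default avoids the AttributeError when an
        # upload path (e.g. library upload) never set `multifiles`.
        "multifiles": getattr(uploaded_dataset, "multifiles", None),
    }
```

The alternative is to make every code path that builds the Bunch set
multifiles explicitly, which keeps create_paramfile strict.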


On Sun, Dec 2, 2012 at 8:52 PM,  <alex.khassa...@csiro.au> wrote:
> Hi John,
> My colleague (Neil) has a bit of a problem with the multi file support:
> When I try and use the option "Upload Directory of files" I get the error 
> below
> Error Traceback:
> ⇝ AttributeError: 'Bunch' object has no attribute 'multifiles'
> URL:
> Module weberror.evalexception.middleware:364 in respond         view
>>>  app_iter = self.application(environ, detect_start_response)
> Module paste.debug.prints:98 in __call__         view
>>>  environ, self.app)
> Module paste.wsgilib:539 in intercept_output         view
>>>  app_iter = application(environ, replacement_start_response)
> Module paste.recursive:80 in __call__         view
>>>  return self.application(environ, start_response)
> Module paste.httpexceptions:632 in __call__         view
>>>  return self.application(environ, start_response)
> Module galaxy.web.framework.base:160 in __call__         view
>>>  body = method( trans, **kwargs )
> Module galaxy.web.controllers.library_common:855 in upload_library_dataset    
>      view
>>>  **kwd )
> Module galaxy.web.controllers.library_common:1055 in upload_dataset         
> view
>>>  json_file_path = upload_common.create_paramfile( trans, uploaded_datasets )
> Module galaxy.tools.actions.upload_common:342 in create_paramfile         view
>>>  multifiles = uploaded_dataset.multifiles,
> AttributeError: 'Bunch' object has no attribute 'multifiles'
> Any ideas? Should we check if 'multifiles' attribute is set? Or some other 
> call is missing which should set it to NULL if it's missing?
> -Alex
> -----Original Message-----
> From: jmchil...@gmail.com [mailto:jmchil...@gmail.com] On Behalf Of John 
> Chilton
> Sent: Wednesday, 17 October 2012 3:21 AM
> To: Khassapov, Alex (CSIRO IM&T, Clayton)
> Subject: Re: [galaxy-dev] pass more information on a dataset merge
> Wow, thanks for the rapid feedback! I have made the changes you have 
> suggested. It seems you must be interested in this idea/implementation. Let 
> me know if you have specific use cases/requirements in mind and/or if you 
> would be interested in write access to the repository.
> -John
> On Mon, Oct 15, 2012 at 11:51 PM,  <alex.khassa...@csiro.au> wrote:
>> Hi John,
>> I tried your galaxy-central-homogeneous-composite-datatypes implementation, 
>> works great thank you (and Jorrit).
>> A couple of fixes:
>> 1. Add multi_upload.xml to tool_conf.xml
>> 2. lib/galaxy/tools/parameters/grouping.py line 322 (in get_filenames( context 
>> )) -
>>         "if ftp_files is not None:"
>>    Remove "is not None" as ftp_files is empty [], but not None, then line 
>> 331 "user_ftp_dir = os.path.join( trans.app.config.ftp_upload_dir, 
>> trans.user.email )" throws an exception if ftp_upload_dir isn't set.
>> Alex
>> -----Original Message-----
>> From: galaxy-dev-boun...@lists.bx.psu.edu
>> [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of John Chilton
>> Sent: Tuesday, 16 October 2012 1:07 AM
>> To: Jorrit Boekel
>> Cc: galaxy-dev@lists.bx.psu.edu
>> Subject: Re: [galaxy-dev] pass more information on a dataset merge
>> Here is an implementation of the implicit multi-file composite datatypes 
>> piece of that idea. I think the implicit parallelism may be harder.
>> https://bitbucket.org/galaxyp/galaxy-central-homogeneous-composite-datatypes/compare
>> Jorrit do you have any objection to me trying to get this included in 
>> galaxy-central (this is 95% code I stole from you)? I made the changes 
>> against a clean galaxy-central fork and included nothing proteomics specific 
>> in anticipation of trying to do that. I have talked with Jim Johnson about 
>> the idea and he believes it would be useful for his mothur metagenomics tools, 
>> so the idea is valuable outside of proteomics.
>> Galaxy team, would you be okay with including this and if so is there 
>> anything you would like to see either at a high level or at the level of the 
>> actual implementation.
>> -John
>> ------------------------------------------------
>> John Chilton
>> Senior Software Developer
>> University of Minnesota Supercomputing Institute
>> Office: 612-625-0917
>> Cell: 612-226-9223
>> Bitbucket: https://bitbucket.org/jmchilton
>> Github: https://github.com/jmchilton
>> Web: http://jmchilton.net
>> On Mon, Oct 8, 2012 at 9:24 AM, John Chilton <chil...@msi.umn.edu> wrote:
>>> Jim Johnson and I have been discussing that approach to handling
>>> fractionated proteomics samples as well (composite datatypes, not the
>>> specifics of the interface for parallelizing).
>>> My perspective has been that Galaxy should be augmented with better
>>> native mechanisms for grouping objects in histories, operating over
>>> those groups, building workflows that involve arbitrary numbers of
>>> inputs, etc... Composite data types are kind of a kludge; I think they
>>> are more useful for grouping HTML files together when you don't care
>>> about operating on the constituent parts and just want to view pages
>>> as a report or something. With this proteomic data we are working
>>> with, the individual pieces are really interesting right? You want to
>>> operate on the individual pieces with the full array of tools (not
>>> just these special tools that have the logic for dealing with the
>>> composite datatypes), you want to visualize the files, etc... Putting
>>> these component pieces in the composite data type extra_files path
>>> really limits what you can do with the pieces in Galaxy.
>>> I have a vague idea of something that I think could bridge some of
>>> the gaps between the approaches (though I have no clue on the
>>> feasibility). Looking through your implementation on bitbucket it
>>> looks like you are defining your core datatypes (MS2, CruxSequest) as
>>> subclasses of this composite data type (CompositeMultifile). My
>>> recommendation would be to try to define plain datatypes for these
>>> core datatypes (MS2, CruxSequest) and then have the separate composite
>>> datatype sort of delegate to the plain datatypes.
>>> You could then continue to explicitly declare subclasses of the
>>> composite datatype (maybe MS2Set, CruxSequestSet), but also maybe
>>> augment the tool xml so you can do implicit data type instances the
>>> way you can with tabular data for instance (instead of defining
>>> columns you would define the datatype to delegate to).
>>> The next step would be to make the parallelism implicit (i.e. pull it
>>> out of the tool wrapper). Your tool wrappers wouldn't reference the
>>> composite datatypes, they would reference the simple datatypes, but
>>> you could add a little icon next to any input that let you replace a
>>> single input with a composite input for that type. It would be kind
>>> of like the run workflow page where you can replace an input with
>>> multiple inputs. If a composite input (or inputs) are selected the
>>> tool would then produce composite outputs.
>>> For the steps that actually combine multiple inputs, I think in your
>>> case this is Percolator maybe (a tool like interprophet or Scaffold
>>> that merges peptide probabilities across runs and groups proteins),
>>> then you could have the same sort of implicit replacement but instead
>>> of for single inputs it could do that for multi-inputs (assuming the
>>> Galaxy powers that be accept my fixes for multi-input tool parameters:
>>> https://bitbucket.org/galaxy/galaxy-central/pull-request/76/multi-input-data-tool-parameter-fixes).
>>> The upshot of all of that would be that then even if these composite
>>> datatypes aren't used widely, other people could still use your
>>> proteomics tools (my users are definitely interested in Crux for
>>> instance) and you could then use other developers' proteomic tools
>>> with your composite datatypes even though they weren't designed with
>>> that use case in mind (I have msconvert, myrimatch, idpicker,
>>> proteinpilot, Ira Cooke has X! Tandem, OMSSA, TPP, and NBIC has an
>>> entire suite of label free quant tools). A third benefit would be
>>> that people working in other -omicses could make use of the
>>> homogeneous composite datatype implementation without needing to
>>> rewrite their wrappers and datatypes.
>>> There is probably something that I am missing that makes this very
>>> difficult, let me know if you think this is a good idea and what its
>>> feasibility might be. I forked your repo and set off to try to
>>> implement some of this stuff last week and I ended up with my galaxy
>>> pull requests to improve batching workflows and multi-input tool
>>> parameters instead, but I hope to eventually get around to it.
>>> -John
>>> On Mon, Oct 1, 2012 at 8:24 AM, Jorrit Boekel
>>> <jorrit.boe...@scilifelab.se> wrote:
>>>> Dear list,
>>>> I thought I was working with fairly large datasets, but they have
>>>> recently started to include ~2Gb files in sets of >50. I have run
>>>> these sort of things before as merged data by using tar to roll them
>>>> up in one set, but when dealing with >100Gb tarfiles, Galaxy on EC2
>>>> seems to get very slow, although that's probably because of my
>>>> implementation of dataset type detection (untar and read through files).
>>>> Since tarring/untarring isn't very clean, I want to switch from
>>>> tarring to creating composite files on merge by putting a tool's
>>>> results into the dataset.extra_files_path. This doesn't seem to be
>>>> supported yet, because we currently pass in do_merge the output
>>>> dataset.filename to the respective datatype's merge method.
>>>> I would like to pass more data to the merge method (let's say the
>>>> whole dataset object) to be able to get the composite files directory and
>>>> 'merge' the files in there. Good idea, bad idea? If anyone has views on
>>>> this, I'd love to hear them.
>>>> cheers,
>>>> jorrit
>>>> ___________________________________________________________
>>>> Please keep all replies on the list by using "reply all"
>>>> in your mail client.  To manage your subscriptions to this and other
>>>> Galaxy lists, please use the interface at:
>>>>  http://lists.bx.psu.edu/
