Thanks John, works fine.

-----Original Message-----
From: Burdett, Neil (ICT Centre, Herston - RBWH) 
Sent: Tuesday, 4 December 2012 9:57 AM
To: Khassapov, Alex (CSIRO IM&T, Clayton)
Cc: Szul, Piotr (ICT Centre, Marsfield)
Subject: RE: [galaxy-dev] pass more information on a dataset merge

Thanks Alex,
                    it seems to work now, so I checked the code in to our repository.

From: [] On Behalf Of John Chilton 
Sent: Tuesday, December 04, 2012 4:26 AM
To: Khassapov, Alex (CSIRO IM&T, Clayton)
Cc: Burdett, Neil (ICT Centre, Herston - RBWH); Szul, Piotr (ICT Centre, Marsfield)
Subject: Re: [galaxy-dev] pass more information on a dataset merge

Hey Alex,

Until I have bullied this stuff into galaxy-central, you should probably e-mail 
me directly and not the dev list. That said, thanks for the heads up; there 
was indeed a bug. I pushed out this changeset to the bitbucket repository:

I should mention that I have mostly abandoned the bitbucket repository for 
this work in favor of github, so that I can rebase as Galaxy changes and keep 
clean changesets.

Since I am posting this on the mailing list I might as well post a little 
summary of what has been done:

- For each datatype, an implicit multiple-file version of that datatype is 
created. A new multiple-upload tool/FTP directory tool has been implemented to 
create these.
- For any simple tool input you can choose a multiple-file version of that input 
instead, and all outputs then become multiple-file versions as well. This uses 
the task-splitting machinery to distribute jobs across files (see the sketch 
after this list).
- For multiple-input tools, you can choose either multiple individual inputs 
(no change there) or a single composite version.
- Consistent interface for file path, display name, extension, etc. in tools.
- It should work with most existing tools and datatypes without change.
- Everything is enabled with a single option in universe_wsgi.ini.

  - Makes workflows with arbitrary merging (and, to a lesser extent,
branching) and arbitrary numbers of input files possible.
  - The original base name is saved throughout the analysis (when possible), so 
sample/replicate/fraction/lane/etc. tracking is easier.
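
To make the fan-out concrete, here is a hypothetical sketch (invented names, 
not the actual task-splitting code):

    import os

    # A tool written for a single file runs once per member of a
    # multiple-file dataset; outputs are regrouped under the original
    # base names, which is what keeps sample/replicate tracking easy.
    def run_over_members(single_file_tool, member_paths):
        outputs = {}
        for path in member_paths:
            base = os.path.basename(path)           # preserved base name
            outputs[base] = single_file_tool(path)  # one task per file
        return outputs   # the multiple-file version of the tool's output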

I started working on the metadata piece last night; once that is done, I am 
planning to make a little demo video to post to this list, to try to sell the 
3 outstanding small pull requests related to this work and the massive one that 
would follow them up :).


On Sun, Dec 2, 2012 at 8:52 PM,  <> wrote:
> Hi John,
> My colleague (Neil) has a bit of a problem with the multi-file support:
> When I try to use the option "Upload Directory of files", I get the 
> error below.
> Error Traceback:
> ⇝ AttributeError: 'Bunch' object has no attribute 'multifiles'
> URL:
> Module weberror.evalexception.middleware:364 in respond
>>>  app_iter = self.application(environ, detect_start_response)
> Module paste.debug.prints:98 in __call__
>>>  environ,
> Module paste.wsgilib:539 in intercept_output
>>>  app_iter = application(environ, replacement_start_response)
> Module paste.recursive:80 in __call__
>>>  return self.application(environ, start_response)
> Module paste.httpexceptions:632 in __call__
>>>  return self.application(environ, start_response)
> Module galaxy.web.framework.base:160 in __call__
>>>  body = method( trans, **kwargs )
> Module galaxy.web.controllers.library_common:855 in upload_library_dataset
>>>  **kwd )
> Module galaxy.web.controllers.library_common:1055 in upload_dataset
>>>  json_file_path = upload_common.create_paramfile( trans, uploaded_datasets )
> Module in create_paramfile
>>>  multifiles = uploaded_dataset.multifiles,
> AttributeError: 'Bunch' object has no attribute 'multifiles'
> Any ideas? Should we check whether the 'multifiles' attribute is set? Or is 
> some other call missing that should set it to None when it's absent?
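> A sketch of the guard I have in mind (against the names in the traceback 
> only; I haven't checked where the Bunch gets built):
>
>     # in create_paramfile: tolerate uploaded datasets that were built
>     # without the attribute, instead of raising AttributeError
>     multifiles = getattr( uploaded_dataset, 'multifiles', None )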
> -Alex
> -----Original Message-----
> From: [] On Behalf Of 
> John Chilton
> Sent: Wednesday, 17 October 2012 3:21 AM
> To: Khassapov, Alex (CSIRO IM&T, Clayton)
> Subject: Re: [galaxy-dev] pass more information on a dataset merge
> Wow, thanks for the rapid feedback! I have made the changes you 
> suggested. You seem to be interested in this idea/implementation. Let 
> me know if you have specific use cases/requirements in mind and/or if you 
> would be interested in write access to the repository.
> -John
> On Mon, Oct 15, 2012 at 11:51 PM,  <> wrote:
>> Hi John,
>> I tried your galaxy-central-homogeneous-composite-datatypes implementation; 
>> it works great, thank you (and Jorrit).
>> A couple of fixes:
>> 1. Add multi_upload.xml to tool_conf.xml.
>> 2. lib/galaxy/tools/parameters/ line 322 (in get_filenames( context )):
>>         "if ftp_files is not None:"
>>    Remove "is not None": ftp_files can be empty ([]) but not None, so the 
>> check passes and line 331 "user_ftp_dir = os.path.join(, 
>> )" throws an exception if ftp_upload_dir isn't set.
>> Alex
>> -----Original Message-----
>> From:
>> [] On Behalf Of John 
>> Chilton
>> Sent: Tuesday, 16 October 2012 1:07 AM
>> To: Jorrit Boekel
>> Cc:
>> Subject: Re: [galaxy-dev] pass more information on a dataset merge
>> Here is an implementation of the implicit multi-file composite datatypes 
>> piece of that idea. I think the implicit parallelism may be harder.
>> t
>> atypes/compare
>> Jorrit, do you have any objection to me trying to get this included in 
>> galaxy-central (this is 95% code I stole from you)? I made the changes 
>> against a clean galaxy-central fork and included nothing proteomics-specific 
>> in anticipation of trying to do that. I have talked with Jim Johnson about 
>> the idea and he believes it would be useful for his mothur metagenomics tools, 
>> so the idea is valuable outside of proteomics.
>> Galaxy team, would you be okay with including this, and if so, is there 
>> anything you would like to see, either at a high level or at the level of the 
>> actual implementation?
>> -John
>> ------------------------------------------------
>> John Chilton
>> Senior Software Developer
>> University of Minnesota Supercomputing Institute
>> Office: 612-625-0917
>> Cell: 612-226-9223
>> Bitbucket:
>> Github:
>> Web:
>> On Mon, Oct 8, 2012 at 9:24 AM, John Chilton <> wrote:
>>> Jim Johnson and I have been discussing that approach to handling 
>>> fractionated proteomics samples as well (composite datatypes, not 
>>> the specifics of the interface for parallelizing).
>>> My perspective has been that Galaxy should be augmented with better 
>>> native mechanisms for grouping objects in histories, operating over 
>>> those groups, building workflows that involve arbitrary numbers of 
>>> inputs, etc... Composite datatypes are kind of a kludge: I think 
>>> they are more useful for grouping HTML files together when you don't 
>>> care about operating on the constituent parts and just want to view 
>>> the pages as a report or something. With this proteomic data we are 
>>> working with, the individual pieces are really interesting, right? 
>>> You want to operate on the individual pieces with the full array of 
>>> tools (not just these special tools that have the logic for dealing 
>>> with the composite datatypes), you want to visualize the files, 
>>> etc... Putting these component pieces in the composite datatype's 
>>> extra_files_path really limits what you can do with the pieces in Galaxy.
>>> I have a vague idea of something that I think could bridge some of 
>>> the gaps between the approaches (though I have no clue about the 
>>> feasibility). Looking through your implementation on bitbucket, it 
>>> looks like you are defining your core datatypes (MS2, CruxSequest) 
>>> as subclasses of this composite datatype (CompositeMultifile). My 
>>> recommendation would be to try to define plain datatypes for these 
>>> core datatypes (MS2, CruxSequest) and then have the separate 
>>> composite datatype sort of delegate to the plain datatypes.
>>> You could then continue to explicitly declare subclasses of the 
>>> composite datatype (maybe MS2Set, CruxSequestSet), but also maybe 
>>> augment the tool XML so you can do implicit datatype instances the 
>>> way you can with tabular data, for instance (instead of defining 
>>> columns, you would define the datatype to delegate to).
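>>> Roughly what I am picturing, as a standalone sketch (invented names, 
>>> not working Galaxy code):
>>>
>>>     # Plain datatype: knows about a single file of the format.
>>>     class MS2(object):
>>>         file_ext = "ms2"
>>>         def sniff(self, path):
>>>             return path.endswith(".ms2")   # stand-in for real detection
>>>
>>>     # Composite datatype delegating per-file behaviour to a plain one.
>>>     class CompositeMultifile(object):
>>>         component_class = None             # the datatype to delegate to
>>>         def sniff_set(self, paths):
>>>             component = self.component_class()
>>>             return all(component.sniff(p) for p in paths)
>>>
>>>     # Explicit subclass as today; or declared implicitly in tool XML.
>>>     class MS2Set(CompositeMultifile):
>>>         component_class = MS2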
>>> The next step would be to make the parallelism implicit (i.e. pull it 
>>> out of the tool wrapper). Your tool wrappers wouldn't reference the 
>>> composite datatypes, they would reference the simple datatypes, but 
>>> you could add a little icon next to any input that lets you replace a 
>>> single input with a composite input for that type. It would be kind 
>>> of like the run-workflow page, where you can replace an input with 
>>> multiple inputs. If a composite input (or inputs) is selected, the 
>>> tool would then produce composite outputs.
>>> For the steps that actually combine multiple inputs (in your case I 
>>> think this is percolator maybe, a tool like interprophet or Scaffold 
>>> that merges peptide probabilities across runs and groups proteins), 
>>> you could have the same sort of implicit replacement, but 
>>> instead of for single inputs it could do that for multi-inputs 
>>> (assuming the Galaxy powers that be accept my fixes for multi-input tool 
>>> parameters:
>>> The upshot of all of that would be that even if these 
>>> composite datatypes aren't used widely, other people could still 
>>> use your proteomics tools (my users are definitely interested in 
>>> Crux, for instance), and you could then use other developers' proteomics 
>>> tools with your composite datatypes even though they weren't designed with 
>>> that use case in mind (I have msconvert, myrimatch, idpicker, 
>>> proteinpilot; Ira Cooke has X! Tandem, OMSSA, TPP; and NBIC has an 
>>> entire suite of label-free quant tools). A third benefit would be 
>>> that people working in other -omics fields could make use of the 
>>> homogeneous composite datatype implementation without needing to 
>>> rewrite their wrappers and datatypes.
>>> There is probably something I am missing that makes this very 
>>> difficult; let me know if you think this is a good idea and what its 
>>> feasibility might be. I forked your repo and set off to try to 
>>> implement some of this stuff last week, but I ended up with my galaxy 
>>> pull requests to improve batching workflows and multi-input tool 
>>> parameters instead. I hope to eventually get around to it.
>>> -John
>>> ------------------------------------------------
>>> John Chilton
>>> Senior Software Developer
>>> University of Minnesota Supercomputing Institute
>>> Office: 612-625-0917
>>> Cell: 612-226-9223
>>> Bitbucket:
>>> Github:
>>> Web:
>>> On Mon, Oct 1, 2012 at 8:24 AM, Jorrit Boekel 
>>> <> wrote:
>>>> Dear list,
>>>> I thought I was working with fairly large datasets, but they have 
>>>> recently started to include ~2 GB files in sets of >50. I have run 
>>>> these sorts of things before as merged data by using tar to roll 
>>>> them up into one set, but when dealing with >100 GB tarfiles, Galaxy 
>>>> on EC2 seems to get very slow, although that's probably because of 
>>>> my implementation of dataset type detection (untar and read through the files).
>>>> Since tarring/untarring isn't very clean, I want to switch from 
>>>> tarring to creating composite files on merge by putting a tool's 
>>>> results into the dataset.extra_files_path. This doesn't seem to be 
>>>> supported yet, because do_merge currently passes only the output 
>>>> dataset.filename to the respective datatype's merge method.
>>>> I would like to pass more data to the merge method (let's say the 
>>>> whole dataset object) to be able to get the composite-files directory and 
>>>> 'merge' the files in there. Good idea, bad idea? If anyone has views on 
>>>> this, I'd love to hear them.
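>>>> For concreteness, roughly this (a sketch of the idea, not the actual 
>>>> do_merge code or signature):
>>>>
>>>>     import os, shutil
>>>>
>>>>     # If merge received the dataset rather than just its file name, a
>>>>     # datatype could 'merge' by collecting members into the composite
>>>>     # files directory instead of concatenating them:
>>>>     def merge(split_files, output_dataset):
>>>>         target = output_dataset.extra_files_path
>>>>         if not os.path.isdir(target):
>>>>             os.makedirs(target)
>>>>         for f in split_files:
>>>>             shutil.copy(f, target)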
>>>> cheers,
>>>> jorrit
