Re: [galaxy-dev] pass more information on a dataset merge

Alex.Khassapov Sun, 21 Oct 2012 23:44:07 -0700

1) One more question, 

My colleague likes the idea, but his composite data set dataset_id.dat file 
contains only a plain list of uploaded files, not HTML like yours.


I was wondering if it is possible to pass somehow a parameter to 
CompositeMultifile.regenerate_primary_file(dataset) to switch between HTML and 
plain list formats. I mean add a 'hidden' parameter in the tool.xml file, but I'm 
not sure how to get these tool parameters in Galaxy source?
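
To illustrate the kind of switch meant in 1), here is a minimal standalone sketch. The function name render_primary_listing and the as_html flag are hypothetical; they are not part of Galaxy or of CompositeMultifile, and the question of how a tool parameter would reach the datatype is left open. It uses the Python 3 standard library only.

```python
from html import escape  # Python 3 standard library


def render_primary_listing(file_names, as_html=True):
    """Return the text of a primary file that lists the given file names.

    Hypothetical helper: as_html selects between an HTML page of links
    and a plain newline-separated list.
    """
    if not as_html:
        # Plain format: one file name per line.
        return "\n".join(file_names) + "\n"
    # HTML format: one escaped link per file name.
    items = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(name, quote=True))
        for name in file_names
    )
    return "<html><body><ul>\n{0}\n</ul></body></html>\n".format(items)
```

A datatype method could call something like this with a flag taken from wherever the setting ends up being stored; that wiring is the open part of the question.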

2) And one more question, I use your "m:xxx" format for the tool output, all 
files are generated in the dataset_id_files folder, but the dataset_id.dat file 
is empty. To force creation of the dataset_id.dat file I use the exec_after_process 
hook (<code file="xxxx.py"/> tag):
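    # Regenerate the primary .dat file for every output dataset.
    for key, val in out_data.items():
        try:
            # Give the dataset a name if it does not have one yet.
            if not hasattr(val.dataset, "name"):
                val.dataset.name = val.dataset.file_name
            val.datatype.regenerate_primary_file(val.dataset)
        except Exception as e:
            print("######## ERROR: " + str(e))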

But it doesn't feel right, I wonder what is the proper way to use "m:xxx" 
format for the output?

-Alex

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of John Chilton
Sent: Saturday, 20 October 2012 1:40 AM
To: Khassapov, Alex (CSIRO IM&T, Clayton)
Subject: Re: [galaxy-dev] pass more information on a dataset merge

Hey Alex,

I think the idea here is that your initially uploaded files would have 
different names, but after Jorrit's tool split/merge step they will all just be 
named after the dataset id (see screenshot) so you need the task_X at the end 
so they don't all just have the same name.

I have not thought a whole lot about the naming thing, in general it seems like 
a tough problem and one that Galaxy itself doesn't do a particularly good job 
at.

Jorrit, have you given any thought to this?

I wonder if it would be feasible to use the initial uploaded name as a sort of 
prefix going forward. So if I upload say

fraction1.RAW
fraction2.RAW
fraction3.RAW

and run a conversion step, maybe I could get:

fraction1_dataset567.ms2
fraction2_dataset567.ms2
fraction3_dataset567.ms2

instead of

dataset567.dat_task_0
dataset567.dat_task_1
dataset567.dat_task_2

Jorrit, do you mind if I give implementing that a shot? It seems like it would 
be a win to me. Am I going to hit some problem I don't see now (presumably 
we have to send some data from the split to the merge and that might be tricky)?
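
The naming scheme I have in mind could be sketched roughly like this. It is only an illustration: prefixed_part_name is a hypothetical helper, not existing Galaxy code, and it assumes the original upload name is still available at merge time, which is exactly the tricky part.

```python
import os


def prefixed_part_name(original_name, dataset_id, extension):
    """Build e.g. 'fraction1_dataset567.ms2' from 'fraction1.RAW'."""
    # Drop any directory and the original extension to get the prefix.
    prefix = os.path.splitext(os.path.basename(original_name))[0]
    return "{0}_dataset{1}.{2}".format(prefix, dataset_id, extension)


names = [prefixed_part_name(n, 567, "ms2")
         for n in ("fraction1.RAW", "fraction2.RAW", "fraction3.RAW")]
```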

-John

On Thu, Oct 18, 2012 at 7:00 PM,  <[email protected]> wrote:
> Thanks John,
>
> I wonder what's the reason for appending _task_XX to the file names, why 
> can't we just keep original file names?
>
> Alex
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of 
> John Chilton
> Sent: Friday, 19 October 2012 6:16 AM
> To: Khassapov, Alex (CSIRO IM&T, Clayton)
> Subject: Re: [galaxy-dev] pass more information on a dataset merge
>
> On Tue, Oct 16, 2012 at 11:11 PM,  <[email protected]> wrote:
>> Hi John,
>>
>> I am definitely interested in this idea, not only me - we are currently 
>> working on moving a few scientific tools (not related to genome) into cloud 
>> using Galaxy.
>
> Great. My interests in Galaxy are mostly outside of genomics as well, it is 
> good to have more people utilizing Galaxy in this way because it will force 
> the platform to become more generic and address broader use cases.
>
>>
>> We will try it further and see if we need any changes. For now one 
>> improvement would be nice, make dataset_id.dat contain list of paths to the 
>> location of the uploaded files, so by displaying html page the user could 
>> just click on the link and download the file.
>>
>
> Code that attempted to do this was in there, but obviously didn't work. I 
> have now fixed it up.
>
> Thanks for beta testing.
>
> -John
>
>> We are pretty new to Galaxy, so our understanding of Galaxy is pretty 
>> limited.
>>
>> Thanks again,
>>
>> Alex
>>
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of 
>> John Chilton
>> Sent: Wednesday, 17 October 2012 3:21 AM
>> To: Khassapov, Alex (CSIRO IM&T, Clayton)
>> Subject: Re: [galaxy-dev] pass more information on a dataset merge
>>
>> Wow, thanks for the rapid feedback! I have made the changes you have 
>> suggested. It seems you must be interested in this idea/implementation. Let 
>> me know if you have specific use cases/requirements in mind and/or if you 
>> would be interested in write access to the repository.
>>
>> -John
>>
>> On Mon, Oct 15, 2012 at 11:51 PM,  <[email protected]> wrote:
>>> Hi John,
>>>
>>> I tried your galaxy-central-homogeneous-composite-datatypes implementation, 
>>> works great thank you (and Jorrit).
>>>
>>> A couple of fixes:
>>>
>>> 1. Add multi_upload.xml to tool_conf.xml.
>>> 2. lib/galaxy/tools/parameters/grouping.py line 322 (in
>>>    get_filenames( context )):
>>>         "if ftp_files is not None:"
>>>    Remove "is not None": ftp_files is an empty list [], not None, so line
>>>    331 "user_ftp_dir = os.path.join( trans.app.config.ftp_upload_dir,
>>>    trans.user.email )" throws an exception if ftp_upload_dir isn't set.
>>>
>>> Alex
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of John 
>>> Chilton
>>> Sent: Tuesday, 16 October 2012 1:07 AM
>>> To: Jorrit Boekel
>>> Cc: [email protected]
>>> Subject: Re: [galaxy-dev] pass more information on a dataset merge
>>>
>>> Here is an implementation of the implicit multi-file composite datatypes 
>>> piece of that idea. I think the implicit parallelism may be harder.
>>>
>>> https://bitbucket.org/galaxyp/galaxy-central-homogeneous-composite-datatypes/compare
>>>
>>> Jorrit do you have any objection to me trying to get this included in 
>>> galaxy-central (this is 95% code I stole from you)? I made the changes 
>>> against a clean galaxy-central fork and included nothing proteomics 
>>> specific in anticipation of trying to do that. I have talked with Jim 
>>> Johnson about the idea and he believes it would be useful for his mothur 
>>> metagenomics tools, so the idea is valuable outside of proteomics.
>>>
>>> Galaxy team, would you be okay with including this and if so is there 
>>> anything you would like to see either at a high level or at the level of 
>>> the actual implementation?
>>>
>>> -John
>>>
>>> ------------------------------------------------
>>> John Chilton
>>> Senior Software Developer
>>> University of Minnesota Supercomputing Institute
>>> Office: 612-625-0917
>>> Cell: 612-226-9223
>>> Bitbucket: https://bitbucket.org/jmchilton
>>> Github: https://github.com/jmchilton
>>> Web: http://jmchilton.net
>>>
>>> On Mon, Oct 8, 2012 at 9:24 AM, John Chilton <[email protected]> wrote:
>>>> Jim Johnson and I have been discussing that approach to handling 
>>>> fractionated proteomics samples as well (composite datatypes, not 
>>>> the specifics of the interface for parallelizing).
>>>>
>>>> My perspective has been that Galaxy should be augmented with better 
>>>> native mechanisms for grouping objects in histories, operating over 
>>>> those groups, building workflows that involve arbitrary numbers of 
>>>> inputs, etc... Composite data types are kind of a kludge; I think 
>>>> they are more useful for grouping HTML files together when you 
>>>> don't care about operating on the constituent parts and just want 
>>>> to view pages as a report or something. With this proteomic data 
>>>> we are working with, the individual pieces are really interesting, right?
>>>> You want to operate on the individual pieces with the full array of 
>>>> tools (not just these special tools that have the logic for dealing 
>>>> with the composite datatypes), you want to visualize the files, 
>>>> etc... Putting these component pieces in the composite data type 
>>>> extra_files path really limits what you can do with the pieces in Galaxy.
>>>>
>>>> I have a vague idea of something that I think could bridge some of 
>>>> the gaps between the approaches (though I have no clue on the 
>>>> feasibility). Looking through your implementation on bitbucket it 
>>>> looks like you are defining your core datatypes (MS2, CruxSequest) 
>>>> as subclasses of this composite data type (CompositeMultifile). My 
>>>> recommendation would be to try to define plain datatypes for these 
>>>> core datatypes (MS2, CruxSequest) and then have the separate 
>>>> composite datatype sort of delegate to the plain datatypes.
>>>>
>>>> You could then continue to explicitly declare subclasses of the 
>>>> composite datatype (maybe MS2Set, CruxSequestSet), but also maybe 
>>>> augment the tool xml so you can do implicit data type instances 
>>>> the way you can with tabular data for instance (instead of defining 
>>>> columns you would define the datatype to delegate to).
>>>>
>>>> The next step would be to make the parallelism implicit (i.e., pull 
>>>> it out of the tool wrapper). Your tool wrappers wouldn't reference 
>>>> the composite datatypes, they would reference the simple datatypes, 
>>>> but you could add a little icon next to any input that let you 
>>>> replace a single input with a composite input for that type. It 
>>>> would be kind of like the run workflow page where you can replace 
>>>> an input with multiple inputs. If a composite input (or inputs) 
>>>> are selected the tool would then produce composite outputs.
>>>>
>>>> For the steps that actually combine multiple inputs, I think in 
>>>> your case this is Percolator maybe (a tool like interprophet or 
>>>> Scaffold that merges peptide probabilities across runs and groups 
>>>> proteins), then you could have the same sort of implicit 
>>>> replacement but instead of for single inputs it could do that for 
>>>> multi-inputs (assuming the Galaxy powers that be accept my fixes for 
>>>> multi-input tool parameters:
>>>> https://bitbucket.org/galaxy/galaxy-central/pull-request/76/multi-input-data-tool-parameter-fixes).
>>>>
>>>> The upshot of all of that would be that then even if these 
>>>> composite datatypes aren't used widely, other people could still 
>>>> use your proteomics tools (my users are definitely interested in 
>>>> Crux for
>>>> instance) and you could then use other developers' proteomic tools 
>>>> with your composite datatypes even though they weren't designed 
>>>> with that use case in mind (I have msconvert, myrimatch, idpicker, 
>>>> proteinpilot, Ira Cooke has X! Tandem, OMSSA, TPP, and NBIC has an 
>>>> entire suite of label free quant tools). A third benefit would be 
>>>> that people working in other -omicses could make use of the 
>>>> homogeneous composite datatype implementation without needing to 
>>>> rewrite their wrappers and datatypes.
>>>>
>>>> There is probably something that I am missing that makes this very 
>>>> difficult, let me know if you think this is a good idea and what 
>>>> its feasibility might be. I forked your repo and set off to try to 
>>>> implement some of this stuff last week and I ended up with my 
>>>> galaxy pull requests to improve batching workflows and multi-input 
>>>> tool parameters instead, but I hope to eventually get around to it.
>>>>
>>>> -John
>>>>
>>>> ------------------------------------------------
>>>> John Chilton
>>>> Senior Software Developer
>>>> University of Minnesota Supercomputing Institute
>>>> Office: 612-625-0917
>>>> Cell: 612-226-9223
>>>> Bitbucket: https://bitbucket.org/jmchilton
>>>> Github: https://github.com/jmchilton
>>>> Web: http://jmchilton.net
>>>>
>>>> On Mon, Oct 1, 2012 at 8:24 AM, Jorrit Boekel 
>>>> <[email protected]> wrote:
>>>>> Dear list,
>>>>>
>>>>> I thought I was working with fairly large datasets, but they have 
>>>>> recently started to include ~2Gb files in sets of >50. I have run 
>>>>> these sort of things before as merged data by using tar to roll 
>>>>> them up in one set, but when dealing with >100Gb tarfiles, Galaxy 
>>>>> on EC2 seems to get very slow, although that's probably because of 
>>>>> my implementation of dataset type detection (untar and read through 
>>>>> files).
>>>>>
>>>>> Since tarring/untarring isn't very clean, I want to switch from 
>>>>> tarring to creating composite files on merge by putting a tool's 
>>>>> results into the dataset.extra_files_path. This doesn't seem to be 
>>>>> supported yet, because we currently pass in do_merge the output 
>>>>> dataset.filename to the respective datatype's merge method.
>>>>>
>>>>> I would like to pass more data to the merge method (let's say the 
>>>>> whole dataset object) to be able to get the composite files directory and 
>>>>> 'merge'
>>>>> the files in there. Good idea, bad idea? If anyone has views on 
>>>>> this, I'd love to hear them.
>>>>>
>>>>> cheers,
>>>>> jorrit
>>>>>
>>>>> ___________________________________________________________
>>>>> Please keep all replies on the list by using "reply all"
>>>>> in your mail client.  To manage your subscriptions to this and 
>>>>> other Galaxy lists, please use the interface at:
>>>>>
>>>>>  http://lists.bx.psu.edu/

