Hi Eric,

Documentation for tool output actions is limited, but available from 
http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax#A.3Cactions.3E_tag_set
 and the code is located at lib/galaxy/tools/parameters/output.py. The 
LiftOver, Cut, BWA, FIMO, TopHat, CCAT, SRMA, MOSAIK, BFAST, SAM-to-BAM, and 
Bowtie tools also currently use them. 
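
For a quick illustration, an <actions> block nests inside a tool's <data> 
output element roughly like this. This is a sketch based on the wiki page 
above rather than a copy of cutWrapper.xml, and the metadata names 
(chromCol, startCol, endCol) are just illustrative:

<outputs>
    <data format="interval" name="out_file1">
        <actions>
            <!-- set column-assignment metadata on the output without a hook -->
            <action type="metadata" name="chromCol" default="1"/>
            <action type="metadata" name="startCol" default="2"/>
            <action type="metadata" name="endCol" default="3"/>
        </actions>
    </data>
</outputs>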


Thanks for using Galaxy,

Dan


On Oct 3, 2011, at 2:51 PM, Paniagua, Eric wrote:

> Hi Dan,
> 
> Regarding the tool output actions, could you point me to any documentation or 
> additional good examples for handling tool output post-processing?
> 
> Thanks,
> Eric
> 
> ________________________________
> From: Daniel Blankenberg [d...@bx.psu.edu]
> Sent: Monday, October 03, 2011 2:00 PM
> To: Paniagua, Eric
> Cc: Nate Coraor; galaxy-dev@lists.bx.psu.edu
> Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
> 
> Hi Eric,
> 
> Can you provide an example of the composite datatype that is experiencing the 
> metadata problems?
> 
> The code and tool hooks are deprecated, but there could still be some 
> instances where they may need to be used.  However, dynamic options (see e.g. 
> "<options from_dataset="input1">" in 
> tools/filters/extract_GFF_Features.xml) can handle many of the cases where 
> code files are used to populate parameters. Additionally, tool output 
> actions (see e.g. tools/filters/cutWrapper.xml) can be used to set 
> metadata ( e.g. column assignments ) on tool outputs without using hooks. 
> Please let us know if we can provide additional information.
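> 
> (For illustration, the dynamic-options pattern looks roughly like the 
> following. This is a sketch from memory rather than a verbatim copy of 
> extract_GFF_Features.xml, and the column indices are made up:)
> 
> <param name="group" type="select" label="Extract features with this group identifier">
>     <options from_dataset="input1">
>         <!-- build the select list from a column of the input dataset -->
>         <column name="name" index="1"/>
>         <column name="value" index="1"/>
>         <filter type="unique_value" name="unique" column="1"/>
>     </options>
> </param>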
> 
> 
> Thanks for using Galaxy,
> 
> Dan
> 
> 
> On Sep 30, 2011, at 1:15 PM, Paniagua, Eric wrote:
> 
> Hi Nate,
> 
> Thanks for your answers.  I will look into setting 
> set_metadata_externally=True.
> 
> I've observed no impact (suggesting someone did the error handling properly), 
> but upload is the only tool I've tested it with so far.  I'll be doing more 
> shortly, but I'm going with the exec_after_process solution since it doesn't 
> require modifying core code here.
> 
> I'm glad to hear that abstraction layer is in the works.
> 
> Regarding the upload and general handling of compressed files, please refer 
> to my response to Brent a short time ago, subject line "[galaxy-user] upload 
> zip file to custom tool".  In particular, I pointed out that it doesn't work 
> even with a manually set file type and why, at least in terms of code.  I was 
> hoping you might know if it was that way on purpose and, if so, why.
> 
> Regarding deprecation of the <code> tag and of tool hooks (am I correct in 
> understanding that these are deprecated too?), there's an example at the 
> top of my mind (since I'm going to start working on this today).  In the 
> abstract, the scenario is dynamically populating the tool UI based on a 
> computation on the contents of one or more input datasets.  To an extent, the 
> <conditional> and <page> tags are helpful with this, but without the <code> 
> tag I'm not clear on how to call out to code that inspects the datasets and 
> returns, for the sake of simplicity, options for a single parameter.
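> 
> (For concreteness, what I'd like to write is roughly the following: a code 
> file referenced via <code file="my_tool_code.py"/> and wired to a select 
> parameter with dynamic_options="options_from_dataset(input1)".  The file 
> name, function name, and option-tuple shape here are my own assumptions:)
> 
> # my_tool_code.py (hypothetical)
> def options_from_dataset( dataset ):
>     # Inspect the input dataset and return (display_name, value, selected)
>     # tuples for a single select parameter.
>     options = []
>     try:
>         for i, line in enumerate( open( dataset.file_name ) ):
>             if i > 100:
>                 break  # don't scan huge files just to build the UI
>             value = line.split( '\t' )[0]
>             options.append( ( value, value, False ) )
>     except Exception:
>         pass
>     return options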
> 
> My specific example has to do with constructing a microarray expression 
> analysis pipeline, with one of my starting points being Ross Lazarus's 
> Rexpression code.  I may be able to get around the issue by storing the 
> relevant information in metadata (in a custom datatype's set_meta).  Not sure 
> yet.  Of course, that relies on metadata being handled correctly.  If I run 
> into something more specific, I'll start a new thread.
> 
> Thanks,
> Eric
> 
> ________________________________________
> From: Nate Coraor [n...@bx.psu.edu]
> Sent: Friday, September 30, 2011 11:03 AM
> To: Paniagua, Eric
> Cc: galaxy-dev@lists.bx.psu.edu<mailto:galaxy-dev@lists.bx.psu.edu>
> Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
> 
> Paniagua, Eric wrote:
> Hi Nate,
> 
> Thank you for your response!  I am glad that it was you in particular who did 
> respond, because I also have some questions about the way the upload tool 
> handles compressed files and saw that you have opened several Issues related 
> to this on the Galaxy bitbucket site.  First though, I'll fill you in on my 
> further progress on the composite file issue.
> 
> As I mentioned in my original email, the trouble is that JobWrapper.finish() 
> calls dataset.set_meta() before it calls collect_associated_files(), 
> resulting in dataset.extra_files_path being inaccurate because the files 
> haven't yet been moved from the job working directory.  This is all with 
> set_metadata_externally=False.  (I haven't worked with setting metadata 
> externally yet, but I think it is worth verifying whether everything works 
> correctly for the case I pointed out when set_metadata_externally=True.)
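> 
> (Schematically, the problematic ordering looks like this; a simplification 
> of JobWrapper.finish(), not the verbatim source:)
> 
> # simplified sketch of JobWrapper.finish() with set_metadata_externally=False
> def finish( self, stdout, stderr ):
>     for dataset in output_datasets:
>         # set_meta() resolves dataset.extra_files_path under config.file_path,
>         # but the component files are still in the job working directory here
>         dataset.set_meta( overwrite=False )
>     # ... only later are the extra files moved out of the working directory:
>     self.tool.collect_associated_files( out_data, self.working_directory )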
> 
> Since my last email, I poked around a bit more and found that my suggested 
> short patch was not so much incorrect as incomplete.  The core problem is 
> that component files are not moved with the primary file, so I changed that 
> (patch attached, relative to https://bitbucket.org/galaxy/galaxy-dist 
> 5955:949e4f5fa03a).  Early in JobWrapper.finish() the primary file is moved 
> from the working directory to the appropriate directory under 
> config.file_path.  This patch uses the structure of the path naming 
> convention to build the accurate path to the component files, and then moves 
> them along with the primary file.  It's the least invasive (in terms of 
> modifying Galaxy core code) potential fix I came up with, but since it relies 
> explicitly on path structure and naming conventions I still think it's a bit 
> of a hack.  However, it does seem to work, at least for me.  I don't have the 
> resources or experience yet to test whether it negatively (or positively!) 
> impacts others' use cases.  I do hope that someone more intimately familiar 
> with the project can vet it, provide feedback and improvements, and 
> incorporate a fix for this issue into the Galaxy mainline.
> 
> However, since the group I'm working for operates downstream of the main 
> Galaxy Development Team (primarily to add tools and maintain a local Galaxy 
> server for our institution), we really try to avoid complications in merges 
> when pulling down updates by avoiding changing Galaxy core code as much as 
> possible.  Consequently, I had to shift focus from finding the best correct 
> fix to simply finding a workaround, and I found a very simple one.  However, 
> it relies on the deprecated <code> tool config tag, so it will only work 
> until support for the <code> tag is removed.  It's self-explanatory and also 
> attached.
> 
> Hi Eric,
> 
> I haven't tried out the patch, but would this have an impact on
> processing that occurs when collect_associated_files() runs, since the
> files have already been moved back by your code?
> 
> I'd suggest enabling set_metadata_externally = True as I suspect this
> will solve the problems since metadata will be set while the primary
> file is still in the temporary location.  set_metadata_externally = True
> should really be the default setting and we may remove internal metadata
> setting entirely at some point.
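> 
> (For anyone following along, that setting lives in universe_wsgi.ini; a 
> minimal sketch:)
> 
> [app:main]
> # Set metadata in an external process, while the outputs are still in
> # the job working directory.
> set_metadata_externally = True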
> 
> Regarding potential fixes: Is there any reason why component files of a 
> composite dataset should not always follow the primary file?  I don't know of 
> one, but maybe there is some case in running Galaxy over a cluster where 
> there is reason to do otherwise.  If the answer is no, it seems the best 
> approach is to add an abstraction layer between the datasets and the file 
> system to keep the various paths that might be associated with a dataset 
> from falling out of sync.  One (obviously motivating) function of this layer 
> would be to move or copy datasets on the file system atomically.  This seems 
> pretty important especially for enabling even more flexible support of 
> distributed or clustered installations.
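> 
> (As a strawman for what that layer might expose: a hypothetical helper 
> that relies on the same '.dat' -> '_files' naming convention my patch 
> exploits:)
> 
> import os, shutil
> 
> def move_dataset( primary_src, primary_dst ):
>     # Hypothetical: move a dataset's primary file and its companion
>     # "_files" directory together, so the two paths never fall out of sync.
>     shutil.move( primary_src, primary_dst )
>     extra_src = primary_src.replace( '.dat', '_files' )
>     if os.path.isdir( extra_src ):
>         shutil.move( extra_src, primary_dst.replace( '.dat', '_files' ) )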
> 
> We're going to be working on just such an abstraction layer shortly.
> 
> Regarding the upload tool: I just saw your response to my other email on 
> Leandro's thread.  Thanks for pointing me to the code responsible.  I've 
> already forked a version of the upload tool for our Galaxy instance.  There 
> are a lot of issues with the way the default upload tool works, and 
> it sounds like you're aware of that and perhaps it's been a headache for you 
> too.  If it might help make progress more swift in the Galaxy mainline, I'd 
> be glad to send you patches for updates and changes I make in the upload 
> tool.  I do have at least one big question about it though: it currently 
> attempts to sniff and uncompress files regardless of whether "Auto-detect" 
> was selected for "File format"; is that by design or accident?  That behavior 
> (particularly the parts that try to uncompress files) seems to cause 
> headaches for a number of developers/users.  I'm likely to remove it from my 
> version of the upload tool, but I'll be less inclined to do so if that will 
> make it diverge significantly from Galaxy's expected design path.
> 
> We definitely need to make it easier to upload compressed files that
> intentionally stay compressed.  This should be the case when setting the
> specific file type, though.
> 
> Finally, I noted that the <code> tag is deprecated.  I learned this from the 
> Tool Config Syntax page on the Wiki 
> (http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax#A.3Ccode.3E_tag_set).
>   It says there that it and the tool hooks (like exec_after_process) are 
> being replaced, but doesn't give many details.  Is there anything you can 
> tell me about what syntax or mechanism is planned to replace them?  The 
> ability to generate UI parameters, have the tool page itself interact with 
> other data sources and programs, use hooks to control tool execution, and 
> generally interleave arbitrary code with tool/job runs is very useful.  
> Unfortunately, the Tool Config Syntax page appears to be out of date, or at 
> least far from comprehensive; I've found many examples of syntax in other 
> tools' configurations that simply aren't documented there at all.
> 
> It's true that <code> is deprecated, although I can't provide much
> useful information beyond that.  If you post up what you're using hooks
> for, one of my colleagues with more familiarity with tool configs can
> probably reply with the best way to accomplish what you're doing without
> hooks.
> 
> --nate
> 
> 
> Please do let me know if you're able to do anything (with or without my 
> suggestions) about the issue I ran into with the way Galaxy handles composite 
> datasets on the file system, or anything more general about composite dataset 
> design intentions, upload functionality, and known issues.  I'd greatly 
> appreciate it.  Thanks for answering my email!
> 
> 
> Thanks,
> Eric
> 
> 
> PS: I'm also unable to download composite datasets that I successfully 
> uploaded; instead I get an error page saying the URL path doesn't exist.  
> However, I haven't dug into this one enough to think it's not a local 
> configuration problem on my end.  Do you recognize this as (related to) a 
> known issue?
> 
> ________________________________________
> From: Nate Coraor [n...@bx.psu.edu]
> Sent: Wednesday, September 28, 2011 9:24 AM
> To: Paniagua, Eric
> Cc: galaxy-dev@lists.bx.psu.edu<mailto:galaxy-dev@lists.bx.psu.edu>
> Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
> 
> Paniagua, Eric wrote:
> Hi all,
> 
> Can anyone tell me why JobWrapper.finish() moves the primary dataset file 
> dataset_path.false_path to dataset_path.real_path (contingent on 
> config.outputs_to_working_directory == True) but does not move the "extra 
> files"?  (lib/galaxy/jobs/__init__.py:540-553)  It seems to me that if you 
> want to move a dataset, you want to move the whole dataset, and that this 
> should perhaps be factored out, maybe into the galaxy.util module?
> 
> Why does class DatasetPath only account for the path to the primary file and 
> not the path to the "extra files"?  It could be used to account for the 
> "extra files" by path splitting as in my previous suggested bug fix, but only 
> if that fix is correct.  It doesn't seem to be used for that purpose in the 
> Galaxy code.
> 
> I look forward to an informative response.
> 
> Hi Eric,
> 
> Sorry for the delay in responding, and thanks for your very detailed
> digging into this problem.  To answer your above question, DatasetPath
> only deals with the primary dataset because extra files are already
> written to the working directory and moved back by the wrapper with
> collect_associated_files().
> 
> Making it possible to read these extra files from the working directory
> will be necessary when set_metadata_externally = True in the config,
> although I am surprised this has been broken the whole time.  Have you
> made progress past your last email?  I can pick up from wherever you've
> left off.
> 
> 
> Thanks,
> Eric Paniagua
> 
> ________________________________________
> From: 
> galaxy-dev-boun...@lists.bx.psu.edu<mailto:galaxy-dev-boun...@lists.bx.psu.edu>
>  [galaxy-dev-boun...@lists.bx.psu.edu] on behalf of Paniagua, Eric 
> [epani...@cshl.edu]
> Sent: Monday, September 12, 2011 7:37 PM
> To: galaxy-dev@lists.bx.psu.edu<mailto:galaxy-dev@lists.bx.psu.edu>
> Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
> 
> Hello again,
> 
> It looks like the config.outputs_to_working_directory variable is intended to 
> do something closely related, but setting it to either True or False does 
> not in fact fix the problem.
> 
> The output path for files in a composite dataset upload (dataset.files_path) 
> that is used in the tools/data_source/upload.xml tool is set to a path under 
> the job working directory by lib/galaxy/tools/__init__.py:1519.  The 
> preceding code (lines 1507-1516) selects the path for the primary file 
> contingent on config.outputs_to_working_directory.
> 
> Why is the path set in line 1519 not contingent on 
> config.outputs_to_working_directory?  Indeed, the following small change 
> fixes the bug I'm observing:
> 
> diff -r 949e4f5fa03a lib/galaxy/tools/__init__.py
> --- a/lib/galaxy/tools/__init__.py      Mon Aug 29 14:42:04 2011 -0400
> +++ b/lib/galaxy/tools/__init__.py      Mon Sep 12 19:32:26 2011 -0400
> @@ -1516,7 +1516,9 @@
>                param_dict[name] = DatasetFilenameWrapper( hda )
>            # Provide access to a path to store additional files
>            # TODO: path munging for cluster/dataset server relocatability
> -            param_dict[name].files_path = os.path.abspath(os.path.join( job_working_directory, "dataset_%s_files" % (hda.dataset.id) ))
> +            #param_dict[name].files_path = os.path.abspath(os.path.join( job_working_directory, "dataset_%s_files" % (hda.dataset.id) ))
> +            # This version should make it always follow the primary file
> +            param_dict[name].files_path = os.path.abspath( os.path.join( os.path.split( param_dict[name].file_name )[0], "dataset_%s_files" % (hda.dataset.id) ))
>            for child in hda.children:
>                param_dict[ "_CHILD___%s___%s" % ( name, child.designation ) ] = DatasetFilenameWrapper( child )
>        for out_name, output in self.outputs.iteritems():
> 
> Would this break anything?
> 
> If that cannot be changed, would the best solution be to modify the upload 
> tool so that it took care of this on its own?  That seems readily doable, but 
> starts to decentralize control of data flow policy.
> 
> Please advise.
> 
> Thanks,
> Eric Paniagua
> ________________________________________
> From: 
> galaxy-dev-boun...@lists.bx.psu.edu<mailto:galaxy-dev-boun...@lists.bx.psu.edu>
>  [galaxy-dev-boun...@lists.bx.psu.edu] on behalf of Paniagua, Eric 
> [epani...@cshl.edu]
> Sent: Monday, September 12, 2011 1:45 PM
> To: galaxy-dev@lists.bx.psu.edu<mailto:galaxy-dev@lists.bx.psu.edu>
> Subject: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
> 
> Hi everyone,
> 
> I've been getting my feet wet with Galaxy development working to get some of 
> the rexpression tools online, and I've run into a snag that I've traced back 
> to a set_meta datatype method not being able to find a file from which it 
> wants to extract metadata.  After reading the code, I believe this would also 
> be a problem for non-composite datatypes.
> 
> The specific test case I've been looking at is uploading an affybatch file 
> (and associated pheno file) using Galaxy's built-in upload tool and selecting 
> the File Format manually (i.e. choosing "affybatch" in the dropdown).  I am 
> using unmodified datatype definitions provided in 
> lib/galaxy/datatypes/genetics.py and unmodified core Galaxy upload code as of 
> 5955:949e4f5fa03a.  (I am also testing with modified versions, but I am able 
> to reproduce and track this bug in the specified clean version).
> 
> The crux of the cause of error is that in JobWrapper.finish(), 
> dataset.set_meta() is called (lib/galaxy/jobs/__init__.py:607) before the 
> composite dataset uploaded files are moved (in a call to a Tool method 
> "self.tool.collect_associated_files(out_data, self.working_directory)" on 
> line 670) from the job working directory to the final destination under 
> config.file_path (which defaults to "database/files").
> 
> In my test case, "dataset.set_meta( overwrite = False )" eventually calls 
> lib/galaxy/datatypes/genetics.py:Rexp.set_meta(dataset, **kwd).  As far as I 
> can tell, the only ways to construct a path to a file (or the file) in a 
> dataset without using hard-coded paths from external knowledge are to use 
> the Dataset.get_file_name or Dataset.extra_files_path properties.  Unless 
> explicitly told otherwise, both of these construct a path based on 
> the Dataset.file_path class data member, whose value is set during Galaxy 
> startup to config.file_path (default "database/files").  However, at the time 
> set_meta is called in this case, the files are not under config.file_path, 
> but rather under the job working directory.  Attempting to open files from 
> the dataset therefore fails when using these paths.  Unless the job 
> working directory is passed to set_meta or during construction of the 
> underlying Dataset object, there doesn't appear to be a way for a Dataset 
> method to access the currently running job (for instance, to get its job ID 
> or working directory).  (The second suggestion is actually not possible: 
> since the standard upload is asynchronous, the Dataset object is created 
> (and persisted) before the Job that will process it is created.)
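> 
> (Concretely, the failure mode is roughly this; a simplified sketch of the 
> set_meta in genetics.py, with an illustrative pheno file name:)
> 
> # simplified sketch of Rexp.set_meta() in lib/galaxy/datatypes/genetics.py
> def set_meta( self, dataset, **kwd ):
>     # extra_files_path is derived from Dataset.file_path (config.file_path),
>     # but during JobWrapper.finish() the files are still in the job working
>     # directory, so this open() fails
>     pheno_path = os.path.join( dataset.extra_files_path, 'affybatch_test.pheno' )
>     pheno = open( pheno_path ).read()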
> 
> Thoughts?  This issue also affects Rexp.set_peek, as well as any other 
> function that wants to read data from the uploaded files before they are 
> moved to their permanent location.  This is why, if you have an affybatch 
> file and its associated pheno file and you test this on, say, the public 
> Galaxy server at http://main.g2.bx.psu.edu/, you'll see that the peek info 
> says (for example): "##failed to find 
> /galaxy/main_database/files/002/948/dataset_2948818_files/affybatch_test.pheno"
> 
> It seems that if the current behavior of Dataset.file_path, Dataset.file_name, 
> and Dataset.extra_files_path is part of the intended design of Galaxy, then 
> methods like set_meta should be run after the files have been moved to 
> config.file_path, so they can set metadata based on file contents.  It looks 
> like this is intended to happen at least in some cases, from looking at 
> lib/galaxy/jobs/__init__.py:568-586.  However, in my tests this code is not 
> kicking in because hda_tool_output is None.
> 
> Any clarification on what's happening here, on what's supposed to happen 
> when setting metadata on (potentially composite) uploads, on why 
> dataset.set_meta() isn't already being called after the files are moved to 
> config.file_path, or any insights on related Galaxy design decisions or 
> constraints I may have missed would be very greatly appreciated.
> 
> I'd also be glad to provide further detail or test files upon request.
> 
> Thank you,
> Eric Paniagua
> 
> PS: Further notes on passing the job working directory to set_meta or 
> set_peek - I have been successful modifying the code to do this for set_meta 
> since the call chain starting from dataset.set_meta() in JobWrapper.finish() 
> to Rexp.set_meta() accepts and forwards keyword argument dictionaries along 
> the way.  However, set_peek does not accept arbitrary keyword arguments, 
> making it harder to pass along the job working directory when needed without 
> stepping on the toes of any other code.
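> 
> (In other words, something like the following works for set_meta because 
> **kwd is forwarded down the call chain; the keyword name is my own:)
> 
> # in JobWrapper.finish(); the extra keyword rides along to Rexp.set_meta()
> dataset.set_meta( overwrite=False, job_working_directory=self.working_directory )
> # set_peek, by contrast, has a fixed signature, so passing the working
> # directory through it would mean touching every set_peek implementation.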
> 
> 
> # This file provides a workaround for the problem of premature set_meta calls
> # in JobWrapper.finish(), which calls set_meta before the component files of
> # a composite dataset have been moved to their correct paths. This workaround
> # exploits the fact that the exec_after_process hook is called after the
> # component files have been moved.
> #
> # This problem affects tools that produce composite datasets in their output.
> #
> # To use this workaround, add the following to your tool's XML configuration
> # (assuming it's in the same directory as this file):
> #
> #       <code file="composite_metadata_workaround.py"/>
> #
> # Also, composite datatypes should set their peek to start with '##failed'
> # on failure in order to be compatible.
> 
> # Retry set_peek and set_meta after the tool has collected all files
> def exec_after_process( app, inp_data, out_data, param_dict, tool, stdout, stderr ):
>     for name, data in out_data.items():
>         if data.peek.find( '##failed' ) == 0:
>             data.set_meta( overwrite=False )
>             data.set_peek( is_multi_byte=data.is_multi_byte() )
>             app.model.context.add( data )
>             app.model.context.flush()
> 
> diff -r -c -x '.hg*' psu_dist/lib/galaxy/jobs/__init__.py glxtemp/lib/galaxy/jobs/__init__.py
> *** psu_dist/lib/galaxy/jobs/__init__.py      Fri Sep 16 11:03:48 2011
> --- glxtemp/lib/galaxy/jobs/__init__.py       Wed Sep 28 13:58:28 2011
> ***************
> *** 551,556 ****
> --- 551,574 ----
>                     else:
>                         self.fail( "Job %s's output dataset(s) could not be read" % job.id )
>                         return
> +                 efp_fake = dataset_path.false_path.replace( '.dat', '_files' )
> +                 efp_real = dataset_path.real_path.replace( '.dat', '_files' )
> +                 try:
> +                     shutil.move( efp_fake, efp_real )
> +                     log.debug( "finish(): Moved %s to %s" % ( efp_fake, efp_real ) )
> +                 except ( IOError, OSError ) as e:
> +                     # this can happen if Galaxy is restarted during the job's
> +                     # finish method - the false_path file has already moved,
> +                     # and when the job is recovered, it won't be found.
> +                     if os.path.exists( efp_fake ):
> +                         log.warning( "finish(): Path %s exists, but could not be moved: %s" % ( efp_fake, e.strerror ) )
> +                     else:
> +                         log.warning( "finish(): Path %s doesn't exist" % efp_fake )
> +                     if os.path.exists( efp_real ) and os.path.isdir( efp_real ):
> +                         log.warning( "finish(): %s exists, so it will be used instead" % efp_real )
> +                     else:
> +                         self.fail( "Job %s's output dataset(s) could not be read" % job.id )
> +                         return
>         job_context = ExpressionContext( dict( stdout = stdout, stderr = stderr ) )
>         job_tool = self.app.toolbox.tools_by_id.get( job.tool_id, None )
>         def in_directory( file, directory ):
> diff -r -c -x '.hg*' psu_dist/lib/galaxy/tools/__init__.py glxtemp/lib/galaxy/tools/__init__.py
> *** psu_dist/lib/galaxy/tools/__init__.py     Fri Sep 16 11:03:48 2011
> --- glxtemp/lib/galaxy/tools/__init__.py      Wed Sep 28 13:58:44 2011
> ***************
> *** 1516,1522 ****
>                 param_dict[name] = DatasetFilenameWrapper( hda )
>             # Provide access to a path to store additional files
>             # TODO: path munging for cluster/dataset server relocatability
> !             param_dict[name].files_path = os.path.abspath(os.path.join( job_working_directory, "dataset_%s_files" % (hda.dataset.id) ))
>             for child in hda.children:
>                 param_dict[ "_CHILD___%s___%s" % ( name, child.designation ) ] = DatasetFilenameWrapper( child )
>         for out_name, output in self.outputs.iteritems():
> --- 1516,1524 ----
>                 param_dict[name] = DatasetFilenameWrapper( hda )
>             # Provide access to a path to store additional files
>             # TODO: path munging for cluster/dataset server relocatability
> !             #param_dict[name].files_path = os.path.abspath(os.path.join( job_working_directory, "dataset_%s_files" % (hda.dataset.id) ))
> !             # This version should make it always follow the primary file
> !             param_dict[name].files_path = os.path.abspath( os.path.join( os.path.split( param_dict[name].file_name )[0], os.path.split( param_dict[name].file_name )[1].replace( '.dat', '_files' ) ))
>             for child in hda.children:
>                 param_dict[ "_CHILD___%s___%s" % ( name, child.designation ) ] = DatasetFilenameWrapper( child )
>         for out_name, output in self.outputs.iteritems():