Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Nate Coraor Wed, 28 Sep 2011 06:25:07 -0700

Paniagua, Eric wrote:
> Hi all,
> 
> Can anyone tell me why JobWrapper.finish() moves the primary dataset file 
> dataset_path.false_path to dataset_path.real_path (contingent on 
> config.outputs_to_working_directory == True) but does not move the "extra 
> files"?  (lib/galaxy/jobs/__init__.py:540-553)  It seems to me that if you 
> want to move a dataset, you want to move the whole dataset, and that perhaps 
> this should be factored out, perhaps into the galaxy.util module?
> 
> Why does class DatasetPath only account for the path to the primary file and 
> not the path to the "extra files"?  It could be used to account for the 
> "extra files" by path splitting as in my previous suggested bug fix, but only 
> if that fix is correct.  It doesn't seem to be used for that purpose in the 
> Galaxy code.
> 
> I look forward to an informative response.


Hi Eric,

Sorry for the delay in responding, and thanks for your very detailed
digging into this problem.  To answer your above question, DatasetPath
only deals with the primary dataset because extra files are already
written to the working directory and moved back by the wrapper with
collect_associated_files().
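
As a rough illustration of what moving "the whole dataset" in one step could
look like (a standalone sketch, not the actual Galaxy code: move_dataset and
its arguments are hypothetical names, and the "dataset_<id>_files" directory
name is assumed from the diff quoted below):

```python
import os
import shutil


def move_dataset(false_path, real_path, dataset_id):
    """Hypothetical helper: move a primary file and its extra-files directory.

    Assumes the extra files live in a sibling directory of the primary file
    named "dataset_<id>_files". Returns True if an extra-files directory
    was found and moved, False if only the primary file was moved.
    """
    extra_name = "dataset_%s_files" % dataset_id
    src_extra = os.path.join(os.path.dirname(false_path), extra_name)
    dst_extra = os.path.join(os.path.dirname(real_path), extra_name)

    # Make sure the destination directory exists before moving anything.
    os.makedirs(os.path.dirname(real_path), exist_ok=True)

    # Move the primary file first, then the extra files if there are any.
    shutil.move(false_path, real_path)
    if os.path.isdir(src_extra):
        shutil.move(src_extra, dst_extra)
        return True
    return False
```

Keeping both moves in one function would at least make it harder for the
primary file and the extra files to end up in different places.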

Making it possible to read these extra files from the working directory
will be necessary when set_metadata_externally = True in the config,
although I am surprised this has been broken the whole time.  Have you
made progress past your last email?  I can pick up from wherever you've
left off.
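
One possible shape for reading extra files from the working directory (purely
a sketch under assumptions: resolve_extra_files_path and its arguments are
hypothetical names, and it ignores the hashed subdirectories Galaxy uses
under config.file_path) is to check the job working directory first and
fall back to the permanent location:

```python
import os


def resolve_extra_files_path(dataset_id, file_path, job_working_directory=None):
    """Hypothetical lookup for a dataset's extra-files directory.

    Prefers the copy under the job working directory while the job is still
    being finished; otherwise returns the path under file_path.
    """
    extra_name = "dataset_%s_files" % dataset_id
    if job_working_directory is not None:
        candidate = os.path.join(job_working_directory, extra_name)
        # Only use the working directory if the files are really there.
        if os.path.isdir(candidate):
            return os.path.abspath(candidate)
    return os.path.abspath(os.path.join(file_path, extra_name))
```

A set_meta implementation could then open files relative to whatever this
returns, regardless of whether the files have been moved yet.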

> 
> Thanks,
> Eric Paniagua
> 
> ________________________________________
> From: [email protected] 
> [[email protected]] on behalf of Paniagua, Eric 
> [[email protected]]
> Sent: Monday, September 12, 2011 7:37 PM
> To: [email protected]
> Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
> 
> Hello again,
> 
> It looks like the config.outputs_to_working_directory variable is intended to 
> do something closely related, but setting it to either True or False does 
> not in fact fix the problem.
> 
> The output path for files in a composite dataset upload (dataset.files_path) 
> that is used in the tools/data_source/upload.xml tool is set to a path under 
> the job working directory by lib/galaxy/tools/__init__.py:1519.  The 
> preceding code (lines 1507-1516) selects the path for the primary file 
> contingent on config.outputs_to_working_directory.
> 
> Why is the path set in line 1519 not contingent on 
> config.outputs_to_working_directory?  Indeed, the following small change 
> fixes the bug I'm observing:
> 
> diff -r 949e4f5fa03a lib/galaxy/tools/__init__.py
> --- a/lib/galaxy/tools/__init__.py      Mon Aug 29 14:42:04 2011 -0400
> +++ b/lib/galaxy/tools/__init__.py      Mon Sep 12 19:32:26 2011 -0400
> @@ -1516,7 +1516,9 @@
>                  param_dict[name] = DatasetFilenameWrapper( hda )
>              # Provide access to a path to store additional files
>              # TODO: path munging for cluster/dataset server relocatability
> -            param_dict[name].files_path = os.path.abspath(os.path.join( job_working_directory, "dataset_%s_files" % (hda.dataset.id) ))
> +            #param_dict[name].files_path = os.path.abspath(os.path.join( job_working_directory, "dataset_%s_files" % (hda.dataset.id) ))
> +            # This version should make it always follow the primary file
> +            param_dict[name].files_path = os.path.abspath( os.path.join( os.path.split( param_dict[name].file_name )[0], "dataset_%s_files" % (hda.dataset.id) ))
>              for child in hda.children:
>                  param_dict[ "_CHILD___%s___%s" % ( name, child.designation ) ] = DatasetFilenameWrapper( child )
>          for out_name, output in self.outputs.iteritems():
> 
> Would this break anything?
> 
> If that cannot be changed, would the best solution be to modify the upload 
> tool so that it took care of this on its own?  That seems readily doable, but 
> starts to decentralize control of data flow policy.
> 
> Please advise.
> 
> Thanks,
> Eric Paniagua
> ________________________________________
> From: [email protected] 
> [[email protected]] on behalf of Paniagua, Eric 
> [[email protected]]
> Sent: Monday, September 12, 2011 1:45 PM
> To: [email protected]
> Subject: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
> 
> Hi everyone,
> 
> I've been getting my feet wet with Galaxy development working to get some of 
> the rexpression tools online, and I've run into a snag that I've traced back 
> to a set_meta datatype method not being able to find a file from which it 
> wants to extract metadata.  After reading the code, I believe this would also 
> be a problem for non-composite datatypes.
> 
> The specific test case I've been looking at is uploading an affybatch file 
> (and associated pheno file) using Galaxy's built-in upload tool and selecting 
> the File Format manually (i.e. choosing "affybatch" in the dropdown).  I am 
> using unmodified datatype definitions provided in 
> lib/galaxy/datatypes/genetics.py and unmodified core Galaxy upload code as of 
> 5955:949e4f5fa03a.  (I am also testing with modified versions, but I am able 
> to reproduce and track this bug in the specified clean version).
> 
> The crux of the cause of error is that in JobWrapper.finish(), 
> dataset.set_meta() is called (lib/galaxy/jobs/__init__.py:607) before the 
> composite dataset uploaded files are moved (in a call to a Tool method 
> "self.tool.collect_associated_files(out_data, self.working_directory)" on 
> line 670) from the job working directory to the final destination under 
> config.file_path (which defaults to "database/files").
> 
> In my test case, "dataset.set_meta( overwrite = False )" eventually calls 
> lib/galaxy/datatypes/genetics.py:Rexp.set_meta(dataset, **kwd).  As far as I 
> can tell, the only way to construct a path to a file (or the file) in a 
> dataset without using hard-coded paths from external knowledge is to use the 
> Dataset.get_file_name or Dataset.extra_files_path properties.  Unless 
> explicitly told otherwise, both of these methods construct a path based on 
> the Dataset.file_path class data member, whose value is set during Galaxy 
> startup to config.file_path (default "database/files").  However, at the time 
> set_meta is called in this case, the files are not under config.file_path, 
> but rather under the job working directory.  Attempting to open files from 
> the dataset therefore fails when using these paths.  However, unless the job 
> working directory is passed to set_meta or during construction of the 
> underlying Dataset object, there doesn't appear to be a way for a Dataset 
> method to access the currently running job (for instance to get its job ID or 
> working directory).  (The second suggestion is actually not possible; since 
> the standard upload is asynchronous, the Dataset object is created (and 
> persisted) before the Job that will process it is created.)
> 
> Thoughts?  This issue affects Rexp.set_peek also, as well as any other 
> functions that may want to read data from the uploaded files before they are 
> moved to permanent location.  This is why if you have an affybatch file and 
> its associated pheno file and you test this on, say, the public Galaxy server 
> at http://main.g2.bx.psu.edu/ you'll see that the peek info says (for 
> example): "##failed to find 
> /galaxy/main_database/files/002/948/dataset_2948818_files/affybatch_test.pheno"
> 
> It seems that if the current way that Dataset.file_path, Dataset.file_name, 
> and Dataset.extra_files_path work is part of the desired design of Galaxy, 
> then methods like set_meta should be run after the files have been moved to 
> config.file_path so they can set metadata based on file contents.  It looks 
> like this is intended to happen at least in some cases, from looking at 
> lib/galaxy/jobs/__init__.py:568-586.  However, in my tests this code is not 
> kicking in because hda_tool_output is None.
> 
> Any clarification on what's happening here, what's supposed to be happening 
> for setting metadata on (potentially composite) uploads, why 
> dataset.set_meta() isn't already being called after the files are moved to 
> config.file_path, or any insights on related Galaxy design decisions I may 
> not know about or design constraints I may have missed would be very greatly 
> appreciated.
> 
> I'd also be glad to provide further detail or test files upon request.
> 
> Thank you,
> Eric Paniagua
> 
> PS: Further notes on passing the job working directory to set_meta or 
> set_peek - I have been successful modifying the code to do this for set_meta 
> since the call chain starting from dataset.set_meta() in JobWrapper.finish() 
> to Rexp.set_meta() accepts and forwards keyword argument dictionaries along 
> the way.  However, set_peek does not accept arbitrary keyword arguments, 
> making it harder to pass along the job working directory when needed without 
> stepping on the toes of any other code.
> 
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
> 
>   http://lists.bx.psu.edu/
