Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Paniagua, Eric Mon, 12 Sep 2011 16:37:25 -0700

Hello again,

It looks like the config.outputs_to_working_directory variable is intended to 
do something closely related, but setting it to either of True and False does 
not in fact fix the problem.


The output path for files in a composite dataset upload (dataset.files_path) 
that is used in the tools/data_source/upload.xml tool is set to a path under 
the job working directory by lib/galaxy/tools/__init__.py:1519.  The preceding 
code (lines 1507-1516) select the path for the primary file contingent on 
config.outputs_to_working_directory.

Why is the path set in line 1519 not contingent on 
config.outputs_to_working_directory?  Indeed, the following small change fixes 
the bug I'm observing:

diff -r 949e4f5fa03a lib/galaxy/tools/__init__.py
--- a/lib/galaxy/tools/__init__.py      Mon Aug 29 14:42:04 2011 -0400
+++ b/lib/galaxy/tools/__init__.py      Mon Sep 12 19:32:26 2011 -0400
@@ -1516,7 +1516,9 @@
                 param_dict[name] = DatasetFilenameWrapper( hda )
             # Provide access to a path to store additional files
             # TODO: path munging for cluster/dataset server relocatability
-            param_dict[name].files_path = os.path.abspath(os.path.join( 
job_working_directory, "dataset_%s_files" % (hda.dataset.id) ))
+            #param_dict[name].files_path = os.path.abspath(os.path.join( 
job_working_directory, "dataset_%s_files" % (hda.dataset.id) ))
+            # This version should make it always follow the primary file
+            param_dict[name].files_path = os.path.abspath( os.path.join( 
os.path.split( param_dict[name].file_name )[0], "dataset_%s_files" % 
(hda.dataset.id) ))
             for child in hda.children:
                 param_dict[ "_CHILD___%s___%s" % ( name, child.designation ) ] 
= DatasetFilenameWrapper( child )
         for out_name, output in self.outputs.iteritems():

Would this break anything?

If that cannot be changed, would the best solution be to modify the upload tool 
so that it took care of this on its own?  That seems readily doable, but starts 
to decentralize control of data flow policy.

Please advise.

Thanks,
Eric Paniagua
________________________________________
From: [email protected] [[email protected]] 
on behalf of Paniagua, Eric [[email protected]]
Sent: Monday, September 12, 2011 1:45 PM
To: [email protected]
Subject: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Hi everyone,

I've been getting my feet wet with Galaxy development working to get some of 
the rexpression tools online, and I've run into a snag that I've traced back to 
a set_meta datatype method not being able to find a file from which it wants to 
extract metadata.  After reading the code, I believe this would also be a 
problem for non-composite datatypes.

The specific test case I've been looking at is uploading an affybatch file (and 
associated pheno file) using Galaxy's built-in upload tool and selecting the 
File Format manually (ie choosing "affybatch" in the dropdown).  I am using 
unmodified datatype definitions provided in lib/galaxy/datatypes/genetics.py 
and unmodified core Galaxy upload code as of 5955:949e4f5fa03a.  (I am also 
testing with modified versions, but I am able to reproduce and track this bug 
in the specified clean version).

The crux of the cause of error is that in JobWrapper.finish(), 
dataset.set_meta() is called (lib/galaxy/jobs/__init__.py:607) before the 
composite dataset uploaded files are moved (in a call to a Tool method 
"self.tool.collect_associated_files(out_data, self.working_directory)" on line 
670) from the job working directory to the final destination under 
config.file_path (which defaults to "database/files").

In my test case, "database.set_meta( overwrite = False )" eventually calls 
lib/galaxy/datatypes/genetics.py:Rexp.set_meta(dataset, **kwd).  As far as I 
can tell, the only ways to construct a path to a file (or the file) in a 
dataset without using hard-coded paths from external knowledge is to use the 
Dataset.get_file_name or Dataset.extra_files_path properties.  Unless 
explicitly told otherwise, both of these methods construct a path based on the 
Dataset.file_path class data member, whose value is set during Galaxy startup 
to config.file_path (default "database/files").  However, at the time set_meta 
is called in this case, the files are not under config.file_path, but rather 
under the job working directory.  Attempting to open files from the dataset 
therefore fails when using these paths.  However, unless the job working 
directory is passed to set_meta or during construction of the underlying 
Dataset object, there doesn't appear to be a way for a Dataset method to access 
th!
 e currently running job (for instance to get its job ID or working directory). 
 (The second suggestion is actually not possible; since the standard upload is 
asynchronous, the Dataset object is created (and persisted) before the Job that 
will process it is created.)

Thoughts?  This issue affects Rexp.set_peek also, as well as any other 
functions that may want to read data from the uploaded files before they are 
moved to permanent location.  This is why if you have an affybatch file and its 
associated pheno file and you test this on, say, the public Galaxy server at 
http://main.g2.bx.psu.edu/ you'll see that the peek info says (for example): 
"##failed to find 
/galaxy/main_database/files/002/948/dataset_2948818_files/affybatch_test.pheno"

It seems that if the current way that Dataset.file_path, Dataset.file_name, and 
Dataset.extra_files_path is part of the desired design of Galaxy, that methods 
like set_meta should be run after the files have been moved to config.file_path 
so they can set metadata based on file contents.  It looks like this is 
intended to happen at least in some cases, from looking at 
lib/galaxy/jobs/__init__.py:568-586.  However, in my tests this code is not 
kicking in because hda_tool_output is None.

Any clarification on what's happening here, what's supposed to be happening for 
setting metadata on (potentially composite) uploads, why dataset.set_meta() 
isn't already being called after the files are moved to config.file_path, or 
any insights on related Galaxy design decisions I may not know about or design 
constraints I may have missed would be very greatly appreciated.

I'd also be glad to provide further detail or test files upon request.

Thank you,
Eric Paniagua

PS: Further notes on passing the job working directory to set_meta or set_peek 
- I have been successful modifying the code to do this for set_meta since the 
call chain starting from dataset.set_meta() in JobWrapper.finish() to 
Rexp.set_meta() accepts and forwards keyword argument dictionaries along the 
way.  However, set_peek does not accept arbitrary keyword arguments, making it 
harder to pass along the job working directory when needed without stepping on 
the toes of any other code.

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/

Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Reply via email to