I've been getting my feet wet with Galaxy development working to get some of
the rexpression tools online, and I've run into a snag that I've traced back to
a set_meta datatype method not being able to find a file from which it wants to
extract metadata. After reading the code, I believe this would also be a
problem for non-composite datatypes.
The specific test case I've been looking at is uploading an affybatch file (and
associated pheno file) using Galaxy's built-in upload tool and selecting the
File Format manually (i.e., choosing "affybatch" in the dropdown). I am using
unmodified datatype definitions provided in lib/galaxy/datatypes/genetics.py
and unmodified core Galaxy upload code as of 5955:949e4f5fa03a. (I am also
testing with modified versions, but I am able to reproduce and track this bug
in the specified clean version).
The crux of the problem is that in JobWrapper.finish(),
dataset.set_meta() is called (lib/galaxy/jobs/__init__.py:607) before the
uploaded files of the composite dataset are moved from the job working
directory to their final destination under config.file_path (which defaults to
"database/files"). That move happens later, in a call to the Tool method
"self.tool.collect_associated_files(out_data, self.working_directory)" on line
670.
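To make the ordering problem concrete, here is a minimal sketch that does not use any Galaxy code. The function and file names (finish_job, run, example.pheno) are made up for illustration; the only thing it borrows from the description above is the order of the two steps: reading metadata from the final location, and moving the files there.

```python
import os
import shutil
import tempfile


def finish_job(working_dir, final_dir, filename, set_meta_first):
    """Toy stand-in for the two steps discussed above (hypothetical names)."""
    final_path = os.path.join(final_dir, filename)

    def set_meta():
        # Like a datatype method, this only knows the final path.
        return os.path.exists(final_path)

    def collect_associated_files():
        # Move the uploaded file out of the job working directory.
        shutil.move(os.path.join(working_dir, filename), final_path)

    if set_meta_first:
        found = set_meta()
        collect_associated_files()
    else:
        collect_associated_files()
        found = set_meta()
    return found


def run(set_meta_first):
    """Build a throwaway directory layout and report whether set_meta saw the file."""
    root = tempfile.mkdtemp()
    try:
        working_dir = os.path.join(root, "job_working_directory")
        final_dir = os.path.join(root, "database", "files")
        os.makedirs(working_dir)
        os.makedirs(final_dir)
        with open(os.path.join(working_dir, "example.pheno"), "w") as handle:
            handle.write("sample\tgroup\n")
        return finish_job(working_dir, final_dir, "example.pheno", set_meta_first)
    finally:
        shutil.rmtree(root)
```

Calling run(True) mimics the order I am seeing (metadata first, move second) and the file is not found; run(False) swaps the two steps and the file is found.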
In my test case, "dataset.set_meta( overwrite = False )" eventually calls
lib/galaxy/datatypes/genetics.py:Rexp.set_meta(dataset, **kwd). As far as I
can tell, the only way to construct a path to a file (or to the file itself)
in a dataset without hard-coding paths from external knowledge is to use the
Dataset.get_file_name or Dataset.extra_files_path properties. Unless
explicitly told otherwise, both of these methods construct a path based on the
Dataset.file_path class data member, whose value is set during Galaxy startup
to config.file_path (default "database/files"). However, at the time set_meta
is called in this case, the files are not under config.file_path, but rather
under the job working directory. Attempting to open files from the dataset
therefore fails when using these paths. However, unless the job working
directory is passed to set_meta or during construction of the underlying
Dataset object, there doesn't appear to be a way for a Dataset method to access
the currently running job (for instance to get its job ID or working directory).
(The second suggestion is actually not possible; since the standard upload is
asynchronous, the Dataset object is created (and persisted) before the Job that
will process it is created.)
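As a rough illustration of the path construction described above, here is a toy class, not the real Galaxy Dataset. The class name ToyDataset, the external_filename override, and the "dataset_%d.dat" / "dataset_%d_files" naming pattern are assumptions made for the sketch. The point is only that both paths derive from a class-level file_path, so nothing in the object can lead back to the job working directory.

```python
import os


class ToyDataset:
    # Class-level default, set once at startup (stands in for config.file_path).
    file_path = os.path.join("database", "files")

    def __init__(self, dataset_id, external_filename=None):
        self.id = dataset_id
        # Hypothetical explicit override ("unless told otherwise").
        self.external_filename = external_filename

    @property
    def file_name(self):
        if self.external_filename is not None:
            return self.external_filename
        return os.path.join(self.file_path, "dataset_%d.dat" % self.id)

    @property
    def extra_files_path(self):
        return os.path.join(self.file_path, "dataset_%d_files" % self.id)


dataset = ToyDataset(42)
# Where the uploaded files would still be sitting at set_meta time (made-up layout).
working_copy = os.path.join("job_working_directory", "42", "dataset_42_files")
# True while the files have not been moved yet.
paths_disagree = dataset.extra_files_path != working_copy
```

Since the object is created before any job exists, the only way this toy version could point at the working directory is if a caller passed that location in later.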
Thoughts? This issue affects Rexp.set_peek also, as well as any other
functions that may want to read data from the uploaded files before they are
moved to their permanent location. This is why, if you have an affybatch file and its
associated pheno file and you test this on, say, the public Galaxy server at
http://main.g2.bx.psu.edu/ you'll see that the peek info says (for example):
"##failed to find
It seems that if the current behavior of Dataset.file_path, Dataset.file_name,
and Dataset.extra_files_path is part of the desired design of Galaxy, then methods
like set_meta should be run after the files have been moved to config.file_path
so they can set metadata based on file contents. It looks like this is
intended to happen at least in some cases, from looking at
lib/galaxy/jobs/__init__.py:568-586. However, in my tests this code is not
kicking in because hda_tool_output is None.
Any clarification on what's happening here, what's supposed to be happening for
setting metadata on (potentially composite) uploads, why dataset.set_meta()
isn't already being called after the files are moved to config.file_path, or
any insights on related Galaxy design decisions I may not know about or design
constraints I may have missed would be very greatly appreciated.
I'd also be glad to provide further detail or test files upon request.
PS: Further notes on passing the job working directory to set_meta or set_peek
- I have been successful modifying the code to do this for set_meta since the
call chain starting from dataset.set_meta() in JobWrapper.finish() to
Rexp.set_meta() accepts and forwards keyword argument dictionaries along the
way. However, set_peek does not accept arbitrary keyword arguments, making it
harder to pass along the job working directory when needed without stepping on
the toes of any other code.
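The difference between the two call chains can be sketched as follows. This is a toy model rather than the Galaxy code: ToyDatatype, call_set_meta, the dictionary standing in for a dataset, and the job_working_directory keyword are all names invented for the example. It only shows why a method taking **kwd can receive an extra argument through intermediate layers while a fixed signature cannot.

```python
import os
import tempfile


class ToyDatatype:
    def set_meta(self, dataset, **kwd):
        # Prefer the (hypothetical) job_working_directory keyword if supplied.
        base = kwd.get("job_working_directory", dataset["extra_files_path"])
        with open(os.path.join(base, dataset["pheno_name"])) as handle:
            dataset["columns"] = handle.readline().rstrip("\n").split("\t")

    def set_peek(self, dataset):
        # Fixed signature: no **kwd, so extra keywords raise TypeError.
        dataset["peek"] = "columns: %s" % ", ".join(dataset.get("columns", []))


def call_set_meta(datatype, dataset, **kwd):
    # Intermediate layer that forwards keyword arguments untouched.
    datatype.set_meta(dataset, **kwd)


def demo():
    with tempfile.TemporaryDirectory() as working_dir:
        with open(os.path.join(working_dir, "example.pheno"), "w") as handle:
            handle.write("sample\tgroup\n")
        dataset = {
            "extra_files_path": os.path.join("database", "files", "dataset_1_files"),
            "pheno_name": "example.pheno",
        }
        datatype = ToyDatatype()
        call_set_meta(datatype, dataset, job_working_directory=working_dir)
        try:
            datatype.set_peek(dataset, job_working_directory=working_dir)
            peek_accepts_keyword = True
        except TypeError:
            peek_accepts_keyword = False
        return dataset["columns"], peek_accepts_keyword
```

In this sketch set_meta reads the header from the working directory, while the same keyword passed to set_peek is rejected, which is the asymmetry described above.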