Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Paniagua, Eric Mon, 03 Oct 2011 11:35:05 -0700

Hi Dan,

Sure, here's the example where I discovered the bug (data files are not 
attached because of a 5MB limit on my email client; see 
http://main.g2.bx.psu.edu/u/paniag/h/metadata-bug-example for a history with 
the example dataset).  The datatype is AffyBatch (or probably anything derived 
from RexpBase) in lib/galaxy/datatypes/genetics.py.


  1.  On Galaxy main (http://main.g2.bx.psu.edu/) go to Get Data -> Upload File
  2.  Select "affybatch" under file format.
  3.  Choose the attached files for upload in their corresponding fields.
  4.  Select genome "mm9"
  5.  Hit Execute
  6.  Wait for job completion, which occurs successfully (ie end up green)
  7.  Click to expand new history item
  8.  Note that the peek box displays an error message

I further tested by trying to use the newly uploaded dataset with the reQC tool 
from the Rexpression library (I don't see it on the main site).  By looking 
into that code I realized metadata wasn't getting set, and (as detailed below) 
I tracked the problem through the upload tool, (local) tool runner, job 
wrapper, etc and discovered the metadata problem.

Best,
Eric

________________________________
From: Daniel Blankenberg [[email protected]]
Sent: Monday, October 03, 2011 2:00 PM
To: Paniagua, Eric
Cc: Nate Coraor; [email protected]
Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Hi Eric,

Can you provide an example of the composite datatype that is experiencing the 
metadata problems?

The code and tool hooks are deprecated, but there could still be some instances 
where they may need to be used.  However, dynamic options (see eg "<options 
from_dataset="input1">"  from tools/filters/extract_GFF_Features.xml) can 
handle much of the functions where code files are used for population 
parameters. Additionally, tool output actions (see e.g. 
tools/filters/cutWrapper.xml) can be used to set metadata ( e.g. column 
assignments ) on tool outputs without using hooks. Please let us know if we can 
provide additional information.


Thanks for using Galaxy,

Dan


On Sep 30, 2011, at 1:15 PM, Paniagua, Eric wrote:

Hi Nate,

Thanks for your answers.  I will look into setting set_metadata_externally=True.

I've observed no impact (suggesting someone did error handling properly), but 
upload is the only tool I've tested it with so far.  I'll be doing more 
shortly, but going with the exec_after_process solution since I don't then need 
to modify core code here.

I'm glad to hear that abstraction layer is in the works.

Regarding the upload and general handling of compressed files, please refer to 
my response to Brent a short time ago, subject line "[galaxy-user] upload zip 
file to custom tool".  In particular, I pointed out that it doesn't work even 
with a manually set file type and why, at least in terms of code.  I was hoping 
you might know if it was that way on purpose and, if so, why.

Regarding deprecation of the <code> tag and of tool hooks (am I correct to 
understand that these are deprecated too?), there's been an example at the top 
of my mind (since I'm going to start working on this today).  In the abstract, 
the scenario is dynamically populating the tool UI based on a computation on 
the contents of one or more input datasets.  To an extent, the <conditional> 
and <page> tags are helpful with this, but without the <code> tag I'm not clear 
on how to call out to code that inspects the datasets and returns, for the sake 
of simplicity, options for a single parameter.

My specific example has to do with constructing a microarray expression 
analysis pipeline, with one of my starting points being Ross Lazarus's 
Rexpression code.  I may be able to get around the issue by storing the 
relevant information in metadata (in a custom datatype's set_meta).  Not sure 
yet.  Of course, that relies on metadata being handled correctly.  If I run 
into something more specific, I'll start a new thread.

Thanks,
Eric

________________________________________
From: Nate Coraor [[email protected]]
Sent: Friday, September 30, 2011 11:03 AM
To: Paniagua, Eric
Cc: [email protected]<mailto:[email protected]>
Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Paniagua, Eric wrote:
Hi Nate,

Thank you for your response!  I am glad that it was you in particular who did 
respond, because I also have some questions about the way the upload tool 
handles compressed files and saw that you have opened several Issues related to 
this on the Galaxy bitbucket site.  First though, I'll fill you in on my 
further progress on the composite file issue.

As I mentioned in my original email, the trouble is that JobWrapper.finish() 
calls dataset.set_meta() before it calls collect_associated_files(), resulting 
dataset.extra_files_path being inaccurate because the files haven't been moved 
yet from the job working directory.  This is all with 
set_metadata_externally=False.  (I haven't worked with setting metadata 
externally yet, but I think it is worth verifying whether everything works 
correctly for the case I pointed out when set_metadata_externally=True.)

Since my last email, I poked around a bit more and found that my suggested 
short patch was not correct but incomplete.  The core problem is that component 
files are not moved with the primary file, so I changed that (patch attached, 
relative to { https://bitbucket.org/galaxy/galaxy-dist 5955:949e4f5fa03a }.  
Early in JobWrapper.finish() the primary file is moved from the working 
directory to the appropriate directory under config.file_path.  This patch uses 
the structure of the path naming convention to build the accurate path to the 
component files, and then moves them along with the primary file.  It's the 
least invasive (in terms of modifying Galaxy core code) potential fix I came up 
with, but since it relies explicitly on path structure and naming conventions I 
still think it's a bit of a hack.  However, it does seem to work, at least to 
me.  I don't have the resources or experience yet to attempt to test whether it 
negatively (or positively!) impacts others' use cases!
.  I do hope that someone more intimately familiar with the project would be 
able to vet it and make improvements / provide feedback / incorporate a fix for 
this issue to the Galaxy mainline.

However, since the group I'm working for operates downstream of the main Galaxy 
Development Team (primarily to add tools and maintain a local Galaxy server for 
our institution), we really try to avoid complications in merges when pulling 
down updates by avoiding changing Galaxy core code as much as possible.  
Consequently, I had to shift focus from finding the best correct fix to simply 
finding a workaround, and I found a very simple one.  However, it relies on the 
deprecated <code> tool config tag, so it will only work until support for the 
<code> tag is removed.  It's self-explanatory and also attached.

Hi Eric,

I haven't tried out the patch, but would this have an impact on
processing that occurs when collect_associated_files() runs since they
have been moved back already by your code?

I'd suggest enabling set_metadata_externally = True as I suspect this
will solve the problems since metadata will be set while the primary
file is still in the temporary location.  set_metadata_externally = True
should really be the default setting and we may remove internal metadata
setting entirely at some point.

Regarding potential fixes: Is there any reason why component files of a 
composite dataset should not always follow the primary file?  I don't know of 
one, but maybe there is some case in running Galaxy over a cluster where there 
is reason to do otherwise.  If the answer is no, it seems the best approach is 
to add an abstraction layer between the datasets and the file system to avoid 
various paths that might be associated to a dataset from falling out of sync.  
One (obviously motivational) function of this layer would be to move or copy 
datasets on the file system atomically.  This seems pretty important especially 
for enabling even more flexible support of distributed or clustered 
installations.

We're going to be working on just such an abstraction layer shortly.

Regarding the upload tool: I just saw your response to my other email on 
Leandro's thread.  Thanks for pointing me to the code responsible.  I've 
already forked a version of the upload tool for our Galaxy instance.  There are 
a lot of issues with the way the way the default upload tool works, and it 
sounds like you're aware of that and perhaps it's been a headache for you too.  
If it might help make progress more swift in the Galaxy mainline, I'd be glad 
to send you patches for updates and changes I make in the upload tool.  I do 
have at least one big question about it though: It currently attempts to sniff 
and uncompress files regardless of whether "Auto-detect" was selected for "File 
format"; is that by design or accident?  It seems that (particularly the bits 
try uncompressing files) causes headaches for a number of developers/users.  
I'm likely to remove that behavior from my version of the upload tool, but I'll 
be less inclined to do so if that will make it diverge si!
gnificantly from Galaxy's expected design path.

We definitely need to make it easier to upload compressed files that
intentionally stay compressed.  This should be the case when setting the
specific file type, though.

Finally, I noted that the <code> tag is deprecated.  I learned this from the 
Tool Config Syntax page on the Wiki 
(http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax#A.3Ccode.3E_tag_set).
  It says there that it and the tool hooks (like exec_after_process) are being 
replaced, but doesn't give many details.  Is there anything you can tell me 
about what syntax or mechanism is planned to replace them?  The ability to 
generate UI parameters, have the tool page itself interact with other data 
sources and programs, use hooks to control tool execution, and generally 
interleave arbitrary code with tool/job runs is very useful.  Unfortunately, 
the tools page appears to be out of date or just not remotely comprehensive; 
I've found many examples of syntax in other tools' configurations that are 
simply not documented on the Tool Config Syntax page at all.

It's true that <code> is deprecated, although I can't provide much
useful information beyond that.  If you post up what you're using hooks
for one of my colleagues with more familiarity with tool configs can
probably reply with the best way to accomplish what you're doing without
hooks.

--nate


Please do let me know if you're able to doing anything (with or without my 
suggestions) about the issue I ran into with the way Galaxy handles composite 
datasets on the file system.  Or anything more generally about composite 
dataset design intentions, upload functionality, and known issues.  I'd greatly 
appreciate it.  Thanks for answering my email!


Thanks,
Eric


PS: I'm also unable to download composite datasets that I successfully 
uploaded; instead I get an error page saying the URL path doesn't exist.  
However, I haven't dug into this one enough to think it's not a local 
configuration problem on my end.  Do you recognize this as (related to) a known 
issue?

________________________________________
From: Nate Coraor [[email protected]]
Sent: Wednesday, September 28, 2011 9:24 AM
To: Paniagua, Eric
Cc: [email protected]<mailto:[email protected]>
Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Paniagua, Eric wrote:
Hi all,

Can anyone tell me why JobWrapper.finish() moves the primary dataset file 
dataset_path.false_path to dataset_path.real_path (contingent on 
config.outputs_to_working_directory == True) but does not move the "extra 
files"?  (lib/galaxy/jobs/__init__.py:540-553)  It seems to me that if you want 
to move a dataset, you want to move the whole dataset, and that perhaps this 
should be factored out, perhaps into the galaxy.util module?

Why does class DatasetPath only account for the path to the primary file and 
not the path to the "extra files"?  It could be used to account for the "extra 
files" by path splitting as in my previous suggested bug fix, but only if that 
fix is correct.  It doesn't seem to be used for that purpose in the Galaxy code.

I look forward to an informative response.

Hi Eric,

Sorry for the delay in responding, and thanks for your very detailed
digging into this problem.  To answer your above question, DatasetPath
only deals with the primary dataset because extra files are already
written to the working directory and moved back by the wrapper with
collect_associated_files().

Making it possible to read these extra files from the working directory
will be necessary when set_metadata_externally = True in the config,
although I am surprised this has been broken the whole time.  Have you
made progress past your last email?  I can pick up from wherever you've
left off.


Thanks,
Eric Paniagua

________________________________________
From: 
[email protected]<mailto:[email protected]> 
[[email protected]] on behalf of Paniagua, Eric 
[[email protected]]
Sent: Monday, September 12, 2011 7:37 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Hello again,

It looks like the config.outputs_to_working_directory variable is intended to 
do something closely related, but setting it to either of True and False does 
not in fact fix the problem.

The output path for files in a composite dataset upload (dataset.files_path) 
that is used in the tools/data_source/upload.xml tool is set to a path under 
the job working directory by lib/galaxy/tools/__init__.py:1519.  The preceding 
code (lines 1507-1516) select the path for the primary file contingent on 
config.outputs_to_working_directory.

Why is the path set in line 1519 not contingent on 
config.outputs_to_working_directory?  Indeed, the following small change fixes 
the bug I'm observing:

diff -r 949e4f5fa03a lib/galaxy/tools/__init__.py
--- a/lib/galaxy/tools/__init__.py      Mon Aug 29 14:42:04 2011 -0400
+++ b/lib/galaxy/tools/__init__.py      Mon Sep 12 19:32:26 2011 -0400
@@ -1516,7 +1516,9 @@
                param_dict[name] = DatasetFilenameWrapper( hda )
            # Provide access to a path to store additional files
            # TODO: path munging for cluster/dataset server relocatability
-            param_dict[name].files_path = os.path.abspath(os.path.join( 
job_working_directory, "dataset_%s_files" % (hda.dataset.id) ))
+            #param_dict[name].files_path = os.path.abspath(os.path.join( 
job_working_directory, "dataset_%s_files" % (hda.dataset.id) ))
+            # This version should make it always follow the primary file
+            param_dict[name].files_path = os.path.abspath( os.path.join( 
os.path.split( param_dict[name].file_name )[0], "dataset_%s_files" % 
(hda.dataset.id) ))
            for child in hda.children:
                param_dict[ "_CHILD___%s___%s" % ( name, child.designation ) ] 
= DatasetFilenameWrapper( child )
        for out_name, output in self.outputs.iteritems():

Would this break anything?

If that cannot be changed, would the best solution be to modify the upload tool 
so that it took care of this on its own?  That seems readily doable, but starts 
to decentralize control of data flow policy.

Please advise.

Thanks,
Eric Paniagua
________________________________________
From: 
[email protected]<mailto:[email protected]> 
[[email protected]] on behalf of Paniagua, Eric 
[[email protected]]
Sent: Monday, September 12, 2011 1:45 PM
To: [email protected]<mailto:[email protected]>
Subject: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Hi everyone,

I've been getting my feet wet with Galaxy development working to get some of 
the rexpression tools online, and I've run into a snag that I've traced back to 
a set_meta datatype method not being able to find a file from which it wants to 
extract metadata.  After reading the code, I believe this would also be a 
problem for non-composite datatypes.

The specific test case I've been looking at is uploading an affybatch file (and 
associated pheno file) using Galaxy's built-in upload tool and selecting the 
File Format manually (ie choosing "affybatch" in the dropdown).  I am using 
unmodified datatype definitions provided in lib/galaxy/datatypes/genetics.py 
and unmodified core Galaxy upload code as of 5955:949e4f5fa03a.  (I am also 
testing with modified versions, but I am able to reproduce and track this bug 
in the specified clean version).

The crux of the cause of error is that in JobWrapper.finish(), 
dataset.set_meta() is called (lib/galaxy/jobs/__init__.py:607) before the 
composite dataset uploaded files are moved (in a call to a Tool method 
"self.tool.collect_associated_files(out_data, self.working_directory)" on line 
670) from the job working directory to the final destination under 
config.file_path (which defaults to "database/files").

In my test case, "database.set_meta( overwrite = False )" eventually calls 
lib/galaxy/datatypes/genetics.py:Rexp.set_meta(dataset, **kwd).  As far as I 
can tell, the only ways to construct a path to a file (or the file) in a 
dataset without using hard-coded paths from external knowledge is to use the 
Dataset.get_file_name or Dataset.extra_files_path properties.  Unless 
explicitly told otherwise, both of these methods construct a path based on the 
Dataset.file_path class data member, whose value is set during Galaxy startup 
to config.file_path (default "database/files").  However, at the time set_meta 
is called in this case, the files are not under config.file_path, but rather 
under the job working directory.  Attempting to open files from the dataset 
therefore fails when using these paths.  However, unless the job working 
directory is passed to set_meta or during construction of the underlying 
Dataset object, there doesn't appear to be a way for a Dataset method to acces!
s th!
e currently running job (for instance to get its job ID or working directory).  
(The second suggestion is actually not possible; since the standard upload is 
asynchronous, the Dataset object is created (and persisted) before the Job that 
will process it is created.)

Thoughts?  This issue affects Rexp.set_peek also, as well as any other 
functions that may want to read data from the uploaded files before they are 
moved to permanent location.  This is why if you have an affybatch file and its 
associated pheno file and you test this on, say, the public Galaxy server at 
http://main.g2.bx.psu.edu/ you'll see that the peek info says (for example): 
"##failed to find 
/galaxy/main_database/files/002/948/dataset_2948818_files/affybatch_test.pheno"

It seems that if the current way that Dataset.file_path, Dataset.file_name, and 
Dataset.extra_files_path is part of the desired design of Galaxy, that methods 
like set_meta should be run after the files have been moved to config.file_path 
so they can set metadata based on file contents.  It looks like this is 
intended to happen at least in some cases, from looking at 
lib/galaxy/jobs/__init__.py:568-586.  However, in my tests this code is not 
kicking in because hda_tool_output is None.

Any clarification on what's happening here, what's supposed to be happening for 
setting metadata on (potentially composite) uploads, why dataset.set_meta() 
isn't already being called after the files are moved to config.file_path, or 
any insights on related Galaxy design decisions I may not know about or design 
constraints I may have missed would be very greatly appreciated.

I'd also be glad to provide further detail or test files upon request.

Thank you,
Eric Paniagua

PS: Further notes on passing the job working directory to set_meta or set_peek 
- I have been successful modifying the code to do this for set_meta since the 
call chain starting from dataset.set_meta() in JobWrapper.finish() to 
Rexp.set_meta() accepts and forwards keyword argument dictionaries along the 
way.  However, set_peek does not accept arbitrary keyword arguments, making it 
harder to pass along the job working directory when needed without stepping on 
the toes of any other code.

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/

# This file provides a workaround to the problem of premature set_meta calls in
# JobWrapper.finish(), which calls set_meta before the component files of a
# a composite dataset have been moved to their correct paths. This workaround
# exploits the fact that the exec_after_process hook is called after the 
component
# files have been moved.
#
# This problem affects tools that produce composite datasets in their output.
#
# To use this workaround, add the following to your tool's XML configuration
# (assuming it's in the same directory as this file):
#
#       <code file="composite_metadata_workaround.py"/>
#
# Also, composite types should set peek to start with '##failed' to be 
compatible.

# Retry set_peek and set_meta after the tool has collected all files
def exec_after_process( app, inp_data, out_data, param_dict,tool, stdout, 
stderr):
   for name, data in out_data.items():
       if data.peek.find( '##failed' ) == 0:
           data.set_meta( overwrite=False )
           data.set_peek( is_multi_byte=data.is_multi_byte() )
           app.model.context.add( data )
           app.model.context.flush()

diff -r -c -x '.hg*' psu_dist/lib/galaxy/jobs/__init__.py 
glxtemp/lib/galaxy/jobs/__init__.py
*** psu_dist/lib/galaxy/jobs/__init__.py      Fri Sep 16 11:03:48 2011
--- glxtemp/lib/galaxy/jobs/__init__.py       Wed Sep 28 13:58:28 2011
***************
*** 551,556 ****
--- 551,574 ----
                     else:
                         self.fail( "Job %s's output dataset(s) could not be 
read" % job.id )
                         return
+                 efp_fake = dataset_path.false_path.replace( '.dat', '_files' )
+                 efp_real = dataset_path.real_path.replace( '.dat', '_files' )
+                 try:
+                     shutil.move( efp_fake, efp_real )
+                     log.debug( "finish(): Moved %s to %s" % (efp_fake, 
efp_real) )
+                 except ( IOError, OSError ) as e:
+                     # this can happen if Galaxy is restarted during the job's
+                     # finish method - the false_path file has already moved,
+                     # and when the job is recovered, it won't be found.
+                     if os.path.exists( efp_fake ):
+                         log.warning( "finish(): Path %s does exist, but could 
not be moved: " % (efp_fake, e.strerror) )
+                     else:
+                         log.warning( "finish(): Path %s doesn't exist" )
+                     if os.path.exists( efp_real ) and os.path.isdir( efp_real 
):
+                         log.warning( "finish(): %s found, so it will be used 
instead" % ( efp_fake, efp_real ) )
+                     else:
+                         self.fail( "Job %s's output dataset(s) could not be 
read" % job.id )
+                         return
         job_context = ExpressionContext( dict( stdout = stdout, stderr = 
stderr ) )
         job_tool = self.app.toolbox.tools_by_id.get( job.tool_id, None )
         def in_directory( file, directory ):
diff -r -c -x '.hg*' psu_dist/lib/galaxy/tools/__init__.py 
glxtemp/lib/galaxy/tools/__init__.py
*** psu_dist/lib/galaxy/tools/__init__.py     Fri Sep 16 11:03:48 2011
--- glxtemp/lib/galaxy/tools/__init__.py      Wed Sep 28 13:58:44 2011
***************
*** 1516,1522 ****
                 param_dict[name] = DatasetFilenameWrapper( hda )
             # Provide access to a path to store additional files
             # TODO: path munging for cluster/dataset server relocatability
!             param_dict[name].files_path = os.path.abspath(os.path.join( 
job_working_directory, "dataset_%s_files" % (hda.dataset.id) ))
             for child in hda.children:
                 param_dict[ "_CHILD___%s___%s" % ( name, child.designation ) ] 
= DatasetFilenameWrapper( child )
         for out_name, output in self.outputs.iteritems():
--- 1516,1524 ----
                 param_dict[name] = DatasetFilenameWrapper( hda )
             # Provide access to a path to store additional files
             # TODO: path munging for cluster/dataset server relocatability
!             #param_dict[name].files_path = os.path.abspath(os.path.join( 
job_working_directory, "dataset_%s_files" % (hda.dataset.id) ))
!             # This version should make it always follow the primary file
!             param_dict[name].files_path = os.path.abspath( os.path.join( 
os.path.split( param_dict[name].file_name )[0], os.path.split( 
param_dict[name].file_name )[1].replace( '.dat', '_files' ) ))
             for child in hda.children:
                 param_dict[ "_CHILD___%s___%s" % ( name, child.designation ) ] 
= DatasetFilenameWrapper( child )
         for out_name, output in self.outputs.iteritems():


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/

Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Reply via email to