Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

2011-10-03 Thread Nate Coraor
Paniagua, Eric wrote:
 Hi Nate,
 
 Thanks for your answers.  I will look into setting 
 set_metadata_externally=True.
 
 I've observed no impact (suggesting someone did error handling properly), but 
 upload is the only tool I've tested it with so far.  I'll be doing more 
 shortly, but going with the exec_after_process solution since I don't then 
 need to modify core code here.
 
 I'm glad to hear that abstraction layer is in the works.
 
 Regarding the upload and general handling of compressed files, please refer 
 to my response to Brent a short time ago, subject line [galaxy-user] upload 
 zip file to custom tool.  In particular, I pointed out that it doesn't work 
 even with a manually set file type and why, at least in terms of code.  I was 
 hoping you might know if it was that way on purpose and, if so, why.

It's not really that way on purpose, you should be able to define a
filetype and bypass the unarchiving, but as you've discovered, this
can't be done without modifying the code.  This is essentially a bug.

There are two issues to track related to fixing this:

https://bitbucket.org/galaxy/galaxy-central/issue/280/allow-zip-file-uploads


https://bitbucket.org/galaxy/galaxy-central/issue/304/make-binary-datatype-uploads-pluggable

Thanks
--nate

 
 Regarding deprecation of the code tag and of tool hooks (am I correct to 
 understand that these are deprecated too?), there's been an example at the 
 top of my mind (since I'm going to start working on this today).  In the 
 abstract, the scenario is dynamically populating the tool UI based on a 
 computation on the contents of one or more input datasets.  To an extent, the 
 conditional and page tags are helpful with this, but without the code 
 tag I'm not clear on how to call out to code that inspects the datasets and 
 returns, for the sake of simplicity, options for a single parameter.
 
 My specific example has to do with constructing a microarray expression 
 analysis pipeline, with one of my starting points being Ross Lazarus's 
 Rexpression code.  I may be able to get around the issue by storing the 
 relevant information in metadata (in a custom datatype's set_meta).  Not sure 
 yet.  Of course, that relies on metadata being handled correctly.  If I run 
 into something more specific, I'll start a new thread.
 
 Thanks,
 Eric
 
 
 From: Nate Coraor [n...@bx.psu.edu]
 Sent: Friday, September 30, 2011 11:03 AM
 To: Paniagua, Eric
 Cc: galaxy-dev@lists.bx.psu.edu
 Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
 
 Paniagua, Eric wrote:
  Hi Nate,
 
  Thank you for your response!  I am glad that it was you in particular who 
  did respond, because I also have some questions about the way the upload 
  tool handles compressed files and saw that you have opened several Issues 
  related to this on the Galaxy bitbucket site.  First though, I'll fill you 
  in on my further progress on the composite file issue.
 
  As I mentioned in my original email, the trouble is that 
  JobWrapper.finish() calls dataset.set_meta() before it calls 
  collect_associated_files(), resulting dataset.extra_files_path being 
  inaccurate because the files haven't been moved yet from the job working 
  directory.  This is all with set_metadata_externally=False.  (I haven't 
  worked with setting metadata externally yet, but I think it is worth 
  verifying whether everything works correctly for the case I pointed out 
  when set_metadata_externally=True.)
 
  Since my last email, I poked around a bit more and found that my suggested 
  short patch was not correct but incomplete.  The core problem is that 
  component files are not moved with the primary file, so I changed that 
  (patch attached, relative to { https://bitbucket.org/galaxy/galaxy-dist 
  5955:949e4f5fa03a }.  Early in JobWrapper.finish() the primary file is 
  moved from the working directory to the appropriate directory under 
  config.file_path.  This patch uses the structure of the path naming 
  convention to build the accurate path to the component files, and then 
  moves them along with the primary file.  It's the least invasive (in terms 
  of modifying Galaxy core code) potential fix I came up with, but since it 
  relies explicitly on path structure and naming conventions I still think 
  it's a bit of a hack.  However, it does seem to work, at least to me.  I 
  don't have the resources or experience yet to attempt to test whether it 
  negatively (or positively!) impacts others' use cas!
 es.  I do hope that someone more intimately familiar with the project would be 
able to vet it and make improvements / provide feedback / incorporate a fix for 
this issue to the Galaxy mainline.
 
  However, since the group I'm working for operates downstream of the main 
  Galaxy Development Team (primarily to add tools and maintain a local Galaxy 
  server for our institution), we really try to avoid complications in merges

Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

2011-10-03 Thread Paniagua, Eric
Hi Dan,

Sure, here's the example where I discovered the bug (data files are not 
attached because of a 5MB limit on my email client; see 
http://main.g2.bx.psu.edu/u/paniag/h/metadata-bug-example for a history with 
the example dataset).  The datatype is AffyBatch (or probably anything derived 
from RexpBase) in lib/galaxy/datatypes/genetics.py.

  1.  On Galaxy main (http://main.g2.bx.psu.edu/) go to Get Data - Upload File
  2.  Select affybatch under file format.
  3.  Choose the attached files for upload in their corresponding fields.
  4.  Select genome mm9
  5.  Hit Execute
  6.  Wait for job completion, which occurs successfully (ie end up green)
  7.  Click to expand new history item
  8.  Note that the peek box displays an error message

I further tested by trying to use the newly uploaded dataset with the reQC tool 
from the Rexpression library (I don't see it on the main site).  By looking 
into that code I realized metadata wasn't getting set, and (as detailed below) 
I tracked the problem through the upload tool, (local) tool runner, job 
wrapper, etc and discovered the metadata problem.

Best,
Eric


From: Daniel Blankenberg [d...@bx.psu.edu]
Sent: Monday, October 03, 2011 2:00 PM
To: Paniagua, Eric
Cc: Nate Coraor; galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Hi Eric,

Can you provide an example of the composite datatype that is experiencing the 
metadata problems?

The code and tool hooks are deprecated, but there could still be some instances 
where they may need to be used.  However, dynamic options (see eg options 
from_dataset=input1  from tools/filters/extract_GFF_Features.xml) can 
handle much of the functions where code files are used for population 
parameters. Additionally, tool output actions (see e.g. 
tools/filters/cutWrapper.xml) can be used to set metadata ( e.g. column 
assignments ) on tool outputs without using hooks. Please let us know if we can 
provide additional information.


Thanks for using Galaxy,

Dan


On Sep 30, 2011, at 1:15 PM, Paniagua, Eric wrote:

Hi Nate,

Thanks for your answers.  I will look into setting set_metadata_externally=True.

I've observed no impact (suggesting someone did error handling properly), but 
upload is the only tool I've tested it with so far.  I'll be doing more 
shortly, but going with the exec_after_process solution since I don't then need 
to modify core code here.

I'm glad to hear that abstraction layer is in the works.

Regarding the upload and general handling of compressed files, please refer to 
my response to Brent a short time ago, subject line [galaxy-user] upload zip 
file to custom tool.  In particular, I pointed out that it doesn't work even 
with a manually set file type and why, at least in terms of code.  I was hoping 
you might know if it was that way on purpose and, if so, why.

Regarding deprecation of the code tag and of tool hooks (am I correct to 
understand that these are deprecated too?), there's been an example at the top 
of my mind (since I'm going to start working on this today).  In the abstract, 
the scenario is dynamically populating the tool UI based on a computation on 
the contents of one or more input datasets.  To an extent, the conditional 
and page tags are helpful with this, but without the code tag I'm not clear 
on how to call out to code that inspects the datasets and returns, for the sake 
of simplicity, options for a single parameter.

My specific example has to do with constructing a microarray expression 
analysis pipeline, with one of my starting points being Ross Lazarus's 
Rexpression code.  I may be able to get around the issue by storing the 
relevant information in metadata (in a custom datatype's set_meta).  Not sure 
yet.  Of course, that relies on metadata being handled correctly.  If I run 
into something more specific, I'll start a new thread.

Thanks,
Eric


From: Nate Coraor [n...@bx.psu.edu]
Sent: Friday, September 30, 2011 11:03 AM
To: Paniagua, Eric
Cc: galaxy-dev@lists.bx.psu.edumailto:galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Paniagua, Eric wrote:
Hi Nate,

Thank you for your response!  I am glad that it was you in particular who did 
respond, because I also have some questions about the way the upload tool 
handles compressed files and saw that you have opened several Issues related to 
this on the Galaxy bitbucket site.  First though, I'll fill you in on my 
further progress on the composite file issue.

As I mentioned in my original email, the trouble is that JobWrapper.finish() 
calls dataset.set_meta() before it calls collect_associated_files(), resulting 
dataset.extra_files_path being inaccurate because the files haven't been moved 
yet from the job working directory.  This is all with 
set_metadata_externally=False.  (I haven't worked with setting metadata 
externally yet

Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

2011-10-03 Thread Paniagua, Eric
Hi Dan,

Regarding the tool output actions, could you point me to any documentation or 
additional good examples for handling tool output post-processing?

Thanks,
Eric


From: Daniel Blankenberg [d...@bx.psu.edu]
Sent: Monday, October 03, 2011 2:00 PM
To: Paniagua, Eric
Cc: Nate Coraor; galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Hi Eric,

Can you provide an example of the composite datatype that is experiencing the 
metadata problems?

The code and tool hooks are deprecated, but there could still be some instances 
where they may need to be used.  However, dynamic options (see eg options 
from_dataset=input1  from tools/filters/extract_GFF_Features.xml) can 
handle much of the functions where code files are used for population 
parameters. Additionally, tool output actions (see e.g. 
tools/filters/cutWrapper.xml) can be used to set metadata ( e.g. column 
assignments ) on tool outputs without using hooks. Please let us know if we can 
provide additional information.


Thanks for using Galaxy,

Dan


On Sep 30, 2011, at 1:15 PM, Paniagua, Eric wrote:

Hi Nate,

Thanks for your answers.  I will look into setting set_metadata_externally=True.

I've observed no impact (suggesting someone did error handling properly), but 
upload is the only tool I've tested it with so far.  I'll be doing more 
shortly, but going with the exec_after_process solution since I don't then need 
to modify core code here.

I'm glad to hear that abstraction layer is in the works.

Regarding the upload and general handling of compressed files, please refer to 
my response to Brent a short time ago, subject line [galaxy-user] upload zip 
file to custom tool.  In particular, I pointed out that it doesn't work even 
with a manually set file type and why, at least in terms of code.  I was hoping 
you might know if it was that way on purpose and, if so, why.

Regarding deprecation of the code tag and of tool hooks (am I correct to 
understand that these are deprecated too?), there's been an example at the top 
of my mind (since I'm going to start working on this today).  In the abstract, 
the scenario is dynamically populating the tool UI based on a computation on 
the contents of one or more input datasets.  To an extent, the conditional 
and page tags are helpful with this, but without the code tag I'm not clear 
on how to call out to code that inspects the datasets and returns, for the sake 
of simplicity, options for a single parameter.

My specific example has to do with constructing a microarray expression 
analysis pipeline, with one of my starting points being Ross Lazarus's 
Rexpression code.  I may be able to get around the issue by storing the 
relevant information in metadata (in a custom datatype's set_meta).  Not sure 
yet.  Of course, that relies on metadata being handled correctly.  If I run 
into something more specific, I'll start a new thread.

Thanks,
Eric


From: Nate Coraor [n...@bx.psu.edu]
Sent: Friday, September 30, 2011 11:03 AM
To: Paniagua, Eric
Cc: galaxy-dev@lists.bx.psu.edumailto:galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Paniagua, Eric wrote:
Hi Nate,

Thank you for your response!  I am glad that it was you in particular who did 
respond, because I also have some questions about the way the upload tool 
handles compressed files and saw that you have opened several Issues related to 
this on the Galaxy bitbucket site.  First though, I'll fill you in on my 
further progress on the composite file issue.

As I mentioned in my original email, the trouble is that JobWrapper.finish() 
calls dataset.set_meta() before it calls collect_associated_files(), resulting 
dataset.extra_files_path being inaccurate because the files haven't been moved 
yet from the job working directory.  This is all with 
set_metadata_externally=False.  (I haven't worked with setting metadata 
externally yet, but I think it is worth verifying whether everything works 
correctly for the case I pointed out when set_metadata_externally=True.)

Since my last email, I poked around a bit more and found that my suggested 
short patch was not correct but incomplete.  The core problem is that component 
files are not moved with the primary file, so I changed that (patch attached, 
relative to { https://bitbucket.org/galaxy/galaxy-dist 5955:949e4f5fa03a }.  
Early in JobWrapper.finish() the primary file is moved from the working 
directory to the appropriate directory under config.file_path.  This patch uses 
the structure of the path naming convention to build the accurate path to the 
component files, and then moves them along with the primary file.  It's the 
least invasive (in terms of modifying Galaxy core code) potential fix I came up 
with, but since it relies explicitly on path structure and naming conventions I 
still think it's a bit of a hack

Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

2011-10-03 Thread Daniel Blankenberg
Hi Eric,

I was able to reproduce the error.  We'll work on a fix for this, but, for now, 
you can fix the metadata after uploading by clicking on the pencil icon and 
clicking Auto-detect. Thanks for reporting this error.


Thanks for using Galaxy,

Dan


On Oct 3, 2011, at 2:34 PM, Paniagua, Eric wrote:

 Hi Dan,
 
 Sure, here's the example where I discovered the bug (data files are not 
 attached because of a 5MB limit on my email client; see 
 http://main.g2.bx.psu.edu/u/paniag/h/metadata-bug-example for a history with 
 the example dataset).  The datatype is AffyBatch (or probably anything 
 derived from RexpBase) in lib/galaxy/datatypes/genetics.py.
 
  1.  On Galaxy main (http://main.g2.bx.psu.edu/) go to Get Data - Upload File
  2.  Select affybatch under file format.
  3.  Choose the attached files for upload in their corresponding fields.
  4.  Select genome mm9
  5.  Hit Execute
  6.  Wait for job completion, which occurs successfully (ie end up green)
  7.  Click to expand new history item
  8.  Note that the peek box displays an error message
 
 I further tested by trying to use the newly uploaded dataset with the reQC 
 tool from the Rexpression library (I don't see it on the main site).  By 
 looking into that code I realized metadata wasn't getting set, and (as 
 detailed below) I tracked the problem through the upload tool, (local) tool 
 runner, job wrapper, etc and discovered the metadata problem.
 
 Best,
 Eric
 
 
 From: Daniel Blankenberg [d...@bx.psu.edu]
 Sent: Monday, October 03, 2011 2:00 PM
 To: Paniagua, Eric
 Cc: Nate Coraor; galaxy-dev@lists.bx.psu.edu
 Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
 
 Hi Eric,
 
 Can you provide an example of the composite datatype that is experiencing the 
 metadata problems?
 
 The code and tool hooks are deprecated, but there could still be some 
 instances where they may need to be used.  However, dynamic options (see eg 
 options from_dataset=input1  from 
 tools/filters/extract_GFF_Features.xml) can handle much of the functions 
 where code files are used for population parameters. Additionally, tool 
 output actions (see e.g. tools/filters/cutWrapper.xml) can be used to set 
 metadata ( e.g. column assignments ) on tool outputs without using hooks. 
 Please let us know if we can provide additional information.
 
 
 Thanks for using Galaxy,
 
 Dan
 
 
 On Sep 30, 2011, at 1:15 PM, Paniagua, Eric wrote:
 
 Hi Nate,
 
 Thanks for your answers.  I will look into setting 
 set_metadata_externally=True.
 
 I've observed no impact (suggesting someone did error handling properly), but 
 upload is the only tool I've tested it with so far.  I'll be doing more 
 shortly, but going with the exec_after_process solution since I don't then 
 need to modify core code here.
 
 I'm glad to hear that abstraction layer is in the works.
 
 Regarding the upload and general handling of compressed files, please refer 
 to my response to Brent a short time ago, subject line [galaxy-user] upload 
 zip file to custom tool.  In particular, I pointed out that it doesn't work 
 even with a manually set file type and why, at least in terms of code.  I was 
 hoping you might know if it was that way on purpose and, if so, why.
 
 Regarding deprecation of the code tag and of tool hooks (am I correct to 
 understand that these are deprecated too?), there's been an example at the 
 top of my mind (since I'm going to start working on this today).  In the 
 abstract, the scenario is dynamically populating the tool UI based on a 
 computation on the contents of one or more input datasets.  To an extent, the 
 conditional and page tags are helpful with this, but without the code 
 tag I'm not clear on how to call out to code that inspects the datasets and 
 returns, for the sake of simplicity, options for a single parameter.
 
 My specific example has to do with constructing a microarray expression 
 analysis pipeline, with one of my starting points being Ross Lazarus's 
 Rexpression code.  I may be able to get around the issue by storing the 
 relevant information in metadata (in a custom datatype's set_meta).  Not sure 
 yet.  Of course, that relies on metadata being handled correctly.  If I run 
 into something more specific, I'll start a new thread.
 
 Thanks,
 Eric
 
 
 From: Nate Coraor [n...@bx.psu.edu]
 Sent: Friday, September 30, 2011 11:03 AM
 To: Paniagua, Eric
 Cc: galaxy-dev@lists.bx.psu.edumailto:galaxy-dev@lists.bx.psu.edu
 Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
 
 Paniagua, Eric wrote:
 Hi Nate,
 
 Thank you for your response!  I am glad that it was you in particular who did 
 respond, because I also have some questions about the way the upload tool 
 handles compressed files and saw that you have opened several Issues related 
 to this on the Galaxy bitbucket site.  First though, I'll fill you in on my 
 further progress

Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

2011-10-03 Thread Daniel Blankenberg
Hi Eric,

Documentation for tool output actions is limited but available from 
http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax#A.3Cactions.3E_tag_set
 and the code is located at lib/galaxy/tools/parameters/output.py. The 
LiftOver, Cut, BWA, fimo, tophat, ccat, srma, mosaik, bfast, sam to bam, and 
Bowtie tools also currently use them. 


Thanks for using Galaxy,

Dan


On Oct 3, 2011, at 2:51 PM, Paniagua, Eric wrote:

 Hi Dan,
 
 Regarding the tool output actions, could you point me to any documentation or 
 additional good examples for handling tool output post-processing?
 
 Thanks,
 Eric
 
 
 From: Daniel Blankenberg [d...@bx.psu.edu]
 Sent: Monday, October 03, 2011 2:00 PM
 To: Paniagua, Eric
 Cc: Nate Coraor; galaxy-dev@lists.bx.psu.edu
 Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
 
 Hi Eric,
 
 Can you provide an example of the composite datatype that is experiencing the 
 metadata problems?
 
 The code and tool hooks are deprecated, but there could still be some 
 instances where they may need to be used.  However, dynamic options (see eg 
 options from_dataset=input1  from 
 tools/filters/extract_GFF_Features.xml) can handle much of the functions 
 where code files are used for population parameters. Additionally, tool 
 output actions (see e.g. tools/filters/cutWrapper.xml) can be used to set 
 metadata ( e.g. column assignments ) on tool outputs without using hooks. 
 Please let us know if we can provide additional information.
 
 
 Thanks for using Galaxy,
 
 Dan
 
 
 On Sep 30, 2011, at 1:15 PM, Paniagua, Eric wrote:
 
 Hi Nate,
 
 Thanks for your answers.  I will look into setting 
 set_metadata_externally=True.
 
 I've observed no impact (suggesting someone did error handling properly), but 
 upload is the only tool I've tested it with so far.  I'll be doing more 
 shortly, but going with the exec_after_process solution since I don't then 
 need to modify core code here.
 
 I'm glad to hear that abstraction layer is in the works.
 
 Regarding the upload and general handling of compressed files, please refer 
 to my response to Brent a short time ago, subject line [galaxy-user] upload 
 zip file to custom tool.  In particular, I pointed out that it doesn't work 
 even with a manually set file type and why, at least in terms of code.  I was 
 hoping you might know if it was that way on purpose and, if so, why.
 
 Regarding deprecation of the code tag and of tool hooks (am I correct to 
 understand that these are deprecated too?), there's been an example at the 
 top of my mind (since I'm going to start working on this today).  In the 
 abstract, the scenario is dynamically populating the tool UI based on a 
 computation on the contents of one or more input datasets.  To an extent, the 
 conditional and page tags are helpful with this, but without the code 
 tag I'm not clear on how to call out to code that inspects the datasets and 
 returns, for the sake of simplicity, options for a single parameter.
 
 My specific example has to do with constructing a microarray expression 
 analysis pipeline, with one of my starting points being Ross Lazarus's 
 Rexpression code.  I may be able to get around the issue by storing the 
 relevant information in metadata (in a custom datatype's set_meta).  Not sure 
 yet.  Of course, that relies on metadata being handled correctly.  If I run 
 into something more specific, I'll start a new thread.
 
 Thanks,
 Eric
 
 
 From: Nate Coraor [n...@bx.psu.edu]
 Sent: Friday, September 30, 2011 11:03 AM
 To: Paniagua, Eric
 Cc: galaxy-dev@lists.bx.psu.edumailto:galaxy-dev@lists.bx.psu.edu
 Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
 
 Paniagua, Eric wrote:
 Hi Nate,
 
 Thank you for your response!  I am glad that it was you in particular who did 
 respond, because I also have some questions about the way the upload tool 
 handles compressed files and saw that you have opened several Issues related 
 to this on the Galaxy bitbucket site.  First though, I'll fill you in on my 
 further progress on the composite file issue.
 
 As I mentioned in my original email, the trouble is that JobWrapper.finish() 
 calls dataset.set_meta() before it calls collect_associated_files(), 
 resulting dataset.extra_files_path being inaccurate because the files haven't 
 been moved yet from the job working directory.  This is all with 
 set_metadata_externally=False.  (I haven't worked with setting metadata 
 externally yet, but I think it is worth verifying whether everything works 
 correctly for the case I pointed out when set_metadata_externally=True.)
 
 Since my last email, I poked around a bit more and found that my suggested 
 short patch was not correct but incomplete.  The core problem is that 
 component files are not moved with the primary file, so I changed that (patch 
 attached, relative to { https://bitbucket.org/galaxy/galaxy

Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

2011-09-30 Thread Nate Coraor
 tool, but I'll be less inclined to do so if that will make it 
 diverge si!
 gnificantly from Galaxy's expected design path.

We definitely need to make it easier to upload compressed files that
intentionally stay compressed.  This should be the case when setting the
specific file type, though.

 Finally, I noted that the code tag is deprecated.  I learned this from the 
 Tool Config Syntax page on the Wiki 
 (http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax#A.3Ccode.3E_tag_set).
   It says there that it and the tool hooks (like exec_after_process) are 
 being replaced, but doesn't give many details.  Is there anything you can 
 tell me about what syntax or mechanism is planned to replace them?  The 
 ability to generate UI parameters, have the tool page itself interact with 
 other data sources and programs, use hooks to control tool execution, and 
 generally interleave arbitrary code with tool/job runs is very useful.  
 Unfortunately, the tools page appears to be out of date or just not remotely 
 comprehensive; I've found many examples of syntax in other tools' 
 configurations that are simply not documented on the Tool Config Syntax page 
 at all.

It's true that code is deprecated, although I can't provide much
useful information beyond that.  If you post up what you're using hooks
for one of my colleagues with more familiarity with tool configs can
probably reply with the best way to accomplish what you're doing without
hooks.

--nate

 
 Please do let me know if you're able to doing anything (with or without my 
 suggestions) about the issue I ran into with the way Galaxy handles composite 
 datasets on the file system.  Or anything more generally about composite 
 dataset design intentions, upload functionality, and known issues.  I'd 
 greatly appreciate it.  Thanks for answering my email!
 
 
 Thanks,
 Eric
 
 
 PS: I'm also unable to download composite datasets that I successfully 
 uploaded; instead I get an error page saying the URL path doesn't exist.  
 However, I haven't dug into this one enough to think it's not a local 
 configuration problem on my end.  Do you recognize this as (related to) a 
 known issue?
 
 
 From: Nate Coraor [n...@bx.psu.edu]
 Sent: Wednesday, September 28, 2011 9:24 AM
 To: Paniagua, Eric
 Cc: galaxy-dev@lists.bx.psu.edu
 Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
 
 Paniagua, Eric wrote:
  Hi all,
 
  Can anyone tell me why JobWrapper.finish() moves the primary dataset file 
  dataset_path.false_path to dataset_path.real_path (contingent on 
  config.outputs_to_working_directory == True) but does not move the extra 
  files?  (lib/galaxy/jobs/__init__.py:540-553)  It seems to me that if you 
  want to move a dataset, you want to move the whole dataset, and that 
  perhaps this should be factored out, perhaps into the galaxy.util module?
 
  Why does class DatasetPath only account for the path to the primary file 
  and not the path to the extra files?  It could be used to account for the 
  extra files by path splitting as in my previous suggested bug fix, but 
  only if that fix is correct.  It doesn't seem to be used for that purpose 
  in the Galaxy code.
 
  I look forward to an informative response.
 
 Hi Eric,
 
 Sorry for the delay in responding, and thanks for your very detailed
 digging into this problem.  To answer your above question, DatasetPath
 only deals with the primary dataset because extra files are already
 written to the working directory and moved back by the wrapper with
 collect_associated_files().
 
 Making it possible to read these extra files from the working directory
 will be necessary when set_metadata_externally = True in the config,
 although I am surprised this has been broken the whole time.  Have you
 made progress past your last email?  I can pick up from wherever you've
 left off.
 
 
  Thanks,
  Eric Paniagua
 
  
  From: galaxy-dev-boun...@lists.bx.psu.edu 
  [galaxy-dev-boun...@lists.bx.psu.edu] on behalf of Paniagua, Eric 
  [epani...@cshl.edu]
  Sent: Monday, September 12, 2011 7:37 PM
  To: galaxy-dev@lists.bx.psu.edu
  Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
 
  Hello again,
 
  It looks like the config.outputs_to_working_directory variable is intended 
  to do something closely related, but setting it to either of True and False 
  does not in fact fix the problem.
 
  The output path for files in a composite dataset upload 
  (dataset.files_path) that is used in the tools/data_source/upload.xml tool 
  is set to a path under the job working directory by 
  lib/galaxy/tools/__init__.py:1519.  The preceding code (lines 1507-1516) 
  select the path for the primary file contingent on 
  config.outputs_to_working_directory.
 
  Why is the path set in line 1519 not contingent on 
  config.outputs_to_working_directory?  Indeed, the following small change 
  fixes the bug

Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

2011-09-28 Thread Nate Coraor
Paniagua, Eric wrote:
 Hi all,
 
 Can anyone tell me why JobWrapper.finish() moves the primary dataset file 
 dataset_path.false_path to dataset_path.real_path (contingent on 
 config.outputs_to_working_directory == True) but does not move the extra 
 files?  (lib/galaxy/jobs/__init__.py:540-553)  It seems to me that if you 
 want to move a dataset, you want to move the whole dataset, and that perhaps 
 this should be factored out, perhaps into the galaxy.util module?
 
 Why does class DatasetPath only account for the path to the primary file and 
 not the path to the extra files?  It could be used to account for the 
 extra files by path splitting as in my previous suggested bug fix, but only 
 if that fix is correct.  It doesn't seem to be used for that purpose in the 
 Galaxy code.
 
 I look forward to an informative response.

Hi Eric,

Sorry for the delay in responding, and thanks for your very detailed
digging into this problem.  To answer your above question, DatasetPath
only deals with the primary dataset because extra files are already
written to the working directory and moved back by the wrapper with
collect_associated_files().

Making it possible to read these extra files from the working directory
will be necessary when set_metadata_externally = True in the config,
although I am surprised this has been broken the whole time.  Have you
made progress past your last email?  I can pick up from wherever you've
left off.

 
 Thanks,
 Eric Paniagua
 
 
 From: galaxy-dev-boun...@lists.bx.psu.edu 
 [galaxy-dev-boun...@lists.bx.psu.edu] on behalf of Paniagua, Eric 
 [epani...@cshl.edu]
 Sent: Monday, September 12, 2011 7:37 PM
 To: galaxy-dev@lists.bx.psu.edu
 Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
 
 Hello again,
 
 It looks like the config.outputs_to_working_directory variable is intended to 
 do something closely related, but setting it to either of True and False does 
 not in fact fix the problem.
 
 The output path for files in a composite dataset upload (dataset.files_path) 
 that is used in the tools/data_source/upload.xml tool is set to a path under 
 the job working directory by lib/galaxy/tools/__init__.py:1519.  The 
 preceding code (lines 1507-1516) select the path for the primary file 
 contingent on config.outputs_to_working_directory.
 
 Why is the path set in line 1519 not contingent on 
 config.outputs_to_working_directory?  Indeed, the following small change 
 fixes the bug I'm observing:
 
 diff -r 949e4f5fa03a lib/galaxy/tools/__init__.py
 --- a/lib/galaxy/tools/__init__.py  Mon Aug 29 14:42:04 2011 -0400
 +++ b/lib/galaxy/tools/__init__.py  Mon Sep 12 19:32:26 2011 -0400
 @@ -1516,7 +1516,9 @@
  param_dict[name] = DatasetFilenameWrapper( hda )
  # Provide access to a path to store additional files
  # TODO: path munging for cluster/dataset server relocatability
 -param_dict[name].files_path = os.path.abspath(os.path.join( 
 job_working_directory, dataset_%s_files % (hda.dataset.id) ))
 +#param_dict[name].files_path = os.path.abspath(os.path.join( 
 job_working_directory, dataset_%s_files % (hda.dataset.id) ))
 +# This version should make it always follow the primary file
 +param_dict[name].files_path = os.path.abspath( os.path.join( 
 os.path.split( param_dict[name].file_name )[0], dataset_%s_files % 
 (hda.dataset.id) ))
  for child in hda.children:
  param_dict[ _CHILD___%s___%s % ( name, child.designation ) 
 ] = DatasetFilenameWrapper( child )
  for out_name, output in self.outputs.iteritems():
 
 Would this break anything?
 
 If that cannot be changed, would the best solution be to modify the upload 
 tool so that it took care of this on its own?  That seems readily doable, but 
 starts to decentralize control of data flow policy.
 
 Please advise.
 
 Thanks,
 Eric Paniagua
 
 From: galaxy-dev-boun...@lists.bx.psu.edu 
 [galaxy-dev-boun...@lists.bx.psu.edu] on behalf of Paniagua, Eric 
 [epani...@cshl.edu]
 Sent: Monday, September 12, 2011 1:45 PM
 To: galaxy-dev@lists.bx.psu.edu
 Subject: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata
 
 Hi everyone,
 
 I've been getting my feet wet with Galaxy development working to get some of 
 the rexpression tools online, and I've run into a snag that I've traced back 
 to a set_meta datatype method not being able to find a file from which it 
 wants to extract metadata.  After reading the code, I believe this would also 
 be a problem for non-composite datatypes.
 
 The specific test case I've been looking at is uploading an affybatch file 
 (and associated pheno file) using Galaxy's built-in upload tool and selecting 
 the File Format manually (ie choosing affybatch in the dropdown).  I am 
 using unmodified datatype definitions provided in 
 lib/galaxy/datatypes/genetics.py

Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

2011-09-13 Thread Paniagua, Eric
Hi all,

Can anyone tell me why JobWrapper.finish() moves the primary dataset file 
dataset_path.false_path to dataset_path.real_path (contingent on 
config.outputs_to_working_directory == True) but does not move the extra 
files?  (lib/galaxy/jobs/__init__.py:540-553)  It seems to me that if you want 
to move a dataset, you want to move the whole dataset, and that perhaps this 
should be factored out, perhaps into the galaxy.util module?

Why does class DatasetPath only account for the path to the primary file and 
not the path to the extra files?  It could be used to account for the extra 
files by path splitting as in my previous suggested bug fix, but only if that 
fix is correct.  It doesn't seem to be used for that purpose in the Galaxy code.

I look forward to an informative response.

Thanks,
Eric Paniagua


From: galaxy-dev-boun...@lists.bx.psu.edu [galaxy-dev-boun...@lists.bx.psu.edu] 
on behalf of Paniagua, Eric [epani...@cshl.edu]
Sent: Monday, September 12, 2011 7:37 PM
To: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Hello again,

It looks like the config.outputs_to_working_directory variable is intended to 
do something closely related, but setting it to either of True and False does 
not in fact fix the problem.

The output path for files in a composite dataset upload (dataset.files_path) 
that is used in the tools/data_source/upload.xml tool is set to a path under 
the job working directory by lib/galaxy/tools/__init__.py:1519.  The preceding 
code (lines 1507-1516) select the path for the primary file contingent on 
config.outputs_to_working_directory.

Why is the path set in line 1519 not contingent on 
config.outputs_to_working_directory?  Indeed, the following small change fixes 
the bug I'm observing:

diff -r 949e4f5fa03a lib/galaxy/tools/__init__.py
--- a/lib/galaxy/tools/__init__.py  Mon Aug 29 14:42:04 2011 -0400
+++ b/lib/galaxy/tools/__init__.py  Mon Sep 12 19:32:26 2011 -0400
@@ -1516,7 +1516,9 @@
 param_dict[name] = DatasetFilenameWrapper( hda )
 # Provide access to a path to store additional files
 # TODO: path munging for cluster/dataset server relocatability
-param_dict[name].files_path = os.path.abspath(os.path.join( 
job_working_directory, dataset_%s_files % (hda.dataset.id) ))
+#param_dict[name].files_path = os.path.abspath(os.path.join( 
job_working_directory, dataset_%s_files % (hda.dataset.id) ))
+# This version should make it always follow the primary file
+param_dict[name].files_path = os.path.abspath( os.path.join( 
os.path.split( param_dict[name].file_name )[0], dataset_%s_files % 
(hda.dataset.id) ))
 for child in hda.children:
 param_dict[ _CHILD___%s___%s % ( name, child.designation ) ] 
= DatasetFilenameWrapper( child )
 for out_name, output in self.outputs.iteritems():

Would this break anything?

If that cannot be changed, would the best solution be to modify the upload tool 
so that it took care of this on its own?  That seems readily doable, but starts 
to decentralize control of data flow policy.

Please advise.

Thanks,
Eric Paniagua

From: galaxy-dev-boun...@lists.bx.psu.edu [galaxy-dev-boun...@lists.bx.psu.edu] 
on behalf of Paniagua, Eric [epani...@cshl.edu]
Sent: Monday, September 12, 2011 1:45 PM
To: galaxy-dev@lists.bx.psu.edu
Subject: [galaxy-dev] (Composite) Dataset Upload not Setting Metadata

Hi everyone,

I've been getting my feet wet with Galaxy development working to get some of 
the rexpression tools online, and I've run into a snag that I've traced back to 
a set_meta datatype method not being able to find a file from which it wants to 
extract metadata.  After reading the code, I believe this would also be a 
problem for non-composite datatypes.

The specific test case I've been looking at is uploading an affybatch file (and 
associated pheno file) using Galaxy's built-in upload tool and selecting the 
File Format manually (ie choosing affybatch in the dropdown).  I am using 
unmodified datatype definitions provided in lib/galaxy/datatypes/genetics.py 
and unmodified core Galaxy upload code as of 5955:949e4f5fa03a.  (I am also 
testing with modified versions, but I am able to reproduce and track this bug 
in the specified clean version).

The crux of the cause of error is that in JobWrapper.finish(), 
dataset.set_meta() is called (lib/galaxy/jobs/__init__.py:607) before the 
composite dataset uploaded files are moved (in a call to a Tool method 
self.tool.collect_associated_files(out_data, self.working_directory) on line 
670) from the job working directory to the final destination under 
config.file_path (which defaults to database/files).

In my test case, database.set_meta( overwrite = False ) eventually calls 
lib/galaxy/datatypes/genetics.py:Rexp.set_meta(dataset

[galaxy-dev] (Composite) Dataset Upload not Setting Metadata

2011-09-12 Thread Paniagua, Eric
Hi everyone,

I've been getting my feet wet with Galaxy development working to get some of 
the rexpression tools online, and I've run into a snag that I've traced back to 
a set_meta datatype method not being able to find a file from which it wants to 
extract metadata.  After reading the code, I believe this would also be a 
problem for non-composite datatypes.

The specific test case I've been looking at is uploading an affybatch file (and 
associated pheno file) using Galaxy's built-in upload tool and selecting the 
File Format manually (ie choosing affybatch in the dropdown).  I am using 
unmodified datatype definitions provided in lib/galaxy/datatypes/genetics.py 
and unmodified core Galaxy upload code as of 5955:949e4f5fa03a.  (I am also 
testing with modified versions, but I am able to reproduce and track this bug 
in the specified clean version).

The crux of the cause of error is that in JobWrapper.finish(), 
dataset.set_meta() is called (lib/galaxy/jobs/__init__.py:607) before the 
composite dataset uploaded files are moved (in a call to a Tool method 
self.tool.collect_associated_files(out_data, self.working_directory) on line 
670) from the job working directory to the final destination under 
config.file_path (which defaults to database/files).

In my test case, database.set_meta( overwrite = False ) eventually calls 
lib/galaxy/datatypes/genetics.py:Rexp.set_meta(dataset, **kwd).  As far as I 
can tell, the only ways to construct a path to a file (or the file) in a 
dataset without using hard-coded paths from external knowledge is to use the 
Dataset.get_file_name or Dataset.extra_files_path properties.  Unless 
explicitly told otherwise, both of these methods construct a path based on the 
Dataset.file_path class data member, whose value is set during Galaxy startup 
to config.file_path (default database/files).  However, at the time set_meta 
is called in this case, the files are not under config.file_path, but rather 
under the job working directory.  Attempting to open files from the dataset 
therefore fails when using these paths.  However, unless the job working 
directory is passed to set_meta or during construction of the underlying 
Dataset object, there doesn't appear to be a way for a Dataset method to access 
th!
 e currently running job (for instance to get its job ID or working directory). 
 (The second suggestion is actually not possible; since the standard upload is 
asynchronous, the Dataset object is created (and persisted) before the Job that 
will process it is created.)

Thoughts?  This issue affects Rexp.set_peek also, as well as any other 
functions that may want to read data from the uploaded files before they are 
moved to permanent location.  This is why if you have an affybatch file and its 
associated pheno file and you test this on, say, the public Galaxy server at 
http://main.g2.bx.psu.edu/ you'll see that the peek info says (for example): 
##failed to find 
/galaxy/main_database/files/002/948/dataset_2948818_files/affybatch_test.pheno

It seems that if the current way that Dataset.file_path, Dataset.file_name, and 
Dataset.extra_files_path is part of the desired design of Galaxy, that methods 
like set_meta should be run after the files have been moved to config.file_path 
so they can set metadata based on file contents.  It looks like this is 
intended to happen at least in some cases, from looking at 
lib/galaxy/jobs/__init__.py:568-586.  However, in my tests this code is not 
kicking in because hda_tool_output is None.

Any clarification on what's happening here, what's supposed to be happening for 
setting metadata on (potentially composite) uploads, why dataset.set_meta() 
isn't already being called after the files are moved to config.file_path, or 
any insights on related Galaxy design decisions I may not know about or design 
constraints I may have missed would be very greatly appreciated.

I'd also be glad to provide further detail or test files upon request.

Thank you,
Eric Paniagua

PS: Further notes on passing the job working directory to set_meta or set_peek 
- I have been successful modifying the code to do this for set_meta since the 
call chain starting from dataset.set_meta() in JobWrapper.finish() to 
Rexp.set_meta() accepts and forwards keyword argument dictionaries along the 
way.  However, set_peek does not accept arbitrary keyword arguments, making it 
harder to pass along the job working directory when needed without stepping on 
the toes of any other code.

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/