Hello All,

  Thanks for the well laid out e-mails and great discussion. I think
John-Paul's comment about the code growing up organically is probably
exactly right. (A link below has some details from Nate about this).

  So late last night I opened a sprawling pull request that cleaned up
a lot of stuff in upload.py and then realized it was a little bit
incorrect and wouldn't actually help any of you until you were able to
upgrade to the August 2014 release :) so I declined. I have reworked a
relatively small patch to fix the immediate problem of the consistency
tmpfile consistency as it relates to shutil.move. John-Paul Robinson,
can you apply it directly and see if it fixes your problems?


  If it does, the devteam will get this merged and I will continue
with the upload.py improvements that were inspired by this discussion
(see https://bitbucket.org/galaxy/galaxy-central/pull-request/408 for
more details).


On Mon, Jun 9, 2014 at 2:50 PM, John-Paul Robinson <j...@uab.edu> wrote:
> We've considered the sudo solution, but it opens the window to other
> bugs giving galaxy the power to change ownership of other files in our
> shared user cluster environment.  We could isolate the power to a script
> but then we still need to monitor this code closely.  We'd prefer not to
> introduce that requirement.
> I didn't have the time to trace this down either. ;)  I just got tired
> of this issue and the inconsistent failures causing confusion in our
> community.
> I hope your insight into the logic drift over time is accurate and can
> be corrected.  The upload code looks like it's gone through a whole lot
> of organic growth. :/
> Looking forward to additional comments from the dev team.
> ~jpr
> On 06/09/2014 03:30 PM, Kandalaft, Iyad wrote:
>> Hi JPR,
>> I had the same questions while trying to figure out a fool-proof way to 
>> allow users to import files into galaxy on our Cluster.  I couldn't exactly 
>> figure out, nor did I have the time to really review, why the galaxy code 
>> did these steps and why that shutil.move failed.  I opted to simply insert 
>> code in upload.py to sudo chown/chmod the files as an "easier" "hack" to 
>> this problem.  There are pros and cons to using the tmp var from the env, 
>> and it will depend on your intentions/infrastructure.  I think the ideology 
>> was that the Galaxy folder was supposed to be shared across all nodes in a 
>> cluster, and they opted to use the TMP path within the galaxy folder.  
>> Overtime, the code probably partially diverged from that notion, which 
>> caused this dilemma.
>> I believe that the best fix is to make the underlying code simply copy the 
>> files into the environment-provided temp, which is configurable in galaxy's 
>> universe_wsgi.ini, and assume ownership from the get-go.  This code of 
>> copying and/or moving in discrete steps creates unnecessary complexity.
>> Regards,
>> Iyad
>> -----Original Message-----
>> From: galaxy-dev-boun...@lists.bx.psu.edu 
>> [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of John-Paul Robinson
>> Sent: Monday, June 09, 2014 3:08 PM
>> To: galaxy-dev@lists.bx.psu.edu
>> Subject: [galaxy-dev] inconsistent use of tempfile.mkstemp during upload 
>> causes problems
>> There appears to be some inconsistent use of tempfile.mkstemp() within 
>> upload.py that causes problems when users import data files to galaxy from a 
>> cluster directory via the upload process and import/temp/dataset directories 
>> are on different file systems.
>> The issue manifests when Galaxy's job directory, dataset directory and 
>> import directory are on different file systems (common for cluster
>> environments) in conjunction with a configuration where users can copy their 
>> data files directly to the import directory from which Galaxy selects data 
>> sets to upload (as opposed to using an FTP gateway).
>> While allowing users to copy files to an import directory rather than using 
>> the FTP gateway may not be that common, we use this configuration locally to 
>> help build a more seamless interface with our local collection of HPC 
>> resources.  Users can be logged into their cluster account and move data 
>> into galaxy with a file copy command rather than having to use FTP.
>> This configuration has worked well in our environment as long as the correct 
>> ownership configuration existed on the import directory and as long as the 
>> import directory, job temporary directory, and galaxy data set directory 
>> were all on the same file system.
>> We now have our galaxy dataset directory on a different file system and are 
>> seeing inconsistent behavior during the upload.py runs depending on if the 
>> data is ordinary text, BAM files, or gzipped data.
>> A subset of uploads will fail because of the way temporary files are created 
>> by Galaxy to facilitate the import and any associated conversion processes 
>> of different file types.
>> During the import,
>> 1) Galaxy will copy the original file to a temporary target file (converting 
>> as needed during the copy).
>> 2) Once this first conversion step is complete, Galaxy then attempts to move 
>> the temporary file back to the original location, ie. the import directory.
>> 3) If this move is succeeds, Galaxy completes the upload processing and the 
>> data becomes a registered data set in the user's dataset collection.
>> Galaxy prefers the Python shutil.move method to move tempfile . This results 
>> in a simple os.rename if the files remain on the same file system.  However, 
>> if os.rename raises OSError because a move was attempted across a file 
>> system boundary, shutil.move resorts to a copy2, which copies the data to 
>> the original import file and then tries to copy the file attributes 
>> (permissions and utimes) to the original import file from the source file 
>> (which will be the temporary file Galaxy created in step 1 to begin the 
>> conversion process).
>> The os.rename and shutil.copy2 behave (and fail) differently depending on 
>> the file ownership of the original import file.  The os.rename will succeed 
>> even if the Galaxy upload.py job process only maps to the group-owner of the 
>> original import file (which can be ensured with group sticky bit on the 
>> import dir or ACLs). The shutil.copy2 command, however, will fail if the 
>> Galaxy upload.py job process UID is not the user-owner of the original 
>> import file.
>> We could ensure the os.rename succeeds by keeping the job temporary 
>> directory and the import directory on the same file system.  However, it 
>> seems the temporary directories used by upload.py are inconsistent across 
>> data types which prevents this simple fix from working for all data types.
>> When text files are imported, upload calls the sniff.* methods to perform 
>> conversion.  These methods use a bare call to tempfile.mkstemp() which 
>> ensures the file is created in the directory specified by the env var 
>> $TMPDIR. For example in sniff.convert_newlines:
>>     fd, temp_name = tempfile.mkstemp()
>> https://bitbucket.org/galaxy/galaxy-central/src/5884328e91724e9bdf4b43f012eb63fa0c803ef6/lib/galaxy/datatypes/sniff.py?at=default#cl-105
>> However, for compressed files, the upload.py script directly creates temp 
>> files but here it specifies the target directory as the same as the data set 
>> directory:
>>     fd, uncompressed = tempfile.mkstemp( prefix='data_id_%s_upload_gunzip_' 
>> % dataset.dataset_id, dir=os.path.dirname( output_path ), text=False )
>> https://bitbucket.org/galaxy/galaxy-central/src/5884328e91724e9bdf4b43f012eb63fa0c803ef6/tools/data_source/upload.py?at=default#cl-130
>> It's not clear if there is any significance to using the data set directory 
>> as the tempdir for compressed files versus the job temporary directory for 
>> other data files.
>> It seems like all temporary files created by upload.py should be 
>> consistently created in the same temporary location, and preferably in the 
>> job temp directory.
>> Is there a reason that these file types use different temporary file 
>> locations?
>> If they used the same tempfile location, we could use one consistent system 
>> configuration and ensure all our data files can be imported even when the 
>> import+tempdir are not on the same file system as the Galaxy dataset dir.  
>> It seems reasonable that all tempfile.mkstemp() calls should be unadorned 
>> and inherit the temp directory location from their environment.
>> A more comprehensive solution that would correct the inconsistency in 
>> failures between os.rename and shutil.copy2 and also remove any constraint 
>> for Galaxy to have it's import, temp, and data set directories on the same 
>> file system, would be to simply delete the original import file before 
>> attempting the shutil.move.  This would ensure the file that the upload.py 
>> job attempts to create in step 2 is new and created with full Galaxy process 
>> ownership.
>> Finally, it seems odd that Galaxy attempts to reuse the users original 
>> import file in the first place.  It seems that once galaxy begins processing 
>> the content of the to-be-imported file, it should not ever write back to 
>> that file.  What's the motivation here?
>> I'll be interested to learn more about the motivations of these different 
>> tempfile conventions and if this can be fixed in the upstream.
>> Thanks,
>> ~jpr
>> ___________________________________________________________
>> Please keep all replies on the list by using "reply all"
>> in your mail client.  To manage your subscriptions to this and other Galaxy 
>> lists, please use the interface at:
>>   http://lists.bx.psu.edu/
>> To search Galaxy mailing lists use the unified search at:
>>   http://galaxyproject.org/search/mailinglists/
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   http://lists.bx.psu.edu/
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/

Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

To search Galaxy mailing lists use the unified search at:

Reply via email to