Iyad,
Thanks for chiming in. We looked at the patch you had posted, but we run on a
cluster galaxy shares with other apps, so the galaxy user does not have sudo
outside of the galaxy head-node.
The upload.py code you posted, therefore does not work for us.
Therefore, we're also looking at a chown hook in
lib/galaxy/tools/actions/upload_common.py, but find that code a little more
inscrutable ( see earlier thread with you on this subject )
http://dev.list.galaxyproject.org/Uploading-files-to-galaxy-from-a-folder-tt4664614.html#a4664669
However, both chown hooks seem too heavy handed, and like a workaround or
band-aid on an underlying problem. I hate to use elevated privileges (with the
security risks that brings) if we don't really have to. That being said, we
need to change something in the short terms so things work, be it the
filesystem layout, the upload.py workflow, or a chown in upload_common.py.
Regards,
Curtis
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Kandalaft, Iyad
Sent: Monday, June 09, 2014 2:30 PM
To: John-Paul Robinson (Campus); [email protected]
Subject: Re: [galaxy-dev] inconsistent use of tempfile.mkstemp during upload
causes problems
Hi JPR,
I had the same questions while trying to figure out a fool-proof way to allow
users to import files into galaxy on our Cluster. I couldn't exactly figure
out, nor did I have the time to really review, why the galaxy code did these
steps and why that shutil.move failed. I opted to simply insert code in
upload.py to sudo chown/chmod the files as an "easier" "hack" to this problem.
There are pros and cons to using the tmp var from the env, and it will depend
on your intentions/infrastructure. I think the ideology was that the Galaxy
folder was supposed to be shared across all nodes in a cluster, and they opted
to use the TMP path within the galaxy folder. Overtime, the code probably
partially diverged from that notion, which caused this dilemma.
I believe that the best fix is to make the underlying code simply copy the
files into the environment-provided temp, which is configurable in galaxy's
universe_wsgi.ini, and assume ownership from the get-go. This code of copying
and/or moving in discrete steps creates unnecessary complexity.
Regards,
Iyad
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of John-Paul Robinson
Sent: Monday, June 09, 2014 3:08 PM
To: [email protected]
Subject: [galaxy-dev] inconsistent use of tempfile.mkstemp during upload causes
problems
There appears to be some inconsistent use of tempfile.mkstemp() within
upload.py that causes problems when users import data files to galaxy from a
cluster directory via the upload process and import/temp/dataset directories
are on different file systems.
The issue manifests when Galaxy's job directory, dataset directory and import
directory are on different file systems (common for cluster
environments) in conjunction with a configuration where users can copy their
data files directly to the import directory from which Galaxy selects data sets
to upload (as opposed to using an FTP gateway).
While allowing users to copy files to an import directory rather than using the
FTP gateway may not be that common, we use this configuration locally to help
build a more seamless interface with our local collection of HPC resources.
Users can be logged into their cluster account and move data into galaxy with a
file copy command rather than having to use FTP.
This configuration has worked well in our environment as long as the correct
ownership configuration existed on the import directory and as long as the
import directory, job temporary directory, and galaxy data set directory were
all on the same file system.
We now have our galaxy dataset directory on a different file system and are
seeing inconsistent behavior during the upload.py runs depending on if the data
is ordinary text, BAM files, or gzipped data.
A subset of uploads will fail because of the way temporary files are created by
Galaxy to facilitate the import and any associated conversion processes of
different file types.
During the import,
1) Galaxy will copy the original file to a temporary target file (converting as
needed during the copy).
2) Once this first conversion step is complete, Galaxy then attempts to move
the temporary file back to the original location, ie. the import directory.
3) If this move is succeeds, Galaxy completes the upload processing and the
data becomes a registered data set in the user's dataset collection.
Galaxy prefers the Python shutil.move method to move tempfile . This results in
a simple os.rename if the files remain on the same file system. However, if
os.rename raises OSError because a move was attempted across a file system
boundary, shutil.move resorts to a copy2, which copies the data to the original
import file and then tries to copy the file attributes (permissions and utimes)
to the original import file from the source file (which will be the temporary
file Galaxy created in step 1 to begin the conversion process).
The os.rename and shutil.copy2 behave (and fail) differently depending on the
file ownership of the original import file. The os.rename will succeed even if
the Galaxy upload.py job process only maps to the group-owner of the original
import file (which can be ensured with group sticky bit on the import dir or
ACLs). The shutil.copy2 command, however, will fail if the Galaxy upload.py job
process UID is not the user-owner of the original import file.
We could ensure the os.rename succeeds by keeping the job temporary directory
and the import directory on the same file system. However, it seems the
temporary directories used by upload.py are inconsistent across data types
which prevents this simple fix from working for all data types.
When text files are imported, upload calls the sniff.* methods to perform
conversion. These methods use a bare call to tempfile.mkstemp() which ensures
the file is created in the directory specified by the env var $TMPDIR. For
example in sniff.convert_newlines:
fd, temp_name = tempfile.mkstemp()
https://bitbucket.org/galaxy/galaxy-central/src/5884328e91724e9bdf4b43f012eb63fa0c803ef6/lib/galaxy/datatypes/sniff.py?at=default#cl-105
However, for compressed files, the upload.py script directly creates temp files
but here it specifies the target directory as the same as the data set
directory:
fd, uncompressed = tempfile.mkstemp( prefix='data_id_%s_upload_gunzip_' %
dataset.dataset_id, dir=os.path.dirname( output_path ), text=False )
https://bitbucket.org/galaxy/galaxy-central/src/5884328e91724e9bdf4b43f012eb63fa0c803ef6/tools/data_source/upload.py?at=default#cl-130
It's not clear if there is any significance to using the data set directory as
the tempdir for compressed files versus the job temporary directory for other
data files.
It seems like all temporary files created by upload.py should be consistently
created in the same temporary location, and preferably in the job temp
directory.
Is there a reason that these file types use different temporary file locations?
If they used the same tempfile location, we could use one consistent system
configuration and ensure all our data files can be imported even when the
import+tempdir are not on the same file system as the Galaxy dataset dir. It
seems reasonable that all tempfile.mkstemp() calls should be unadorned and
inherit the temp directory location from their environment.
A more comprehensive solution that would correct the inconsistency in failures
between os.rename and shutil.copy2 and also remove any constraint for Galaxy to
have it's import, temp, and data set directories on the same file system, would
be to simply delete the original import file before attempting the shutil.move.
This would ensure the file that the upload.py job attempts to create in step 2
is new and created with full Galaxy process ownership.
Finally, it seems odd that Galaxy attempts to reuse the users original import
file in the first place. It seems that once galaxy begins processing the
content of the to-be-imported file, it should not ever write back to that file.
What's the motivation here?
I'll be interested to learn more about the motivations of these different
tempfile conventions and if this can be fixed in the upstream.
Thanks,
~jpr
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client. To manage your subscriptions to this and other Galaxy
lists, please use the interface at:
http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at:
http://galaxyproject.org/search/mailinglists/
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client. To manage your subscriptions to this and other Galaxy
lists, please use the interface at:
http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at:
http://galaxyproject.org/search/mailinglists/
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at:
http://galaxyproject.org/search/mailinglists/