Glen Beane wrote:
> On Feb 11, 2011, at 9:32 AM, Nate Coraor wrote:
> > Glen Beane wrote:
> >> On Feb 9, 2011, at 9:44 AM, Glen Beane wrote:
> >>> I've been doing some testing with a Galaxy instance running on my laptop
> >>> for some tools we are developing. I am uploading a file into Galaxy from
> >>> a URL to use as test input (~1.5GB tabular). I can download this file to
> >>> my laptop in ~30 seconds with wget, while if I pull from the same URL
> >>> into Galaxy it takes about 30 minutes. I set the file type so Galaxy did
> >>> not have to auto-detect.
> >>> This seems very slow considering it only takes about 30 seconds to get
> >>> the file over the network and write it to disk. What is Galaxy doing that
> >>> makes this file upload so slow? We also tried defining our own datatype
> >>> (data, not tabular, with the thought that maybe Galaxy tried to examine
> >>> tabular files), but it is still very slow. In production our input files
> >>> will grow to be much larger than this (although we'll probably abandon
> >>> tabular for a more compact binary format by then).
> >> So no insight as to why a 1.5GB file takes 60 times as long to load into
> >> galaxy via URL as it takes to download the file from the same URL outside
> >> of Galaxy? I'm assuming it has to do with detecting Metadata, since
> >> changing the file type from our custom tabular type to the galaxy tabular
> >> type causes a set metadata job that takes at least 20 minutes (I didn't
> >> time it). However, I changed our data type from tabular to "data" hoping
> >> Galaxy would just ignore the file contents and it still takes 30 minutes
> >> to load into Galaxy.
> >> We haven't updated to the latest galaxy-dist (it is on our todo list to
> >> synch up), but this seems like it takes much longer than it should and is
> >> a problem with the implementation.
> > Hi Glen,
> > Sorry, I haven't had a chance to address your question yet. The reason
> > is most likely metadata as you have surmised. Do you have:
> > set_metadata_externally = True
> > Set in universe_wsgi.ini?
> I'm not sure. I'll check. What does this setting do?
Python has a limitation when using threads in that it's not true
threading - only one thread can actually be on CPU at a time. Because
detecting metadata can be very CPU-intensive, it has to contend with and
often suffers from (and blocks operation of) other threads in the Galaxy
process.
set_metadata_externally = True moves the operation of detecting metadata
to a separate OS process, meaning it does not contend for the same
resources as Galaxy itself.
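To illustrate the general idea (this is only a rough sketch, not Galaxy's
actual code; the WORKER script and the scan_in_separate_process name are
made up for this example), CPU-heavy scanning can be handed to a child
interpreter, which has its own interpreter lock and so does not compete
with the parent's threads for it:

```python
import subprocess
import sys

# Hypothetical stand-in for a CPU-heavy metadata scan: it counts the lines
# and the widest tab-separated row of the file named in argv[1].
WORKER = r"""
import sys
lines = 0
max_cols = 0
with open(sys.argv[1]) as handle:
    for line in handle:
        lines += 1
        max_cols = max(max_cols, len(line.rstrip('\n').split('\t')))
print(lines, max_cols)
"""


def scan_in_separate_process(path):
    """Run the scan in a child Python process instead of a thread."""
    result = subprocess.run(
        [sys.executable, "-c", WORKER, path],
        capture_output=True,  # collect the child's stdout for parsing
        text=True,
        check=True,           # raise if the child exits with an error
    )
    lines, max_cols = result.stdout.split()
    return int(lines), int(max_cols)
```

The parent only waits on the child's output, so its other threads stay
free to run while the scan is in progress.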
This should yield a performance increase, but I suspect the main cause
of the slowness is trying to detect column types for the entire
1.5GB file. The enhancements in the newest dist release will cause it
to only check the first 100,000 lines.
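As a rough illustration of why capping the number of lines helps (again
a sketch under my own assumptions, not the code in the dist release;
guess_type, sniff_column_types and max_lines are hypothetical names),
column type detection over a bounded prefix of the file might look like:

```python
from itertools import islice

# Wider types win when rows disagree: int < float < str.
RANK = {"int": 0, "float": 1, "str": 2}


def guess_type(value):
    """Return the narrowest type name that can parse this field."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    return "str"


def sniff_column_types(lines, max_lines=100000):
    """Guess per-column types from at most max_lines tab-separated lines."""
    types = []
    for line in islice(lines, max_lines):
        for i, field in enumerate(line.rstrip("\n").split("\t")):
            guessed = guess_type(field)
            if i == len(types):
                types.append(guessed)  # first time this column is seen
            elif RANK[guessed] > RANK[types[i]]:
                types[i] = guessed     # widen the type for this column
    return types
```

Passing an open file handle as lines means the cost depends on max_lines
rather than on the size of the file, at the price of possibly missing a
wider value that only appears later in the file.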
Some metadata elements are also optional, and choosing not to set them
for large files can be configured using 'max_optional_metadata_filesize'
in datatypes_conf.xml. This also requires the latest stable release.
> > Also, there are some recent changes in the newest dist release which
> > limit the number of lines checked for metadata that should make this
> > process much faster.
> Thanks, we'll try to update our test Galaxy instance to the newest dist
> release to see if that helps.
> Glen L. Beane
> Software Engineer
> The Jackson Laboratory
> Phone (207) 288-6153