To add a little more detail on the collection question: Daniel
Blankenberg gave a talk on metagenomics at the Galaxy Community
Conference that spent a lot of time on "large" collection handling in
Galaxy - https://gcc16.sched.org/event/5Y0M/metagenomics-with-galaxy.
I think his rule of thumb was that a few hundred or a couple thousand
datasets in a collection can work through the GUI but there are real
performance problems with several thousand or tens of thousands of
datasets. The whole devteam has been making strides on this - and
I expect over the coming releases we will keep addressing these
limitations and pushing that number higher. I'd say dealing with a
hundred thousand datasets in a collection through the GUI isn't
currently feasible. Collections of that size might be okay if driven
through the API instead - but I'm not sure that would work either. It might
be better to continue to work with zip files for now if that is the
scale you need to reach.
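
If you do want to experiment with the API route, here is a minimal sketch of the client-side bookkeeping: list the files in a zip archive, split them into batches that stay within the size range that reportedly works, and build one list-collection description per batch. This is only an illustration under stated assumptions, not a tested recipe. The function names are made up for this example, the batch size of 1000 is just a reading of the rule of thumb above, the dataset ids are placeholders for whatever ids your uploads return, and the payload shape (collection_type, name, element_identifiers with src "hda") is my understanding of the collection-creation request and should be checked against the Galaxy release you run. The sketch does not contact Galaxy at all.

```python
import zipfile


def list_members(archive_path):
    """Return the names of the regular files in a zip archive, skipping directories."""
    with zipfile.ZipFile(archive_path) as archive:
        return [info.filename for info in archive.infolist() if not info.is_dir()]


def chunk(items, size=1000):
    """Split a list into consecutive batches of at most `size` items.

    The default of 1000 is an illustrative choice, not a Galaxy limit.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def collection_payload(name, members, dataset_ids):
    """Build a list-collection description for one batch of archive members.

    `dataset_ids` maps each member name to the id of the dataset created when
    that file was uploaded (placeholder values in this sketch). The dictionary
    layout is an assumption about the collection-creation request body.
    """
    return {
        "collection_type": "list",
        "name": name,
        "element_identifiers": [
            # The full member path is used as the element name so that
            # files with the same basename in different folders do not collide.
            {"name": member, "src": "hda", "id": dataset_ids[member]}
            for member in members
        ],
    }
```

Each payload would then be submitted separately, for example through BioBlend or a direct request, so that no single collection grows beyond what the interface handles well. Whether many medium-sized collections actually behave better than one huge one at your scale is something you would need to measure on your own instance.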
Thanks for the detailed question and for working on such an exciting
application of Galaxy,
On Thu, Aug 4, 2016 at 11:44 AM, Stephan Oepen <o...@ifi.uio.no> wrote:
> in our adaptation of galaxy for large-scale natural language
> processing, a fairly common use pattern is to invoke a workflow on a
> potentially large number of text files. hence, i am wondering about
> facilities for uploading an archive (in ‘.zip’ or ‘.tgz’ format, say)
> containing several files, where i would like the upload tool to
> extract the files from the archive, import each individually into my
> history, and (maybe optionally) create a list collection for the set
> of files.
> in my current galaxy instance (running version 2015.03), when i upload
> a multi-file ‘.zip’ file, part of the above actually happens: however,
> the upload tool only imports the first file extracted from the archive
> (and helpfully shows a warning message on the corresponding history
> entry). have there been relevant changes in this neighborhood in more
> recent galaxy releases?
> related to the above, we have started to experiment with potentially
> large collections and are beginning to worry about the scalability of
> the collection mechanism. in principle, we would like to operate on
> collections comprised of tens or hundreds of thousands of individual
> datasets. what are common collection sizes (in the number of
> components, not so much in the aggregate file size) used in other
> galaxy instances to date? what kind of gut reaction do galaxy
> developers have to the idea of a collection containing, say, a hundred
> thousand entries?
> with thanks in advance,
> Please keep all replies on the list by using "reply all"
> in your mail client. To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
> To search Galaxy mailing lists use the unified search at: