I thought I was working with fairly large datasets, but they have
recently started to include ~2 GB files in sets of more than 50. I have
run this sort of thing before as merged data, using tar to roll the
files up into one set, but with >100 GB tarfiles Galaxy on EC2 gets
very slow, although that's probably due to my implementation of
dataset type detection (untar and read through the files).
Since tarring/untarring isn't very clean, I want to switch from tarring
to creating composite files on merge, by putting a tool's results into
dataset.extra_files_path. This doesn't seem to be supported yet,
because do_merge currently passes only the output dataset.filename to
the respective datatype's merge method.
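For context, the current hook looks roughly like this (paraphrased from
memory, not verbatim Galaxy source; the real method lives on the
datatype classes):

    class Data(object):
        @staticmethod
        def merge(split_files, output_file):
            # do_merge hands the datatype only the flat output filename,
            # so all it can do is combine chunks into that one file.
            with open(output_file, "wb") as out:
                for path in split_files:
                    with open(path, "rb") as chunk:
                        out.write(chunk.read())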
I would like to pass more data to the merge method (say, the whole
dataset object) so that it can get at the composite files directory and
'merge' the files in there; see the sketch below. Good idea, bad idea?
If anyone has views on this, I'd love to hear them.
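Concretely, something like this (the CompositeData name and the merge
signature are hypothetical, just to illustrate what I mean):

    import os
    import shutil

    class CompositeData(object):
        @staticmethod
        def merge(split_files, output_dataset):
            # Hypothetical signature: do_merge would pass the dataset
            # object instead of dataset.filename, so the datatype can
            # reach extra_files_path and collect chunk results there.
            extra_dir = output_dataset.extra_files_path
            if not os.path.exists(extra_dir):
                os.makedirs(extra_dir)
            for idx, path in enumerate(split_files):
                # An index prefix keeps chunk files apart; a real
                # datatype would pick names meaningful to its format.
                dest = os.path.join(
                    extra_dir, "%d_%s" % (idx, os.path.basename(path)))
                shutil.copy(path, dest)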