Hi Nick,

If you're running your own local instance, then nothing is impossible - it's 
just a bit ugly...

On 02/21/2011 12:36 PM, Nate Coraor wrote:
>> [...] many tools that operate on these files produces multiple,
>> uncompressed large files which quickly eat up the disk allocation.
>> [...] With no way to compress or archive files produced by
>> intermediate steps in a workflow, [...]

Here's a tool that compresses an input Galaxy dataset and then deletes the 
input file.
Deleting the input dataset from underneath Galaxy's feet obviously goes against 
everything Galaxy stands for, and I'm sure the Galaxy team does not endorse 
such solutions. It will also leave your database slightly out of sync with the 
actual files on disk.
But hey - desperate times call for desperate measures :)

========
<tool id="cshl_compress_input" name="Compress Input File">
  <description>for advanced users only!</description>
  <command>gzip -c  '$input' &gt; '$output' &amp;&amp; rm '$input'</command>
  <inputs>
    <param format="data" name="input" type="data" label="Dataset to Compress" />
    <param format="data" name="waitforinput" type="data" label="Tool to wait 
for" />
  </inputs>
  <outputs>
    <data format="gzip" name="output" />
  </outputs>
  <help>
**What it does**
DO NOT USE THIS TOOL UNLESS YOU KNOW WHAT YOU'RE DOING.
  </help>
</tool>
============

The second input parameter ("waitforinput") is there only to force this tool to 
run after another tool (the one that still needs the uncompressed input file) - 
connect this tool carefully in your workflow.

Making the output format "gzip" ensures the new compressed files can't be used 
with any regular tool.
Then create a similar "uncompress" tool that does the opposite.
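
Something like the following should work as the "uncompress" counterpart (just 
a sketch, untested): it takes a "gzip" dataset, gunzips it into a new dataset 
and deletes the compressed copy. The output format is the generic "data" type, 
so you'll probably have to fix the datatype afterwards via the dataset's "Edit 
Attributes" page.

========
<tool id="cshl_uncompress_input" name="Uncompress Input File">
  <description>for advanced users only!</description>
  <!-- gunzip the compressed dataset into a new dataset, then remove the compressed copy -->
  <command>gunzip -c '$input' &gt; '$output' &amp;&amp; rm '$input'</command>
  <inputs>
    <param format="gzip" name="input" type="data" label="Compressed dataset to uncompress" />
  </inputs>
  <outputs>
    <data format="data" name="output" />
  </outputs>
  <help>
**What it does**
DO NOT USE THIS TOOL UNLESS YOU KNOW WHAT YOU'RE DOING.
  </help>
</tool>
========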

>> And this brings me to the second problem. Getting large files out
>> of Galaxy. The only way to save large files from Galaxy (that I can
>> see) is the save icon, which downloads the file via http. This take
>> *ages* for a large file and also causes big headaches for my
>> firefox browser. 

Here are three solutions (in varying levels of ugliness) to get files out of 
Galaxy:

1. This simple tool will tell you the full path of your dataset:
=========
<tool id="cshl_get_dataset_full_path" name="Get dataset full path">
  <description>for advanced users only!</description>
  <command>readlink -f '$input' &gt; '$output'</command>
  <inputs>
    <param format="data" name="input" type="data" label="Show full path of 
dataset" />
  </inputs>
  <outputs>
    <data format="txt" name="output" />
  </outputs>
  <help>
**What it does**
DO NOT USE THIS TOOL UNLESS YOU KNOW WHAT YOU'RE DOING.
  </help>
</tool>
=========

Run it on any input dataset; the output will contain the dataset's full path on 
your local filesystem.
It goes without saying that this is a security hazard - only use this tool if 
you know what you're doing, and you trust your users.
Once you have the full path, just access the file directly, outside of Galaxy.


2. The following tool allows the user to export a dataset into a hard-coded 
directory (/tmp/galaxy_export in this example).
This is just a proof of concept; for a production environment you'll need to 
add validators to the "description" parameter to prevent users from entering 
unwanted characters.
But it works - once the tool is run, the selected dataset will appear under 
/tmp/galaxy_export/$userEmail/ .
=============
<tool id="cshl_export_to_local" name="Export to local file">
        <description>for advanced users only!</description>
        <command>
                mkdir -p /tmp/galaxy_export/$userEmail &amp;&amp;
                ln -s '$input' '/tmp/galaxy_export/$userEmail/${input.hid}_${description}.${input.extension}'
        </command>
        <inputs>
                <param format="data" name="input" type="data" label="Dataset to 
Export" />
                <param name="description" type="text" size="30" label="File 
name" />
        </inputs>
        <outputs>
                <data format="txt" name="output" />
        </outputs>
        <help>
                **What it does**

                DO NOT USE THIS TOOL UNLESS YOU KNOW WHAT YOU'RE DOING.
        </help>
</tool>
================


3. Last but not least, if you have access to the database, getting the dataset 
path is easy if you know the dataset number or the dataset hash-id (and you have 
them as links on the Galaxy web page). This solution is not for the faint of 
heart, but if you want I can show examples of how to get from one to the other - 
a rough sketch is below.
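
For example, something like this (assuming the stock Galaxy schema - table and 
column names may differ between releases) maps the dataset number shown in your 
history (the "hid") to the underlying dataset id; the file on disk is then 
usually database/files/000/dataset_<id>.dat . The hash-id in the URL is the same 
id, just encoded with your instance's id_secret.

=========
-- Find the dataset id behind history item number 7 in history 42
-- (both numbers are made up - substitute your own).
SELECT hda.hid, hda.name, d.id AS dataset_id
FROM history_dataset_association hda
JOIN dataset d ON d.id = hda.dataset_id
WHERE hda.history_id = 42
  AND hda.hid = 7;
=========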


-gordon