Dear Galaxy developers,

The question in short:
How can Galaxy user data (e.g. file_path) be stored safely on a file system where files have a limited lifetime?

Galaxy will run on the head node of a cluster (~2000 cores), where data can be stored in three locations:

/home   (Individual User Homes) with 50 GB quota
/data   (Research Group Directories) with quotas from 4 to 100 TB
/work   (User Directories - Temporary File Area) with a 60-day lifetime

We plan to store Galaxy-related data as follows (a corresponding config sketch follows the list):

/work/galaxy/files <- file_path
/work/galaxy/tmp   <- new_file_path
/work/galaxy/jobs  <- job_working_directory
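
In the Galaxy configuration this would roughly translate to the following (just a sketch assuming the classic ini-style config; the option names are the standard Galaxy settings, the paths are the ones above):

    # config/galaxy.ini (excerpt)
    [app:main]
    file_path = /work/galaxy/files
    new_file_path = /work/galaxy/tmp
    job_working_directory = /work/galaxy/jobs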

As a note: storing these under /data/ would undermine our quota system, which our admins do not like.

/data/galaxy will contain the Galaxy installation, including tool data (and I hope that we can just set the quotas high enough to never run out of space).

Data libraries will be added using the "link mechanism" from /home/USER and /data/GROUP. I hope that I can automate the import and the appropriate setting of permissions via the API / bioblend. Are there already scripts for this?
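
What I have in mind is something along these lines (an untested sketch using bioblend; the URL, API key, group name, file path, and role handling are placeholders for our setup):

    from bioblend.galaxy import GalaxyInstance

    # Placeholders: adjust URL and admin API key to the actual instance.
    gi = GalaxyInstance(url="https://galaxy.example.org", key="ADMIN_API_KEY")

    # Create a data library for a research group and link (not copy) a file
    # that already lives under /data/GROUP.
    lib = gi.libraries.create_library(name="GROUP data",
                                      description="Linked data from /data/GROUP")
    gi.libraries.upload_from_galaxy_filesystem(
        library_id=lib["id"],
        filesystem_paths="/data/GROUP/project1/sample1.fastq",
        link_data_only="link_to_files",   # keep the data in place, only link it
    )

    # Restricting the library to the group's members is the part I would also
    # like to automate, e.g. (role ids could come from gi.roles.get_roles()):
    # gi.libraries.set_library_permissions(lib["id"], access_in=[role_id])

Linking instead of copying keeps the actual data under the group's /data quota, which is exactly the point of the link mechanism here.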

Is this scheme reasonable?

If yes: the main question is how I can guarantee that the data lifetime on /work/ and the Galaxy server play nicely together.

My idea consists of two parts:

1. Adapt cleanup_datasets.py (i.e. the function purge_histories) such that all histories (including those that have not been deleted) are purged once they reach the file system lifetime.
The modification seems to be to remove the test:
app.model.History.table.c.deleted == true()
The datasets contained in those histories will then be purged at the same time.

2. Using the API, I will get the update time of each history, or the update time of its most recently updated dataset (or is that the same thing anyway?). For the files corresponding to the contained datasets I will then update the access times on the file system. This way I can guarantee that only complete histories expire, rather than single files disappearing from an otherwise active history.
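
A rough, untested sketch of that script (it assumes an admin API key and that the API exposes dataset file paths, e.g. expose_dataset_path = True, so that show_dataset() returns a 'file_name'; otherwise the paths would have to be derived differently):

    import os
    from datetime import datetime, timedelta
    from bioblend.galaxy import GalaxyInstance

    LIFETIME = timedelta(days=59)   # one day less than the /work lifetime
    gi = GalaxyInstance(url="https://galaxy.example.org", key="ADMIN_API_KEY")  # placeholders
    now = datetime.utcnow()

    # Note: listing the histories of *all* users (not just the key owner's)
    # may need an admin-only option depending on the Galaxy/bioblend version.
    for history in gi.histories.get_histories():
        details = gi.histories.show_history(history["id"])
        updated = datetime.strptime(details["update_time"], "%Y-%m-%dT%H:%M:%S.%f")
        if now - updated > LIFETIME:
            continue   # stale history: leave it to the adapted cleanup_datasets.py
        # Touch all dataset files of this still-active history so /work keeps them.
        for item in gi.histories.show_history(history["id"], contents=True):
            info = gi.datasets.show_dataset(item["id"])
            path = info.get("file_name")   # only present when paths are exposed
            if path and os.path.exists(path):
                os.utime(path, None)       # reset atime/mtime to "now"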

The script(s) can then be run via cron, with the lifetime set to one day less than the file system lifetime (just to be safe).
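
Something like the following (hypothetical crontab entries; the script names and paths are placeholders for the pieces described above):

    # nightly: refresh access times of data belonging to still-active histories (part 2)
    0 3 * * * python /data/galaxy/scripts/touch_active_data.py
    # nightly: purge histories/datasets older than 59 days with the adapted cleanup_datasets.py (part 1)
    0 4 * * * /data/galaxy/scripts/run_adapted_cleanup.sh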

In theory, jobs could run longer than 60 days. Therefore my idea would be to update the access times of all files in the job_working_directory daily.
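
A minimal sketch of that (the path is the job_working_directory planned above):

    import os

    JOB_WORKING_DIRECTORY = "/work/galaxy/jobs"

    # Refresh atime/mtime of every file below the job working directory so that
    # files of long-running jobs are not expired by the 60-day policy.
    for root, dirs, files in os.walk(JOB_WORKING_DIRECTORY):
        for name in files:
            try:
                os.utime(os.path.join(root, name), None)
            except OSError:
                pass   # file may have vanished (job finished or was cleaned up)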

Thank you very much for any help.

Best,
Matthias


--

-------------------------------------------
Matthias Bernt
Bioinformatics Service
Molekulare Systembiologie (MOLSYB)
Helmholtz-Zentrum für Umweltforschung GmbH - UFZ/
Helmholtz Centre for Environmental Research GmbH - UFZ
Permoserstraße 15, 04318 Leipzig, Germany
Phone +49 341 235 482296,
m.be...@ufz.de, www.ufz.de

Sitz der Gesellschaft/Registered Office: Leipzig
Registergericht/Registration Office: Amtsgericht Leipzig
Handelsregister Nr./Trade Register Nr.: B 4703
Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board: MinDirig Wilfried Kraus
Wissenschaftlicher Geschäftsführer/Scientific Managing Director:
Prof. Dr. Dr. h.c. Georg Teutsch
Administrative Geschäftsführerin/ Administrative Managing Director:
Prof. Dr. Heike Graßmann
-------------------------------------------



