Hello Assaf,
thank you for the excellent explanation; the situation is much clearer to me now.

06.07.2011 21:19, Assaf Gordon wrote:
Hello Sergei,

I'm experimenting with the clean-up scripts myself, so perhaps I can offer some 
information (the Galaxy team is welcome to correct me and/or explain better).

1. If you look at the output of your query, you'll notice that the "purged" field is 0 
for all datasets (I assume 0 is "false" in MySQL).
This means that the actual files were *not* purged (i.e. physically deleted) - at least not by the 
"purge_datasets.sh" or "cleanup_datasets.py -3" step.
Since you did use the "-r" parameter, it means those datasets were not picked up as 
possible deletion candidates by this script.
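
To list the datasets that are in this in-between state (flagged as deleted but not yet purged), something along these lines may help. It is only a sketch: it uses an in-memory SQLite table as a stand-in for the MySQL "dataset" table, models just a few of the columns from your output, and apart from row 53 the rows are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the Galaxy "dataset" table (assumption: only a few
# of the columns visible in the query output are modelled here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset (id INTEGER PRIMARY KEY, state TEXT, "
    "deleted INTEGER, purged INTEGER, file_size INTEGER)"
)
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?, ?, ?)",
    [
        (53, "error", 1, 0, 0),    # taken from the query output
        (1001, "ok", 1, 1, 2048),  # hypothetical: deleted and already purged
        (1002, "ok", 0, 0, 4096),  # hypothetical: still in use
    ],
)


def deleted_but_not_purged(connection):
    """Return ids of datasets flagged as deleted whose files were never purged."""
    rows = connection.execute(
        "SELECT id FROM dataset WHERE deleted = 1 AND purged = 0 ORDER BY id"
    )
    return [row[0] for row in rows]


pending = deleted_but_not_purged(conn)
print(pending)
```

The same WHERE clause should work unchanged against MySQL, since it only relies on the "deleted" and "purged" columns shown in your output.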

2. (The following I found by reading the source code, it's not really well 
explained - so if I'm wrong - correct me).
The "dataset" table has an "update_time" field, and this field is updated 
automatically whenever the dataset record changes.
This means that when you run the first cleanup script and set the "deleted" flag to true, 
the update_time is updated to "now".
When you run the next clean-up script and ask for anything that is older than 1 day ("-d 1"), it 
looks for an update_time older than one day - so it will *not* find the dataset that was just marked as 
"deleted" in the first step (because the update_time is "now"). Only if you run the next 
clean-up script tomorrow will that dataset be deleted.

So, for example, running the following in succession:
cleanup_datasets.py universe_wsgi.ini -d 1 -6    ( =>  delete datasets )
cleanup_datasets.py universe_wsgi.ini -d 1 -3 -r ( =>  purge datasets + delete 
physical files)

both run with "-d 1" - but by design, files from yesterday (1 day old) will not 
be physically deleted.

Files that the user deleted yesterday (1 day old) will be marked as "deleted", but their 
update_time will be "now".
Only files that were marked as deleted yesterday will be deleted today 
(meaning: they are 2 days old).
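
The timing can be sketched in a few lines of Python. This is my reading of the behaviour rather than the script's actual code: the function name is made up, the timestamps are illustrative, and I am assuming a strict "older than" comparison against a cutoff of now minus the "-d" value.

```python
from datetime import datetime, timedelta


def is_candidate(update_time, days, now):
    """Return True if update_time is older than the cutoff (now - days).

    Assumption: a strict "older than" comparison; the real script may differ.
    """
    cutoff = now - timedelta(days=days)
    return update_time < cutoff


first_run = datetime(2011, 7, 6, 14, 17, 49)   # "-6" step sets deleted=true
second_run = first_run + timedelta(minutes=1)  # "-3 -r" step runs right after

# Marking the dataset as deleted bumps update_time to the time of the first run.
update_time = first_run

same_day_d1 = is_candidate(update_time, 1, second_run)
next_day_d1 = is_candidate(update_time, 1, second_run + timedelta(days=1))
same_day_d0 = is_candidate(update_time, 0, second_run)

print(same_day_d1, next_day_d1, same_day_d0)  # False True True
```

With "-d 1" the freshly marked dataset only becomes a candidate on the following day, while "-d 0" picks it up immediately.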

To really delete files now, use "-d 0" with all the scripts.
Since this is quite scary, the "-i" (info only) mode will show what will 
be deleted (but that requires a recent version: 5770:a5e0a5d3c0a1).

3. The file_size=NULL issue happens when a job fails - on some occasions (I couldn't 
pinpoint exactly when) Galaxy does not pick up the fact that an output file was generated 
even if the job failed, and so you get "ghost" files which exist on the disk 
but are NULL in the database.
The "discarded" state means the job was discarded (by the Galaxy user?) - not that the 
dataset was deleted/purged by the clean-up scripts.
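
If you want to hunt for such "ghost" files, a rough sketch follows. It assumes the files are named "dataset_<id>.dat" (adjust the pattern if your installation differs) and that you have already fetched the id and file_size values from the database into a dictionary; the function name, ids and sizes in the demo are made up.

```python
import os
import re
import tempfile

# Assumption: dataset files are named "dataset_<id>.dat".
DATASET_FILE = re.compile(r"^dataset_(\d+)\.dat$")


def find_ghost_files(files_dir, db_file_sizes):
    """Return {dataset_id: size_on_disk} for files that exist on disk
    while their database file_size is NULL (None in Python)."""
    ghosts = {}
    for name in sorted(os.listdir(files_dir)):
        match = DATASET_FILE.match(name)
        if not match:
            continue  # not a dataset file, ignore it
        dataset_id = int(match.group(1))
        if dataset_id in db_file_sizes and db_file_sizes[dataset_id] is None:
            ghosts[dataset_id] = os.path.getsize(os.path.join(files_dir, name))
    return ghosts


# Demo with a throwaway directory and made-up ids and sizes.
with tempfile.TemporaryDirectory() as tmp:
    for dataset_id, payload in [(1, b"abc"), (2, b"hello")]:
        with open(os.path.join(tmp, "dataset_%d.dat" % dataset_id), "wb") as fh:
            fh.write(payload)
    # Dataset 1 has a recorded size, dataset 2 is NULL in the database.
    ghosts = find_ghost_files(tmp, {1: 3, 2: None})

print(ghosts)  # {2: 5}
```

This only reports the files; it does not delete anything or touch the database.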

Hope this helps,

Sergei Ryazansky wrote, On 07/06/2011 12:15 PM:
Thank you for the answer.
I have tried to use the mentioned scripts, but it seems that the order in which 
I ran them the first time was incorrect. As a result, the metadata in the database 
tables was modified, but the files corresponding to the datasets deleted from the 
history were not removed. Calling the scripts afterwards in the 
right order (as indicated in the wiki) also didn't delete the unused dataset files. 
Is there any way to update the metadata in the tables to match the real state 
of the files?
I think the scripts were called in the following order the first time:
cleanup_datasets.py universe_wsgi.ini -d 1 -6 -r
cleanup_datasets.py universe_wsgi.ini -d 6 -1 -r
cleanup_datasets.py universe_wsgi.ini -d 2 -1 -r
cleanup_datasets.py universe_wsgi.ini -d 3 -1 -r
cleanup_datasets.py universe_wsgi.ini -d 4 -1 -r
cleanup_datasets.py universe_wsgi.ini -d 5 -1 -r
cleanup_datasets.py universe_wsgi.ini -d 1 -1 -r
cleanup_datasets.py universe_wsgi.ini -d 1 -2 -r
cleanup_datasets.py universe_wsgi.ini -d 1 -4 -r
cleanup_datasets.py universe_wsgi.ini -d 1 -5 -r
cleanup_datasets.py universe_wsgi.ini -d 1 -3 -r
cleanup_datasets.py universe_wsgi.ini -d 1 -6 -r

Also, there are some strange things (imho) in the galaxy.dataset table: a lot 
of dataset ids have a NULL total size:

mysql>  select * from dataset where (id="148" or id="53" or id="86" or id="146" or 
| id | create_time | update_time | state | deleted | purged | purgable | external_filename | _extra_files_path | file_size | total_size |
| 53 | 2011-03-29 16:21:58 | 2011-07-06 14:17:49 | error | 1 | 0 | 1 | NULL | NULL | 0 | NULL |
| 86 | 2011-03-29 20:35:44 | 2011-07-06 14:17:52 | discarded | 1 | 0 | 1 | NULL 
| 146 | 2011-05-26 01:38:14 | 2011-07-06 14:18:00 | error | 1 | 0 | 1 | NULL | 
| 148 | 2011-05-26 02:20:44 | 2011-07-06 14:18:00 | discarded | 1 | 0 | 1 | 
| 330 | 2011-07-05 00:44:44 | 2011-07-05 00:44:44 | NULL | 0 | 0 | 1 | NULL | 

I don't know what these records looked like before the cleanup scripts were called, but is 
it possible that this is because of the incorrect order in which they were called? Does the 
"discarded" state mean that the corresponding file should be deleted? In my 
case all these files are still in the database folder.
Please let me know if you need any further clarification of my questions.

2011/7/6 Hans-Rudolf Hotz <h...@fmi.ch>

     Hi Sergei

     This is a question better asked on 'galaxy-...@bx.psu.edu' since you refer to your 
local Galaxy installation.

     In order to remove the data from your file system, you need to run the 
'cleanup scripts', as described on this wiki page:


     Regards, Hans

     On 07/06/2011 03:33 PM, Sergei Ryazansky wrote:

         -------- Original Message --------
         Subject: deleting datasets from history
         Date:    Tue, 5 Jul 2011 19:58:45 +0300
         From:    Sergei 

         Hello all,

         After deleting datasets from the history panel in our Galaxy, 
         the indicator at the top right corner shows the same amount of used
         space as before the deletion. Also, the files corresponding to the datasets
         remain in the Galaxy database/files/000 directory. It seems that
         deleting datasets from the history removes only the link to the file but
         not the file itself. How can I configure our Galaxy mirror to delete not
         only the records in the history panel but also the corresponding files?
         Thank you in advance!

