https://bugs.kde.org/show_bug.cgi?id=374591
--- Comment #6 from Mario Frank <[email protected]> ---
Created attachment 103766
  --> https://bugs.kde.org/attachment.cgi?id=103766&action=edit
This patch introduces garbage collection as a maintenance stage and reduces the amount of generated garbage.

The garbage collector stage runs before the rebuilding of thumbnails.

Description of problems and approach.

The garbage that bloated my databases was quite annoying. Here is a sketch of where the garbage comes from:

1) Moving an image to trash. Every time I delete an image, the Images table entry is set to status Removed (3) and the original album id is removed from the entry. If I restore the image from trash, a new Images entry is generated.
2) Deleting images/albums directly or from trash. The image files are deleted from the hard drive, but the Images entries are not removed (nor are ImageTagProperties and so on).
3) Moving/renaming images creates a new Images entry. The old one is set to status Removed.
4) Deleting images does not remove the thumbnails (referenced by path/uniqueHash).
5) Deleting face regions does not remove the region thumbnails (custom identifier).
6) Removing a tag does not remove identities from the recognition db (every identity should have the same faceEngineUUID as a tag).

Here is a description of what the patch does:

1) Creating less junk: I introduced a new item status "Obsolete" and renamed the status "Removed" to "Trashed". Items are set to status Trashed if they are moved to trash. If items are deleted directly/permanently, they get the status "Obsolete". If an image is restored from trash, I search for an item entry that has status Trashed and has the same properties as the new one. If I find one, I reuse this entry and set the new/old album and the status to visible.
If an image is renamed/moved, I use the moveItem method of the core DB to set the new album/name of the image. This way, the ImageScanner does not think that this is a new image. The old entry is reused. This could solve the grouping problem.

I cannot solve points 4 to 6 in the same easy way, especially not the thumbnails problem, since thumbnails can be referenced by image path, by image uniqueHash/file size, and by image path/face region (custom identifier). Thus, I wrote a clean-up routine for our databases.

2) Collecting junk: I implemented the DbCleaner Maintenance module. It runs at every start of digiKam (if configured so in setup->misc) and removes all stale Images entries (detectable by status Obsolete). This does not take much time.

But the DbCleaner can do more. In the Maintenance dialog, I added a stage Database Cleanup, which can be triggered manually. The stage can also clean the thumbs db and the recognition db, but this must be explicitly selected in the menu, as it can take more time. Also, the thumbs and recognition db are never cleaned at the start of digiKam. I do not want our users to wait minutes until they can work.

Now to what the DbCleaner does. As already said, it removes the stale images. But let's take a step back. Getting the stale images is just one call to the core db. Getting the stale thumbnails is more complicated. Getting the stale face identities is less complicated. In the first phase of the DbCleaner, I analyse the databases (thumbs and recognition only if enabled). Identities are stale if there is no tag in the core db that has the same faceEngineUUID as the identity.
Thumbnails are stale if all of the following hold:

1) There is no image in the core DB whose file path leads to that thumbnail (FilePaths table).
2) There is no image in the core DB whose uniqueHash and file size lead to that thumbnail (UniqueHashes table).
3) There is no face region of an image whose custom identifier (image file path + region) leads to that thumbnail (CustomIdentifiers table).

So I first get all thumbnail ids from the thumbs db into a list A and all image ids into another list B. Then, for every image, I get the thumbnail ids by file path and by uniqueHash/file size, as well as the thumbnails for its face regions, and remove those thumbnail ids from list A. The remainder of the list is thus connected neither to an image/video nor to a face region. I know that this is not a really efficient way. But if, let's say, a face region is deleted, I cannot simply delete the thumbnail, since it could still be used by some other image, by file path for example.

When I am done with that, I first clean the core db (stale images). After that, I clean the thumbs db, and after that the recognition db. So much for the main process. The progress is shown to the user, along with what is currently being done (analyse, clean core DB, clean thumbs DB, clean recognition DB).

Then I tested my implementation with my database. I have about 40000 images, and my thumbnail db contains 96000 FilePaths entries, 205000 UniqueHashes entries and 180000 CustomIdentifiers entries. There are about 255000 Thumbnails entries. The file size of the database (SQLite) file is 2.9 GB. About 200000 of the thumbnails are recognised as stale.

Removing the thumbnails one by one per thread, as is done for thumbnail generation for example, took an exorbitant amount of time. To be frank (pun intended), I gave up after 2 hours. The context switching is a pain, since multi-threaded access to the thumbs db does not seem to work (well). Then I adapted my cleaner threads to work in chunks, with a modifiable chunk size (currently hard-coded).
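The stale-thumbnail rule and the list A / list B procedure described above can be sketched in Python as follows. This is only an illustration, not the patch's code: the dictionaries standing in for the FilePaths, UniqueHashes and CustomIdentifiers tables, the function name find_stale_thumbnails, the face-region identifier format and the sample data are all assumptions made for the example.

```python
def find_stale_thumbnails(all_thumb_ids, file_paths, unique_hashes,
                          custom_identifiers, images, face_regions):
    """Return the thumbnail ids that no image and no face region refers to.

    file_paths:          {path: thumb_id}              (stand-in for FilePaths)
    unique_hashes:       {(hash, size): thumb_id}      (stand-in for UniqueHashes)
    custom_identifiers:  {identifier: thumb_id}        (stand-in for CustomIdentifiers)
    """
    stale = set(all_thumb_ids)            # "list A": start with every thumbnail
    for image in images:                  # "list B": every image in the core db
        # Condition 1: reachable by file path.
        stale.discard(file_paths.get(image["path"]))
        # Condition 2: reachable by uniqueHash + file size.
        stale.discard(unique_hashes.get((image["hash"], image["size"])))
    for identifier in face_regions:
        # Condition 3: reachable by a face region's custom identifier.
        stale.discard(custom_identifiers.get(identifier))
    return stale                          # whatever is left is referenced by nothing


# Made-up sample data: thumbnails 4 and 5 belong to files that no longer exist.
all_thumb_ids = [1, 2, 3, 4, 5]
file_paths = {"/a.jpg": 1, "/gone.jpg": 4}
unique_hashes = {("h1", 100): 2, ("hx", 5): 5}
custom_identifiers = {"/a.jpg#face1": 3}
images = [{"path": "/a.jpg", "hash": "h1", "size": 100}]
face_regions = ["/a.jpg#face1"]

stale = find_stale_thumbnails(all_thumb_ids, file_paths, unique_hashes,
                              custom_identifiers, images, face_regions)
```

A thumbnail is only dropped from the candidate set, never deleted, during this pass. That is why deleting a face region alone cannot remove a thumbnail that some image still reaches by file path.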
This worked better, but there were still too many threads waiting for IO. Against all expectations, I found out that completely sequential cleaning is the fastest. If I let one worker thread remove all stale thumbnails, cleaning my complete databases takes only about 8 minutes on my 3-year-old Core i5 (with 16 GB RAM and no SSD). After vacuuming, my thumbs db has a size of only 650 MB.
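As a rough illustration of the sequential variant, the following Python sketch deletes all stale thumbnails on a single connection from a single thread and then vacuums the SQLite file. It is not the patch's code: the simplified schema, the function name remove_stale_thumbnails and the choice to wrap all deletions in one transaction are assumptions made for the example.

```python
import sqlite3

# Assumed, simplified schema: three lookup tables that point at Thumbnails.id.
LOOKUP_TABLES = ("FilePaths", "UniqueHashes", "CustomIdentifiers")


def remove_stale_thumbnails(conn, stale_ids):
    """Delete stale thumbnails one after another on a single connection."""
    rows = [(thumb_id,) for thumb_id in stale_ids]
    with conn:  # assumption: one transaction for the whole batch
        for table in LOOKUP_TABLES:
            conn.executemany(f"DELETE FROM {table} WHERE thumbId = ?", rows)
        conn.executemany("DELETE FROM Thumbnails WHERE id = ?", rows)
    # VACUUM cannot run inside a transaction; it rebuilds the database
    # so that the space freed by the deletions is actually released.
    conn.execute("VACUUM")


# Made-up sample database held in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Thumbnails (id INTEGER PRIMARY KEY, data BLOB);
    CREATE TABLE FilePaths (path TEXT, thumbId INTEGER);
    CREATE TABLE UniqueHashes (uniqueHash TEXT, fileSize INTEGER, thumbId INTEGER);
    CREATE TABLE CustomIdentifiers (identifier TEXT, thumbId INTEGER);
    INSERT INTO Thumbnails (id) VALUES (1), (2), (3);
    INSERT INTO FilePaths VALUES ('/a.jpg', 1), ('/gone.jpg', 2);
    INSERT INTO UniqueHashes VALUES ('hx', 5, 3);
""")

remove_stale_thumbnails(conn, [2, 3])
remaining = [row[0] for row in conn.execute("SELECT id FROM Thumbnails ORDER BY id")]
```

With a single writer there is no contention for the database lock, which is one plausible reason why the sequential run beats the threaded attempts described above.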
