[ 
https://issues.apache.org/jira/browse/OAK-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15343773#comment-15343773
 ] 

Thomas Mueller commented on OAK-4430:
-------------------------------------

Some new statistics: retrieving blobs is now about 240 times faster.

* before: 8.3 blobs / second (4 min / 2k blobs) 
* now: 2000 blobs / second (1 sec / 2k blobs)

See also OAK-4200. Before this fix, GC took around 37 hours. 99.91% of that 
time was "retrieve of all blobs" (about 2 million). Retrieving references took 
about 1 minute 45 seconds, and sorting and deleting took around 10 seconds. 
With this fix, we can estimate GC will take around 10 minutes for 2 million 
blobs, where retrieving blobs takes around 80% of the time (compared to 99.91% 
before the fix). 
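
The percentages above can be reproduced with a little arithmetic. The sketch below is only an illustration: the class and method names are made up, and it assumes that the non-retrieval phases (retrieving references, sorting and deleting) take the same 1 minute 45 seconds plus 10 seconds before and after the fix.

```java
public class GcTimeShare {

    /**
     * Fraction of the total GC time spent retrieving blobs, given the
     * total time and the time spent in all other phases (in seconds).
     */
    public static double retrievalShare(double totalSeconds, double otherSeconds) {
        return (totalSeconds - otherSeconds) / totalSeconds;
    }

    public static void main(String[] args) {
        // Retrieving references (1 min 45 s) plus sorting and deleting (10 s).
        // Assumption: these phases are unchanged by the fix.
        double otherSeconds = 105 + 10;

        double before = retrievalShare(37 * 3600, otherSeconds); // 37 hours
        double after = retrievalShare(10 * 60, otherSeconds);    // 10 minutes

        System.out.printf("share before: %.2f%%%n", before * 100);
        System.out.printf("share after:  %.1f%%%n", after * 100);
        // Throughput ratio from the figures above: 2000 vs. 8.3 blobs / second.
        System.out.printf("speedup:      %.0fx%n", 2000 / 8.3);
    }
}
```

With these inputs the share before the fix comes out at roughly 99.91%, and the share after the fix at a bit over 80%, matching the figures quoted above.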


> DataStoreBlobStore#getAllChunkIds fetches DataRecord when not needed
> --------------------------------------------------------------------
>
>                 Key: OAK-4430
>                 URL: https://issues.apache.org/jira/browse/OAK-4430
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: blob
>            Reporter: Amit Jain
>            Assignee: Amit Jain
>              Labels: candidate_oak_1_2, candidate_oak_1_4
>             Fix For: 1.6, 1.5.4
>
>
> DataStoreBlobStore#getAllChunkIds loads the DataRecord for checking that the 
> lastModifiedTime criteria is satisfied against the given 
> {{maxLastModifiedTime}}. 
> When {{maxLastModifiedTime}} has the value 0, it effectively means that any 
> last modified time check is ignored (currently the only usage, from 
> MarkSweepGarbageCollector). In that case the DataRecords should not be 
> fetched, as this can be very expensive, e.g. on calls to S3 with millions of 
> blobs.
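
The idea in the description can be sketched as follows. This is not the actual Oak code: the class `ChunkIdFilter`, the method `filterIds` and the `lastModifiedLookup` parameter are hypothetical names used for illustration, and the `<=` comparison is an assumption. The lookup function stands in for the expensive DataRecord fetch, and it is only called when {{maxLastModifiedTime}} is not 0.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.ToLongFunction;

public class ChunkIdFilter {

    /**
     * Returns the ids that satisfy the last modified time criteria.
     *
     * @param ids                  all chunk ids known to the store
     * @param maxLastModifiedTime  0 means "ignore the last modified time"
     * @param lastModifiedLookup   hypothetical stand-in for the expensive
     *                             record fetch (e.g. a remote call)
     */
    public static List<String> filterIds(Iterator<String> ids,
                                         long maxLastModifiedTime,
                                         ToLongFunction<String> lastModifiedLookup) {
        List<String> result = new ArrayList<>();
        while (ids.hasNext()) {
            String id = ids.next();
            if (maxLastModifiedTime == 0) {
                // No time check requested: accept the id without
                // fetching the record at all.
                result.add(id);
                continue;
            }
            // Only now pay for the lookup. Treating the bound as
            // inclusive is an assumption of this sketch.
            if (lastModifiedLookup.applyAsLong(id) <= maxLastModifiedTime) {
                result.add(id);
            }
        }
        return result;
    }
}
```

With a {{maxLastModifiedTime}} of 0 the lookup is never invoked, so the cost per blob is only that of iterating over the ids.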



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
