[https://issues.apache.org/jira/browse/OAK-7859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676717#comment-16676717]
Wim Symons commented on OAK-7859:
---------------------------------
Hi Amit,
We made the change to run collectGarbage(false, false), but DSGC performance
has degraded significantly: it now takes 3.5 hours instead of 1.5 minutes on
our development instance.
Here is the output of the BlobGarbageCollection MBean for that environment:
|GlobalMarkStats|
|org.apache.jackrabbit.oak.plugins.blob.BlobGC|
|markEndTime|markStartTime|numReferences|referenceFileSizeBytes|referencesFileSize|repositoryId|
|Tue Nov 06 12:00:36 UTC 2018|Tue Nov 06 12:00:35 UTC 2018|29040|2039209|2.0 MB|f1945c73-291f-4edd-81db-6f62f9830e9a *|
|Tue Nov 06 12:00:31 UTC 2018|Tue Nov 06 12:00:29 UTC 2018|24192|1692187|1.7 MB|fb62b285-8e18-4175-a032-eb2e8280f4e1|
During the first collectGarbage(false, false) run, it logged about 1.5 million
errors like this:
{code:java}
05.11.2018 15:29:24.096 *WARN* [sling-oak-observation-1023] org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector Error occurred while deleting blob with id [ffff2233170c4a5055874998e065ab0996100ebe057244dbe86e984d3695290a#38263]
org.apache.jackrabbit.core.data.DataStoreException: com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 10F4F1117A8EFD06), S3 Extended Request ID: OaIZl7m5oh3Onzm7fqILYz3qwIv4ZRfkFFOuMUb2I+tSbGexRiEH10dEBdjayIiEb1Y/377NSyU=
    at org.apache.jackrabbit.oak.blob.cloud.s3.S3Backend.getRecord(S3Backend.java:515)
    at org.apache.jackrabbit.oak.plugins.blob.AbstractSharedCachingDataStore.getRecordForId(AbstractSharedCachingDataStore.java:435)
    at org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore.getRecordForId(DataStoreBlobStore.java:519)
    at org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore.countDeleteChunks(DataStoreBlobStore.java:425)
    at org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector$BlobCollectionType$1.sweepInternal(MarkSweepGarbageCollector.java:736)
    at org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector.sweep(MarkSweepGarbageCollector.java:400)
    at org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector.markAndSweep(MarkSweepGarbageCollector.java:266)
    at org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector.collectGarbage(MarkSweepGarbageCollector.java:178)
    at org.apache.jackrabbit.oak.plugins.blob.BlobGC$2.call(BlobGC.java:103)
    at org.apache.jackrabbit.oak.plugins.blob.BlobGC$2.call(BlobGC.java:99)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 10F4F1117A8EFD06)
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1389)
    at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:902)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
    at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
    at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
    at org.apache.jackrabbit.oak.blob.cloud.s3.S3Backend.getRecord(S3Backend.java:502)
    ... 13 common frames omitted
{code}
I thought this would only happen during the first run, but the second run does
exactly the same.
When I check the META folder of our datastore, it contains this:
!Screenshot 2018-11-06 at 13.44.15.png!
Any clue why this happens?
> S3 Bucket iterator stops too early
> ----------------------------------
>
> Key: OAK-7859
> URL: https://issues.apache.org/jira/browse/OAK-7859
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: blob-cloud
> Affects Versions: 1.6.6
> Reporter: Wim Symons
> Assignee: Amit Jain
> Priority: Critical
> Labels: newbie, pull-request-available
> Attachments: META-ids.txt, Screenshot 2018-11-06 at 13.44.15.png,
> dsgc-bis.zip, dsgc.log, meta-info.txt
>
>
> Fixed a major bug in the S3 bucket iterator.
> When the returned queue of records is empty because a full page of records
> starts with the META/ key, the iterator stops even though there is still data
> available in the bucket.
> This causes problems with datastore GC and datastore consistency checks
> (both online and offline), and possibly more.
> A little explainer, based on a batch size of 2 instead of 1000.
> Suppose your list of S3 keys looks as follows:
> * 1
> * 2
> * 3
> * 4
> * META/1
> * META/2
> * 5
> * 6
> loadBatch would first load [1, 2], filter out no META/ keys and pass [1, 2]
> to the caller.
> Next time, loadBatch would load [3, 4], filter out no META/ keys and pass [3,
> 4] to the caller.
> Then, loadBatch would load [META/1, META/2], filter out both META/ keys and
> pass [] to the caller.
> When that happens, traversal of the bucket stops because the returned list is
> empty, even though there are more batches to load.
> The fix checks whether the returned list is empty while more batches are
> available; if so, it loads new batches until a batch contains data or no more
> batches are available (see the sketch below).
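> A minimal, self-contained sketch of the failure mode and the fix (the class,
> field, and method names are illustrative placeholders, not the actual
> S3Backend internals):
> {code:java}
> import java.util.ArrayDeque;
> import java.util.List;
> import java.util.Queue;
>
> // Simulates the paged bucket listing from the explainer above
> // (batch size 2, with a full page of META/ keys in the middle).
> public class BatchIteratorSketch {
>
>     static final List<List<String>> PAGES = List.of(
>             List.of("1", "2"),
>             List.of("3", "4"),
>             List.of("META/1", "META/2"), // entire page filtered away
>             List.of("5", "6"));
>
>     private int nextPage = 0;
>     private final Queue<String> queue = new ArrayDeque<>();
>
>     private boolean hasMoreBatches() {
>         return nextPage < PAGES.size();
>     }
>
>     // Loads one page and drops META/ keys, as the real iterator does.
>     private void loadBatch() {
>         for (String key : PAGES.get(nextPage++)) {
>             if (!key.startsWith("META/")) {
>                 queue.add(key);
>             }
>         }
>     }
>
>     // Buggy behaviour: loads at most one batch, so an all-META/ page
>     // leaves the queue empty and iteration stops before keys 5 and 6.
>     public boolean hasNextBuggy() {
>         if (queue.isEmpty() && hasMoreBatches()) {
>             loadBatch();
>         }
>         return !queue.isEmpty();
>     }
>
>     // Fixed behaviour: keeps loading until some batch yields data or
>     // the bucket is exhausted.
>     public boolean hasNextFixed() {
>         while (queue.isEmpty() && hasMoreBatches()) {
>             loadBatch();
>         }
>         return !queue.isEmpty();
>     }
>
>     public static void main(String[] args) {
>         BatchIteratorSketch it = new BatchIteratorSketch();
>         while (it.hasNextFixed()) {
>             System.out.print(it.queue.poll() + " "); // prints: 1 2 3 4 5 6
>         }
>     }
> }
> {code}
> Swapping the loop over to hasNextBuggy() stops after "1 2 3 4", which is
> exactly the early termination described above.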
> We are currently running Oak 1.6.6 on AEM 6.3.1.2, but as the bug is still in
> trunk, all previous versions of Oak are affected as well.
> I provided 2 pull requests: one for trunk
> ([https://github.com/apache/jackrabbit-oak/pull/103]) and one for the 1.6
> branch ([https://github.com/apache/jackrabbit-oak/pull/104]).
> CI failed on [https://github.com/apache/jackrabbit-oak/pull/103], but I don't
> think it's related to my changes.
> For the record, the patch works: I was able to successfully test it on our
> production repository using oak-run --id. With version 1.6.6 it reported 800k
> items; with my patched version, it reported 1.8m items. (Our META/ nodes are
> listed somewhere half-way through.)
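> The invocation looked roughly like the following (exact option names vary by
> oak-run version; the flags and paths below are an assumed shape of the
> 1.6-era datastorecheck command, not a verbatim command):
> {code}
> java -jar oak-run-1.6.6.jar datastorecheck --id \
>     --s3ds /path/to/S3DataStore.config \
>     --dump /tmp/blob-ids
> {code}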
>