[ https://issues.apache.org/jira/browse/BEAM-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944295#comment-15944295 ]

Davor Bonaci commented on BEAM-1735:
------------------------------------

This failure occurs in a batched copy request to Cloud Storage during FileBasedSink finalization.

Stack trace from an SDK built before the Beam package rename:
java.io.IOException: Error executing batch GCS request
        at com.google.cloud.dataflow.sdk.util.GcsUtil.executeBatches(GcsUtil.java:414)
        at com.google.cloud.dataflow.sdk.util.GcsUtil.copy(GcsUtil.java:421)
        at com.google.cloud.dataflow.sdk.io.FileBasedSink$GcsOperations.copy(FileBasedSink.java:663)
        at com.google.cloud.dataflow.sdk.io.FileBasedSink$FileBasedWriteOperation.copyToOutputFiles(FileBasedSink.java:378)
        at com.google.cloud.dataflow.sdk.io.FileBasedSink$FileBasedWriteOperation.finalize(FileBasedSink.java:342)
        at com.google.cloud.dataflow.sdk.io.Write$Bound$2.processElement(Write.java:416)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error trying to copy [cut]/part-14135-of-22733.protobuf.avro: {"code":403,"errors":[{"domain":"usageLimits","message":"User Rate Limit Exceeded","reason":"userRateLimitExceeded"}],"message":"User Rate Limit Exceeded"}
        at com.google.cloud.dataflow.sdk.repackaged.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:476)
        at com.google.cloud.dataflow.sdk.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:455)
        at com.google.cloud.dataflow.sdk.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:79)
        at com.google.cloud.dataflow.sdk.util.GcsUtil.executeBatches(GcsUtil.java:409)
        at com.google.cloud.dataflow.sdk.util.GcsUtil.copy(GcsUtil.java:421)
        at com.google.cloud.dataflow.sdk.io.FileBasedSink$GcsOperations.copy(FileBasedSink.java:663)
        at com.google.cloud.dataflow.sdk.io.FileBasedSink$FileBasedWriteOperation.copyToOutputFiles(FileBasedSink.java:378)
        at com.google.cloud.dataflow.sdk.io.FileBasedSink$FileBasedWriteOperation.finalize(FileBasedSink.java:342)
        at com.google.cloud.dataflow.sdk.io.Write$Bound$2.processElement(Write.java:416)
        at com.google.cloud.dataflow.sdk.util.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:49)
        at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.processElement(DoFnRunnerBase.java:139)
        at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:188)
        at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.processElement(ForwardingParDoFn.java:42)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerLoggingParDoFn.processElement(DataflowWorkerLoggingParDoFn.java:47)
        at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.process(ParDoOperation.java:55)
        at com.google.cloud.dataflow.sdk.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
        at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:221)
        at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:182)
        at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:69)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:284)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:220)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:170)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:192)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:172)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:159)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Error trying to copy [cut]/part-14135-of-22733.protobuf.avro: {"code":403,"errors":[{"domain":"usageLimits","message":"User Rate Limit Exceeded","reason":"userRateLimitExceeded"}],"message":"User Rate Limit Exceeded"}
        at com.google.cloud.dataflow.sdk.util.GcsUtil$3.onFailure(GcsUtil.java:484)
        at com.google.api.client.googleapis.batch.json.JsonBatchCallback.onFailure(JsonBatchCallback.java:54)
        at com.google.api.client.googleapis.batch.json.JsonBatchCallback.onFailure(JsonBatchCallback.java:50)
        at com.google.api.client.googleapis.batch.BatchUnparsedResponse.parseAndCallback(BatchUnparsedResponse.java:223)
        at com.google.api.client.googleapis.batch.BatchUnparsedResponse.parseNextResponse(BatchUnparsedResponse.java:155)
        at com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:253)
        at com.google.cloud.dataflow.sdk.util.GcsUtil$2.call(GcsUtil.java:402)
        at com.google.cloud.dataflow.sdk.util.GcsUtil$2.call(GcsUtil.java:400)
        at com.google.cloud.dataflow.sdk.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
        at com.google.cloud.dataflow.sdk.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
        at com.google.cloud.dataflow.sdk.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
        ... 3 more

Current retry logic:
* No retry at the individual-copy level within the batched request (a classification sketch for this failure follows the list).
* Retry exists at the batched-request level.
* Retry logic exists at the bundle level.
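
For context, each failed copy surfaces in GcsUtil's JsonBatchCallback.onFailure as a GoogleJsonError carrying the JSON body shown in the stack trace above. A minimal sketch of how such a per-copy failure could be classified as the retryable rate-limit case (the helper class and method names are illustrative, not existing SDK code):

import com.google.api.client.googleapis.json.GoogleJsonError;

/** Illustrative helper, not part of the SDK: decides whether a single
 *  batch-entry failure is the 403 rate-limit case that the GCS docs say
 *  should be retried with exponential backoff. */
final class RateLimitErrors {

  private RateLimitErrors() {}

  static boolean isRetryableRateLimit(GoogleJsonError error) {
    if (error == null || error.getErrors() == null || error.getCode() != 403) {
      return false;
    }
    for (GoogleJsonError.ErrorInfo info : error.getErrors()) {
      // Matches the {"reason":"userRateLimitExceeded"} body seen above, as well
      // as the plain "rateLimitExceeded" reason listed in the GCS status codes.
      if ("userRateLimitExceeded".equals(info.getReason())
          || "rateLimitExceeded".equals(info.getReason())) {
        return true;
      }
    }
    return false;
  }
}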

We could theoretically add logic to extract the individual failed copies from 
the batched request, keep track of them, and retry only those that failed 
(within the same bundle). This would be complex, error-prone logic, and it is 
hard to think of another failure type where it would be beneficial -- other 
than hitting quota within a small time window. I can be convinced of the value 
of a low-level fix if it can be shown that the probability of GCS failing a 
portion of a batched copy is high enough.
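
To make the idea concrete, here is a rough sketch of what per-copy retry within a single bundle might look like, assuming a hypothetical BatchCopier that reports per-copy outcomes (CopyRequest, CopyResult, executeCopyBatch, and the backoff constants are all illustrative, not GcsUtil's actual API):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: retries just the copies that failed with a
 *  rate-limit error, backing off exponentially between batch attempts. */
class BatchedCopyRetrySketch {

  /** Hypothetical description of one copy inside the batch. */
  static final class CopyRequest {
    final String source;
    final String destination;
    CopyRequest(String source, String destination) {
      this.source = source;
      this.destination = destination;
    }
  }

  /** Hypothetical per-copy outcome; error is null when the copy succeeded. */
  static final class CopyResult {
    final CopyRequest request;
    final IOException error;
    CopyResult(CopyRequest request, IOException error) {
      this.request = request;
      this.error = error;
    }
  }

  /** Stand-in for a batched GCS call that reports per-copy results instead
   *  of failing the whole batch on the first rate-limited entry. */
  interface BatchCopier {
    List<CopyResult> executeCopyBatch(List<CopyRequest> batch) throws IOException;
  }

  /** Crude check based on the error text; a real implementation would look
   *  at the GoogleJsonError reason, as in the classification sketch above. */
  static boolean isRateLimitError(IOException e) {
    return String.valueOf(e.getMessage()).contains("userRateLimitExceeded");
  }

  static void copyWithPerCopyRetry(
      BatchCopier copier, List<CopyRequest> copies, int maxAttempts)
      throws IOException, InterruptedException {
    List<CopyRequest> pending = new ArrayList<>(copies);
    long backoffMillis = 1000;
    for (int attempt = 1; !pending.isEmpty(); attempt++) {
      List<CopyRequest> rateLimited = new ArrayList<>();
      for (CopyResult result : copier.executeCopyBatch(pending)) {
        if (result.error == null) {
          continue;                           // this copy succeeded
        }
        if (isRateLimitError(result.error) && attempt < maxAttempts) {
          rateLimited.add(result.request);    // re-issue only this copy
        } else {
          throw result.error;                 // non-retryable, or out of attempts
        }
      }
      pending = rateLimited;
      if (!pending.isEmpty()) {
        Thread.sleep(backoffMillis);          // exponential backoff before retrying
        backoffMillis *= 2;
      }
    }
  }
}

Even in this simplified form, the bookkeeping illustrates why the approach above is described as complex and error-prone; the same exponential backoff could instead stay at the batched-request level.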


> Retry 403:rateLimitExceeded in GCS
> ----------------------------------
>
>                 Key: BEAM-1735
>                 URL: https://issues.apache.org/jira/browse/BEAM-1735
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-gcp
>            Reporter: Rafael Fernandez
>            Priority: Minor
>
> The GCS documentation [1] states that rateLimitExceeded, a 403 error, should 
> be retried exponentially. We currently do not retry it.
> [1] https://cloud.google.com/storage/docs/json_api/v1/status-codes 



