[ https://issues.apache.org/jira/browse/HADOOP-17180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195055#comment-17195055 ]

David Kats commented on HADOOP-17180:
-------------------------------------

Hi Steve,

thank you for your answer.

We can't easily upgrade to 3.2.0 just to try it -

we are running a few large systems over tens of thousands of cores.

Also, HADOOP-15426 doesn't seem to address 500 system errors.

If treating 500 as a throttle event gets addressed in 3.3,

we'll move to 3.3 (i.e. this fix doesn't have to be back-ported to 3.1).

 

We are constantly running into this issue, with jobs dying, and all our work with 
AWS so far has yielded nothing, so it looks like this should be addressed on the 
S3Guard side.
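To make the ask concrete, here is a minimal sketch of the behaviour we'd like - treat a 500 InternalServerError the same way a throttling error is treated, with exponential backoff. This is plain Java for illustration only, not actual S3Guard or AWS SDK code; the class and method names are made up:

```java
import java.util.concurrent.Callable;

// Sketch only: treat HTTP 500 ("InternalServerError") from DynamoDB like a
// throttling event and retry with exponential backoff. Names here are
// illustrative, not actual S3Guard or AWS SDK classes.
public class BackoffSketch {

    /** Stand-in for an AWS service exception carrying an HTTP status code. */
    static class ServiceException extends RuntimeException {
        final int statusCode;
        ServiceException(int statusCode, String msg) {
            super(msg);
            this.statusCode = statusCode;
        }
    }

    /** 500s and 503s are both transient capacity problems: retry both. */
    static boolean isRetryable(ServiceException e) {
        return e.statusCode == 500 || e.statusCode == 503;
    }

    static <T> T retryWithBackoff(Callable<T> op, int maxAttempts,
                                  long baseDelayMs) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (ServiceException e) {
                if (!isRetryable(e) || attempt == maxAttempts) {
                    throw e;  // non-retryable, or out of attempts
                }
                // exponential backoff: baseDelayMs * 2^(attempt - 1)
                long delay = baseDelayMs << (attempt - 1);
                Thread.sleep(delay);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated DynamoDB call that returns 500 twice, then succeeds.
        final int[] calls = {0};
        String result = retryWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new ServiceException(500, "Internal server error");
            }
            return "item";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints: item after 3 attempts
    }
}
```

In production you'd also want jitter and a delay cap, but the point is simply that a 500 lands in the retryable branch instead of failing the job outright.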

Other than that, S3Guard works great for us - thanks a lot for a solid product :)

Appreciate your help,

David

 

> S3Guard: Include 500 DynamoDB system errors in exponential backoff retries
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-17180
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17180
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.1.3
>            Reporter: David Kats
>            Priority: Major
>         Attachments: image-2020-08-03-09-58-54-102.png
>
>
> We get fatal failures from S3Guard (which in turn fail our Spark jobs) because 
> of internal DynamoDB system errors:
> {code}
> com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: 00EBRE6J6V8UGD7040C9DUP2MNVV4KQNSO5AEMVJF66Q9ASUAAJG): Internal server error (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: 00EBRE6J6V8UGD7040C9DUP2MNVV4KQNSO5AEMVJF66Q9ASUAAJG)
> {code}
> DynamoDB has a separate statistic for system errors:
> !image-2020-08-03-09-58-54-102.png!
> I contacted AWS Support and got an explanation that these 500 errors are 
> returned to the client once DynamoDB gets overwhelmed with client requests.
> So essentially the traffic should have been throttled, but it wasn't, and we 
> got 500 system errors instead.
> My point is that the client should handle these errors just like throttling 
> exceptions - with exponential backoff retries.
>  
> Here is a more complete exception stack trace:
>  
> {code}
> org.apache.hadoop.fs.s3a.AWSServiceIOException: get on s3a://rem-spark/persisted_step_data/15/0afb1ccb73854f1fa55517a77ec7cc5e__b67e2221-f0e3-4c89-90ab-f49618ea4557__SDTopology/parquet.all_ranges/topo_id=321: com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: 00EBRE6J6V8UGD7040C9DUP2MNVV4KQNSO5AEMVJF66Q9ASUAAJG): Internal server error (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: 00EBRE6J6V8UGD7040C9DUP2MNVV4KQNSO5AEMVJF66Q9ASUAAJG)
> 	at org.apache.hadoop.fs.s3a.S3AUtils.translateDynamoDBException(S3AUtils.java:389)
> 	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:181)
> 	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111)
> 	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.get(DynamoDBMetadataStore.java:438)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2110)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerListStatus(S3AFileSystem.java:1889)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listStatus$9(S3AFileSystem.java:1868)
> 	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:1868)
> 	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$listLeafFiles(InMemoryFileIndex.scala:277)
> 	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3$$anonfun$apply$2.apply(InMemoryFileIndex.scala:207)
> 	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3$$anonfun$apply$2.apply(InMemoryFileIndex.scala:206)
> 	at scala.collection.immutable.Stream.map(Stream.scala:418)
> 	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3.apply(InMemoryFileIndex.scala:206)
> 	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3.apply(InMemoryFileIndex.scala:204)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:123)
> 	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: 00EBRE6J6V8UGD7040C9DUP2MNVV4KQNSO5AEMVJF66Q9ASUAAJG)
> 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1639)
> 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
> 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
> 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
> 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
> 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
> 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
> 	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
> 	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
> 	at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:2925)
> 	at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:2901)
> 	at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeGetItem(AmazonDynamoDBClient.java:1640)
> 	at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.getItem(AmazonDynamoDBClient.java:1616)
> 	at com.amazonaws.services.dynamodbv2.document.internal.GetItemImpl.doLoadItem(GetItemImpl.java:77)
> 	at com.amazonaws.services.dynamodbv2.document.internal.GetItemImpl.getItem(GetItemImpl.java:66)
> 	at com.amazonaws.services.dynamodbv2.document.Table.getItem(Table.java:608)
> 	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.getConsistentItem(DynamoDBMetadataStore.java:423)
> 	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.innerGet(DynamoDBMetadataStore.java:459)
> 	at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.lambda$get$2(DynamoDBMetadataStore.java:439)
> 	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
> 	... 29 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
