ergophobiac opened a new issue, #10415: URL: https://github.com/apache/hudi/issues/10415
**Describe the problem you faced**

Stack: Hudi 0.13.1, EMR 6.13.0, Spark 3.4.1

We are writing to an MOR table on S3 using a Spark Structured Streaming job on EMR. Once the job has run for a while (12+ hours), we notice that, at random times, there are long periods with no operations/commits on the timeline. Usually the offending executor is removed and the task is retried, but eventually the job stalls and the application exits with fatal errors on the driver.

We are using deployment model B: single writer with async table services, plus OCC for the metadata table. We also tried the DynamoDB-based lock provider (in case OCC was the culprit), and the same thing happened.

Sometimes a deltacommit, clean, or compaction goes inflight but is never completed, and the Spark driver stderr reports the same error over and over (see the stacktrace below). The timeout always occurs while trying to read column stats.

Meanwhile, the timeline shows no progress: the deltacommit that went inflight at 10:20:01 AM is stuck, whereas all prior deltacommits finished within 45 seconds. The micro-batch interval is 5 minutes, and the Spark Web UI shows no active tasks.

In some instances the tasks eventually succeed and the job progresses, until it gets stuck again with the same errors.
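For context on the error itself: `Timeout waiting for connection from pool` is thrown by the (shaded) AWS SDK when the S3 client's HTTP connection pool is exhausted, which fits the pattern of many small range reads against the column-stats log files. One knob we have not tried yet is enlarging that pool. A rough sketch of the extra Spark confs (the property names are the standard EMRFS/s3a settings; the value 1000 is illustrative, not a verified fix):

```python
# Illustrative only: extra --conf flags to enlarge the S3 HTTP connection
# pool on EMR. Property names are the standard EMRFS (fs.s3.*) and s3a
# settings; the value 1000 is a guess, not a verified fix.
POOL_TUNING = {
    "spark.hadoop.fs.s3.maxConnections": "1000",       # EMRFS connection pool
    "spark.hadoop.fs.s3a.connection.maximum": "1000",  # s3a equivalent
}

spark_submit_args = " ".join(f"--conf {k}={v}" for k, v in POOL_TUNING.items())
# These would be appended to the spark-submit command that launches the job.
```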
Here are all our Hudi configs:

```
hoodie.table.version -> 5
hoodie.datasource.write.hive_style_partitioning -> True
hoodie.datasource.hive_sync.enable -> True
hoodie.datasource.hive_sync.auto_create_database -> True
hoodie.datasource.hive_sync.skip_ro_suffix -> True
hoodie.parquet.small.file.limit -> 104857600
hoodie.parquet.max.file.size -> 125829120
hoodie.compact.inline.trigger.strategy -> NUM_OR_TIME
hoodie.compact.inline.max.delta.commits -> 3
hoodie.compact.inline.max.delta.seconds -> 600
hoodie.parquet.compression.codec -> snappy
hoodie.clean.automatic -> True
hoodie.index.type -> BLOOM
hoodie.bloom.index.use.metadata -> True
hoodie.metadata.enable -> True
hoodie.metadata.index.bloom.filter.enable -> True
hoodie.metadata.index.column.stats.enable -> True
hoodie.keep.max.commits -> 50
hoodie.archive.automatic -> True
hoodie.archive.beyond.savepoint -> True
hoodie.metrics.on -> True
hoodie.metadata.metrics.enable -> True
hoodie.metrics.executor.enable -> True
hoodie.metrics.reporter.type -> GRAPHITE
hoodie.metrics.graphite.host -> <Graphite-IP>
hoodie.metrics.graphite.port -> 2003
hoodie.metrics.graphite.report.period.seconds -> 30
hoodie.metrics.graphite.metric.prefix -> test_prefix_demo_mor
hoodie.cleaner.policy.failed.writes -> LAZY
hoodie.write.concurrency.mode -> OPTIMISTIC_CONCURRENCY_CONTROL
hoodie.write.lock.provider -> org.apache.hudi.client.transaction.lock.InProcessLockProvider
hoodie.metrics.lock.enable -> True
hoodie.clean.async -> True
hoodie.archive.async -> True
hoodie.metadata.index.async -> False
hoodie.metadata.clean.async -> False
hoodie.cleaner.policy -> KEEP_LATEST_BY_HOURS
hoodie.cleaner.hours.retained -> 1
hoodie.datasource.write.table.name -> table_0
hoodie.table.name -> table_0
hoodie.datasource.write.table.type -> MERGE_ON_READ
hoodie.datasource.write.recordkey.field -> id
hoodie.datasource.write.partitionpath.field -> _year
hoodie.datasource.write.precombine.field -> _kafka_offset
hoodie.datasource.hive_sync.table -> table_0
hoodie.datasource.hive_sync.database -> db-name
hoodie.base.path -> s3://bucket-name/db-name/table_0
hoodie.copyonwrite.record.size.estimate -> 218
hoodie.parquet.block.size -> 125829120
hoodie.parquet.page.size -> 104876
hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.datasource.hive_sync.partition_extractor_class -> org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.write.streaming.checkpoint.identifier -> table_0_0
path -> s3://bucket-name/db-name/table_0
checkpointLocation -> s3://bucket-name/db-name/table_0/checkpoints/
```

**To Reproduce**

Steps to reproduce the behavior:

1. Start incremental upserts to an MOR table on S3, using the above configs on the mentioned versions of EMR + Hudi + Spark

**Expected behavior**

The writer continues to run as usual, without getting stuck.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.4.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Stacktrace**

```
23/12/27 10:54:56 INFO S3NativeFileSystem: Opening 's3://ecom-data-lake-raw/test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.51_1-6493-25610' for reading
23/12/27 10:55:00 INFO NetworkClient: [AdminClient clientId=adminclient-1] Node 6 disconnected.
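For reference, the writer is wired up roughly like this. A minimal sketch, not the full job: only a subset of the options above is repeated here, and the Kafka-sourced streaming DataFrame `df` is assumed.

```python
# Minimal sketch of the streaming upsert writer (assumptions: PySpark with
# the Hudi bundle on the classpath; `df` is the Kafka-sourced streaming
# DataFrame). Option keys/values are copied from the config dump above.
HUDI_OPTIONS = {
    "hoodie.table.name": "table_0",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "_year",
    "hoodie.datasource.write.precombine.field": "_kafka_offset",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
}

def start_writer(df):
    """Start the micro-batch upsert stream (5-minute trigger, as in the report)."""
    return (
        df.writeStream.format("hudi")
        .options(**HUDI_OPTIONS)
        .option("checkpointLocation", "s3://bucket-name/db-name/table_0/checkpoints/")
        .outputMode("append")
        .trigger(processingTime="5 minutes")
        .start("s3://bucket-name/db-name/table_0")
    )
```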
23/12/27 10:55:00 INFO MultipartUploadOutputStream: close closed:false s3://ecom-data-lake-raw/test_run_demo_7/pickup_manifests/.hoodie/.heartbeat/20231227102000654
23/12/27 10:55:06 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.51_1-6493-25610; Range: null Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:55:16 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.51_1-6493-25610; Range: [11825, 11836] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:55:16 INFO S3NativeFileSystem: Opening 's3://ecom-data-lake-raw/test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.1_1-4901-17248' for reading
23/12/27 10:55:26 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.1_1-4901-17248; Range: [7739, 11846] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:55:36 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.1_1-4901-17248; Range: [6233, 11846] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:55:46 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.1_1-4901-17248; Range: [5292, 5324] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:55:56 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.1_1-4901-17248; Range: [5325, 6265] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:55:56 INFO S3NativeFileSystem: Opening 's3://ecom-data-lake-raw/test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.2_1-4947-17505' for reading
23/12/27 10:56:00 INFO MultipartUploadOutputStream: close closed:false s3://ecom-data-lake-raw/test_run_demo_7/pickup_manifests/.hoodie/.heartbeat/20231227102000654
23/12/27 10:56:06 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.2_1-4947-17505; Range: [7738, 11845] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:56:16 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.2_1-4947-17505; Range: [6232, 11845] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:56:26 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.2_1-4947-17505; Range: [5292, 5324] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:56:36 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.2_1-4947-17505; Range: [5325, 6264] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
