ergophobiac opened a new issue, #10415: URL: https://github.com/apache/hudi/issues/10415
**Describe the problem you faced**

Stack: Hudi 0.13.1, EMR 6.13.0, Spark 3.4.1

We are writing to an MOR table on S3 using a Spark Structured Streaming job on EMR. Once the job has run for a while (12+ hours), we notice that, at random times, there are long periods with no operations/commits on the timeline. Usually the offending executor is removed and the task is retried, but eventually the job stalls and the application exits with fatal errors on the driver.

We are using deployment model B: single writer with async table services, plus OCC for the metadata table. We also tried the DynamoDB-based lock provider (in case OCC was the culprit), and the same thing happened.

Sometimes a deltacommit, clean, or compaction goes inflight but is never completed, and the Spark driver stderr reports the same error over and over (see the stacktrace below). The timeout always occurs while trying to read column stats.

Meanwhile, the timeline shows no progress: the deltacommit that went inflight at 10:20:01 AM is stuck, whereas all prior deltacommits finished within 45 seconds. The micro-batch interval is 5 minutes, and the Spark Web UI shows no active tasks.

In some instances the tasks eventually succeed and the job progresses, until it gets stuck again with the same errors.
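For context on the error itself: `Timeout waiting for connection from pool` is thrown by the (shaded) AWS SDK when the S3 client's HTTP connection pool is exhausted, which fits the pattern of many small range reads against the column-stats log files. One knob we have not tried yet is enlarging that pool. A rough sketch of the extra Spark confs (the property names are the standard EMRFS/s3a settings; the value 1000 is illustrative, not a verified fix):

```python
# Illustrative only: extra --conf flags to enlarge the S3 HTTP connection
# pool on EMR. Property names are the standard EMRFS (fs.s3.*) and s3a
# settings; the value 1000 is a guess, not a verified fix.
POOL_TUNING = {
    "spark.hadoop.fs.s3.maxConnections": "1000",       # EMRFS connection pool
    "spark.hadoop.fs.s3a.connection.maximum": "1000",  # s3a equivalent
}

spark_submit_args = " ".join(f"--conf {k}={v}" for k, v in POOL_TUNING.items())
# These would be appended to the spark-submit command that launches the job.
```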
Here are all our Hudi configs:

```
hoodie.table.version -> 5
hoodie.datasource.write.hive_style_partitioning -> True
hoodie.datasource.hive_sync.enable -> True
hoodie.datasource.hive_sync.auto_create_database -> True
hoodie.datasource.hive_sync.skip_ro_suffix -> True
hoodie.parquet.small.file.limit -> 104857600
hoodie.parquet.max.file.size -> 125829120
hoodie.compact.inline.trigger.strategy -> NUM_OR_TIME
hoodie.compact.inline.max.delta.commits -> 3
hoodie.compact.inline.max.delta.seconds -> 600
hoodie.parquet.compression.codec -> snappy
hoodie.clean.automatic -> True
hoodie.index.type -> BLOOM
hoodie.bloom.index.use.metadata -> True
hoodie.metadata.enable -> True
hoodie.metadata.index.bloom.filter.enable -> True
hoodie.metadata.index.column.stats.enable -> True
hoodie.keep.max.commits -> 50
hoodie.archive.automatic -> True
hoodie.archive.beyond.savepoint -> True
hoodie.metrics.on -> True
hoodie.metadata.metrics.enable -> True
hoodie.metrics.executor.enable -> True
hoodie.metrics.reporter.type -> GRAPHITE
hoodie.metrics.graphite.host -> <Graphite-IP>
hoodie.metrics.graphite.port -> 2003
hoodie.metrics.graphite.report.period.seconds -> 30
hoodie.metrics.graphite.metric.prefix -> test_prefix_demo_mor
hoodie.cleaner.policy.failed.writes -> LAZY
hoodie.write.concurrency.mode -> OPTIMISTIC_CONCURRENCY_CONTROL
hoodie.write.lock.provider -> org.apache.hudi.client.transaction.lock.InProcessLockProvider
hoodie.metrics.lock.enable -> True
hoodie.clean.async -> True
hoodie.archive.async -> True
hoodie.metadata.index.async -> False
hoodie.metadata.clean.async -> False
hoodie.cleaner.policy -> KEEP_LATEST_BY_HOURS
hoodie.cleaner.hours.retained -> 1
hoodie.datasource.write.table.name -> table_0
hoodie.table.name -> table_0
hoodie.datasource.write.table.type -> MERGE_ON_READ
hoodie.datasource.write.recordkey.field -> id
hoodie.datasource.write.partitionpath.field -> _year
hoodie.datasource.write.precombine.field -> _kafka_offset
hoodie.datasource.hive_sync.table -> table_0
hoodie.datasource.hive_sync.database -> db-name
hoodie.base.path -> s3://bucket-name/db-name/table_0
hoodie.copyonwrite.record.size.estimate -> 218
hoodie.parquet.block.size -> 125829120
hoodie.parquet.page.size -> 104876
hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.datasource.hive_sync.partition_extractor_class -> org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.write.streaming.checkpoint.identifier -> table_0_0
path -> s3://bucket-name/db-name/table_0
checkpointLocation -> s3://bucket-name/db-name/table_0/checkpoints/
```

**To Reproduce**

Steps to reproduce the behavior:

1. Start incremental upserts to an MOR table on S3, using the above configs on the mentioned versions of EMR + Hudi + Spark

**Expected behavior**

The writer continues to run as usual, without getting stuck.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.4.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Stacktrace**

```
23/12/27 10:54:56 INFO S3NativeFileSystem: Opening 's3://ecom-data-lake-raw/test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.51_1-6493-25610' for reading
23/12/27 10:55:00 INFO NetworkClient: [AdminClient clientId=adminclient-1] Node 6 disconnected.
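For reference, the writer is wired up roughly like this. A minimal sketch, not the full job: only a subset of the options above is repeated here, and the Kafka-sourced streaming DataFrame `df` is assumed.

```python
# Minimal sketch of the streaming upsert writer (assumptions: PySpark with
# the Hudi bundle on the classpath; `df` is the Kafka-sourced streaming
# DataFrame). Option keys/values are copied from the config dump above.
HUDI_OPTIONS = {
    "hoodie.table.name": "table_0",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "_year",
    "hoodie.datasource.write.precombine.field": "_kafka_offset",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
}

def start_writer(df):
    """Start the micro-batch upsert stream (5-minute trigger, as in the report)."""
    return (
        df.writeStream.format("hudi")
        .options(**HUDI_OPTIONS)
        .option("checkpointLocation", "s3://bucket-name/db-name/table_0/checkpoints/")
        .outputMode("append")
        .trigger(processingTime="5 minutes")
        .start("s3://bucket-name/db-name/table_0")
    )
```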
23/12/27 10:55:00 INFO MultipartUploadOutputStream: close closed:false s3://ecom-data-lake-raw/test_run_demo_7/pickup_manifests/.hoodie/.heartbeat/20231227102000654
23/12/27 10:55:06 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.51_1-6493-25610; Range: null Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:55:16 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.51_1-6493-25610; Range: [11825, 11836] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:55:16 INFO S3NativeFileSystem: Opening 's3://ecom-data-lake-raw/test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.1_1-4901-17248' for reading
23/12/27 10:55:26 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.1_1-4901-17248; Range: [7739, 11846] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:55:36 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.1_1-4901-17248; Range: [6233, 11846] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:55:46 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.1_1-4901-17248; Range: [5292, 5324] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:55:56 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.1_1-4901-17248; Range: [5325, 6265] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:55:56 INFO S3NativeFileSystem: Opening 's3://ecom-data-lake-raw/test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.2_1-4947-17505' for reading
23/12/27 10:56:00 INFO MultipartUploadOutputStream: close closed:false s3://ecom-data-lake-raw/test_run_demo_7/pickup_manifests/.hoodie/.heartbeat/20231227102000654
23/12/27 10:56:06 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.2_1-4947-17505; Range: [7738, 11845] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:56:16 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.2_1-4947-17505; Range: [6232, 11845] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:56:26 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.2_1-4947-17505; Range: [5292, 5324] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
23/12/27 10:56:36 INFO GlobalS3Executor: ReadTimeout File: test_run_demo_7/pickup_manifests/.hoodie/metadata/column_stats/.col-stats-0001_20231227075500652001.log.2_1-4947-17505; Range: [5325, 6264] Use default timeout configuration to retry for read timeout
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
