dataproblems opened a new issue, #12116:
URL: https://github.com/apache/hudi/issues/12116
**Describe the problem you faced**
I am unable to create a Hudi table from my data when
POPULATE_META_FIELDS is enabled. I can create the table when
POPULATE_META_FIELDS is set to false.
**To Reproduce**
Steps to reproduce the behavior:
1. `val data = spark.read.parquet("...")`
2. `data.write.format("hudi").options(HudiOptions).save("")`
There are a total of 68 billion unique record keys and my total dataset is
around 5TB.
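For a sense of scale, the figures above imply a small average record size and a large number of base files. The following is a rough back-of-the-envelope sketch, assuming that "5TB" means 5 * 10^12 bytes and using the 2147483648-byte max Parquet file size from the bulk insert options in this report. The object and value names are illustrative only.

```scala
// Back-of-the-envelope sizing from the numbers in this report.
// Assumption: "5TB" means 5 * 10^12 bytes; the real on-disk size may differ.
object DatasetScale {
  val totalBytes: Long = 5L * 1000L * 1000L * 1000L * 1000L // ~5 TB
  val recordKeys: Long = 68L * 1000L * 1000L * 1000L        // 68 billion
  val maxFileBytes: Long = 2147483648L                      // 2 GiB

  // Average bytes per record in the source data.
  val avgBytesPerRecord: Double = totalBytes.toDouble / recordKeys

  // Lower bound on the file count if every file were filled to the maximum.
  val minFileCount: Long = (totalBytes + maxFileBytes - 1) / maxFileBytes

  def main(args: Array[String]): Unit = {
    println(f"average bytes per record: $avgBytesPerRecord%.1f")
    println(s"minimum number of 2 GiB files: $minFileCount")
  }
}
```

Under these assumptions the average comes out to roughly 73.5 bytes per record, and at least 2329 files would be needed, so each file would hold on the order of tens of millions of records.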
**Expected behavior**
I should be able to create the table without any exceptions.
**Environment Description**
* Hudi version : 0.15.0, 1.0.0-beta1, 1.0.0-beta2
* Spark version : 3.3, 3.4
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Additional context**
Spark submit command:
```
spark-submit --master yarn --deploy-mode client --conf
"spark.driver.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2
-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" --conf
"spark.executor.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2
-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" --packages
org.apache.hudi:hudi-spark3.3-bundle_2.12:0.15.0,org.apache.hudi:hudi-aws:0.15.0
--class someClassName --conf
spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
--conf
spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
--conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar --conf
spark.executor.heartbeatInterval=900s --conf spark.network.timeout=1000s --conf
spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true
--conf spark.driver.maxResultSize=0 someJarFile
```
Hudi Bulk Insert Options:
I've tried the GLOBAL_SORT, PARTITION_SORT, and NONE sort modes; all result in the same error.
```
val BulkWriteOptions: Map[String, String] = Map(
  // Configuration for bulk insert
  DataSourceWriteOptions.OPERATION.key() -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL,
  // Table type
  DataSourceWriteOptions.TABLE_TYPE.key() -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
  HoodieStorageConfig.PARQUET_COMPRESSION_CODEC_NAME.key() -> "snappy",
  HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key() -> "2147483648",
  "hoodie.parquet.small.file.limit" -> "1073741824",
  HoodieTableConfig.POPULATE_META_FIELDS.key() -> "true",
  HoodieWriteConfig.BULK_INSERT_SORT_MODE.key() -> BulkInsertSortMode.NONE.name(),
  HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key() -> "true",
  HoodieIndexConfig.INDEX_TYPE.key() -> "RECORD_INDEX",
  "hoodie.metadata.record.index.enable" -> "true",
  "hoodie.metadata.enable" -> "true",
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.clustering.inline" -> "true",
  "hoodie.clustering.plan.strategy.target.file.max.bytes" -> "2147483648",
  "hoodie.clustering.plan.strategy.small.file.limit" -> "1073741824"
)
```
**Stacktrace**
This is the stage that fails:
```
[save at DatasetBulkInsertCommitActionExecutor.java:81](https://p-3bp2pob2ivree-shs.emrappui-prod.us-east-1.amazonaws.com/shs/history/application_1729032267977_0010/stages/stage/?id=19&attempt=0)
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
org.apache.hudi.commit.DatasetBulkInsertCommitActionExecutor.doExecute(DatasetBulkInsertCommitActionExecutor.java:81)
org.apache.hudi.commit.BaseDatasetBulkInsertCommitActionExecutor.execute(BaseDatasetBulkInsertCommitActionExecutor.java:101)
org.apache.hudi.HoodieSparkSqlWriterInternal.bulkInsertAsRow(HoodieSparkSqlWriter.scala:924)
org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:466)
org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:114)
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:139)
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:139)
```
The executors have these errors:
```
24/10/16 20:13:38 ERROR BulkInsertDataInternalWriterHelper: Global error thrown while trying to write records in HoodieRowCreateHandle
org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file somePartition=PartitionName/some_parquet_file_name.parquet.marker.CREATE
Connect to ip-10-0-160-126.ec2.internal:45651 [ip-10-0-160-126.ec2.internal/10.0.160.126] failed: Connection timed out (Connection timed out)
    at org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:201) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.create(TimelineServerBasedWriteMarkers.java:157) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.marker.WriteMarkers.create(WriteMarkers.java:67) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.io.storage.row.HoodieRowCreateHandle.createMarkerFile(HoodieRowCreateHandle.java:281) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.io.storage.row.HoodieRowCreateHandle.<init>(HoodieRowCreateHandle.java:144) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.action.commit.BulkInsertDataInternalWriterHelper.createHandle(BulkInsertDataInternalWriterHelper.java:217) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.action.commit.BulkInsertDataInternalWriterHelper.getRowCreateHandle(BulkInsertDataInternalWriterHelper.java:203) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.action.commit.BulkInsertDataInternalWriterHelper.write(BulkInsertDataInternalWriterHelper.java:125) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.spark3.internal.HoodieBulkInsertDataInternalWriter.write(HoodieBulkInsertDataInternalWriter.java:62) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.spark3.internal.HoodieBulkInsertDataInternalWriter.write(HoodieBulkInsertDataInternalWriter.java:38) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:442) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1550) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:480) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
    at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:381) ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
    at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_422]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_422]
    at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_422]
Caused by: org.apache.hudi.org.apache.http.conn.HttpHostConnectException: Connect to ip-10-0-160-126.ec2.internal:45651 [ip-10-0-160-126.ec2.internal/10.0.160.126] failed: Connection timed out (Connection timed out)
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:151) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeRequestToTimelineServer(TimelineServerBasedWriteMarkers.java:247) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:198) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    ... 21 more
Caused by: java.net.ConnectException: Connection timed out (Connection timed out)
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_422]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_422]
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_422]
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_422]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_422]
    at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_422]
    at org.apache.hudi.org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:74) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:134) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:151) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeRequestToTimelineServer(TimelineServerBasedWriteMarkers.java:247) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:198) ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
    ... 21 more
```
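The failing frames are all in `TimelineServerBasedWriteMarkers`: the executors time out while asking the timeline server at `ip-10-0-160-126.ec2.internal:45651` to create marker files. One experiment that might help narrow this down is to bypass the timeline server for markers by setting `hoodie.write.markers.type` to `DIRECT`. The sketch below shows that override only and is not a verified fix. `withDirectMarkers` is a hypothetical helper, and the small base map stands in for the `BulkWriteOptions` map above.

```scala
object MarkerOverride {
  // Hudi write config that selects the marker mechanism.
  val MarkersTypeKey = "hoodie.write.markers.type"

  // Hypothetical helper: returns a copy of the options that uses direct markers.
  def withDirectMarkers(options: Map[String, String]): Map[String, String] =
    options + (MarkersTypeKey -> "DIRECT")

  def main(args: Array[String]): Unit = {
    // Stand-in for the full option map used in this report.
    val base = Map(
      "hoodie.datasource.write.operation" -> "bulk_insert",
      "hoodie.populate.meta.fields" -> "true"
    )
    withDirectMarkers(base).toSeq.sorted.foreach { case (k, v) => println(s"$k=$v") }
  }
}
```

Direct markers write one marker file per data file straight to storage, which can be slow on S3 when there are many files, so this is more of a diagnostic step than a recommendation.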
The driver logs show exit code 137, and the stdout logs show an OOM
(Java heap space).
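Exit code 137 conventionally means the process was terminated by signal 9 (SIGKILL), because exit codes above 128 are reported as 128 plus the signal number. On YARN this often means the container or the OS killed the JVM for memory reasons, which would be consistent with the heap-space OOM, although that link is an inference rather than something the logs state. A minimal sketch of the decoding, with `signalFromExitCode` as an illustrative name:

```scala
object ExitCodes {
  // By shell convention, exit codes above 128 encode "128 + signal number".
  def signalFromExitCode(exitCode: Int): Option[Int] =
    if (exitCode > 128 && exitCode < 256) Some(exitCode - 128) else None

  def main(args: Array[String]): Unit = {
    // 137 - 128 = 9, which is SIGKILL on Linux.
    println(signalFromExitCode(137)) // prints Some(9)
    println(signalFromExitCode(1))   // prints None
  }
}
```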