dataproblems opened a new issue, #12116:
URL: https://github.com/apache/hudi/issues/12116

   **Describe the problem you faced**
   
   I am unable to create a Hudi table from my data when POPULATE_META_FIELDS 
is enabled. The same write succeeds with POPULATE_META_FIELDS set to false.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. `val data = spark.read.parquet("...")`
   2. `data.write.format("hudi").options(HudiOptions).save("")`
   
   The dataset has roughly 68 billion unique record keys and is around 5 TB 
in total.
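
   For context, a fuller sketch of the failing write. This is a sketch only: the bucket path, table name, record key, and partition field below are placeholders, since the real values aren't shared here.

   ```scala
   import org.apache.spark.sql.SparkSession

   // Placeholder names throughout; only the overall shape matches the real job.
   val spark = SparkSession.builder().appName("hudi-bulk-insert-repro").getOrCreate()

   // ~5 TB of input with ~68 billion unique record keys
   val data = spark.read.parquet("s3://some-bucket/input/")

   data.write
     .format("hudi")
     .options(BulkWriteOptions) // the bulk insert options listed under "Additional context"
     .option("hoodie.table.name", "some_table")
     .option("hoodie.datasource.write.recordkey.field", "some_key_col")
     .option("hoodie.datasource.write.partitionpath.field", "somePartition")
     .save("s3://some-bucket/hudi/some_table/")
   ```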
   
   **Expected behavior**
   
   I should be able to create the table without any exceptions.
   
   **Environment Description**
   
   * Hudi version : 0.15.0, 1.0.0-beta1, 1.0.0-beta2
   
   * Spark version : 3.3, 3.4
   
   * Hive version :
   
   * Hadoop version : 
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Spark submit command: 
   
   ```
   spark-submit --master yarn --deploy-mode client --conf 
"spark.driver.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 
-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 
-XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" --conf 
"spark.executor.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 
-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 
-XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" --packages 
org.apache.hudi:hudi-spark3.3-bundle_2.12:0.15.0,org.apache.hudi:hudi-aws:0.15.0
 --class someClassName --conf 
spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog 
--conf 
spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension 
--conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar --conf 
spark.executor.heartbeatInterval=900s --conf spark.network.timeout=1000s --conf 
spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true 
--conf spark.driver.maxResultSize=0 someJarFile
   ```
   
   Hudi bulk insert options: 
   
   I've tried the GLOBAL_SORT, PARTITION_SORT, and NONE sort modes; all result in the same error. 
   ```
   val BulkWriteOptions: Map[String, String] = Map(
       DataSourceWriteOptions.OPERATION.key() -> 
DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL, // Configuration for bulk 
insert
       DataSourceWriteOptions.TABLE_TYPE.key() -> 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, // Table type
       HoodieStorageConfig.PARQUET_COMPRESSION_CODEC_NAME.key() -> "snappy", 
       HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key() -> "2147483648", 
       "hoodie.parquet.small.file.limit" -> "1073741824",
       HoodieTableConfig.POPULATE_META_FIELDS.key() -> "true", 
       HoodieWriteConfig.BULK_INSERT_SORT_MODE.key() -> BulkInsertSortMode.NONE.name(), 
       HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key() -> "true", 
       HoodieIndexConfig.INDEX_TYPE.key() -> "RECORD_INDEX", 
       "hoodie.metadata.record.index.enable" -> "true", 
       "hoodie.metadata.enable" -> "true", 
       "hoodie.datasource.write.hive_style_partitioning" -> "true", 
       "hoodie.clustering.inline" -> "true", 
       "hoodie.clustering.plan.strategy.target.file.max.bytes" -> "2147483648",
       "hoodie.clustering.plan.strategy.small.file.limit" -> "1073741824"
     )
   ```
   
   **Stacktrace**
   
   This is the stage that fails: 
   
    ```
    save at DatasetBulkInsertCommitActionExecutor.java:81
   
   org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
   
org.apache.hudi.commit.DatasetBulkInsertCommitActionExecutor.doExecute(DatasetBulkInsertCommitActionExecutor.java:81)
   
org.apache.hudi.commit.BaseDatasetBulkInsertCommitActionExecutor.execute(BaseDatasetBulkInsertCommitActionExecutor.java:101)
   
org.apache.hudi.HoodieSparkSqlWriterInternal.bulkInsertAsRow(HoodieSparkSqlWriter.scala:924)
   
org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:466)
   
org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
   org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
   org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
   
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
   
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
   
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
   
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
   
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
   
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
   
org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:114)
   
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:139)
   
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
   
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:139)
   ```
   
   The executors log the following errors: 
   
   ```
   24/10/16 20:13:38 ERROR BulkInsertDataInternalWriterHelper: Global error 
thrown while trying to write records in HoodieRowCreateHandle 
   org.apache.hudi.exception.HoodieRemoteException: Failed to create marker 
file somePartition=PartitionName/some_parquet_file_name.parquet.marker.CREATE
   Connect to ip-10-0-160-126.ec2.internal:45651 
[ip-10-0-160-126.ec2.internal/10.0.160.126] failed: Connection timed out 
(Connection timed out)
        at 
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:201)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.create(TimelineServerBasedWriteMarkers.java:157)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.table.marker.WriteMarkers.create(WriteMarkers.java:67) 
~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.io.storage.row.HoodieRowCreateHandle.createMarkerFile(HoodieRowCreateHandle.java:281)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.io.storage.row.HoodieRowCreateHandle.<init>(HoodieRowCreateHandle.java:144)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.table.action.commit.BulkInsertDataInternalWriterHelper.createHandle(BulkInsertDataInternalWriterHelper.java:217)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.table.action.commit.BulkInsertDataInternalWriterHelper.getRowCreateHandle(BulkInsertDataInternalWriterHelper.java:203)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.table.action.commit.BulkInsertDataInternalWriterHelper.write(BulkInsertDataInternalWriterHelper.java:125)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.spark3.internal.HoodieBulkInsertDataInternalWriter.write(HoodieBulkInsertDataInternalWriter.java:62)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.spark3.internal.HoodieBulkInsertDataInternalWriter.write(HoodieBulkInsertDataInternalWriter.java:38)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:442)
 ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
        at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1550)
 ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
        at 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:480)
 ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
        at 
org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:381)
 ~[spark-sql_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) 
~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
        at org.apache.spark.scheduler.Task.run(Task.scala:138) 
~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
 ~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) 
~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
        at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) 
~[spark-core_2.12-3.3.2-amzn-0.jar:3.3.2-amzn-0]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_422]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_422]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_422]
   Caused by: org.apache.hudi.org.apache.http.conn.HttpHostConnectException: 
Connect to ip-10-0-160-126.ec2.internal:45651 
[ip-10-0-160-126.ec2.internal/10.0.160.126] failed: Connection timed out 
(Connection timed out)
        at 
org.apache.hudi.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:151) 
~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeRequestToTimelineServer(TimelineServerBasedWriteMarkers.java:247)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:198)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        ... 21 more
   Caused by: java.net.ConnectException: Connection timed out (Connection timed 
out)
        at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_422]
        at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) 
~[?:1.8.0_422]
        at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
 ~[?:1.8.0_422]
        at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) 
~[?:1.8.0_422]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) 
~[?:1.8.0_422]
        at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_422]
        at 
org.apache.hudi.org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:74)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:134)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:151) 
~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeRequestToTimelineServer(TimelineServerBasedWriteMarkers.java:247)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        at 
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:198)
 ~[org.apache.hudi_hudi-spark3.3-bundle_2.12-0.15.0.jar:0.15.0]
        ... 21 more
   ```
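
   For what it's worth, the failing call goes through the embedded timeline server on the driver (`TimelineServerBasedWriteMarkers`), which fits the driver OOM noted below: if the driver process dies, marker requests from the executors time out. One diagnostic I have not tried in this run (so treat it as an assumption, not a confirmed fix) would be switching to direct markers, so each executor writes marker files straight to S3 instead of calling the driver-side server:

   ```scala
   // Hypothetical diagnostic (not tried in this run): bypass the embedded
   // timeline server for marker creation. With DIRECT markers, executors
   // write marker files directly to storage rather than via HTTP to the driver.
   val markerDebugOptions: Map[String, String] = Map(
     "hoodie.write.markers.type" -> "DIRECT"
   )
   ```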
   
   The driver logs show exit code 137, and the stdout logs show `OOM: Java 
Heap Space`. 
   
   

