noahtaite opened a new issue, #9067: URL: https://github.com/apache/hudi/issues/9067
**Describe the problem you faced**

I'm trying to do a Hive sync to the AWS Glue metastore for a table with 3 levels of partitioning and 150k partitions, using hudi-sync-tool from the master node of a long-running EMR cluster, and I'm running into some issues.

**Background**

Very large (~100TB) MOR Hudi table in S3 with 3 levels of partitioning (datasource, year, month). The prod table is bulk-inserted and upserted daily by Spark on EMR 6.10 clusters, and the Glue table is synced during the Spark job, which doesn't seem to add much overhead to the application. The table was copied to a new location using S3 bucket replication, so a new Glue table needs to be created and synced for the new location.

**Problem**

After some research, it appears the supported tool for this is hudi-sync-tool, which I have been running on an EMR 6.6.0 cluster with Hudi 0.10.1 installed.

The following sync works for smaller tables (<10TB, ~2k partitions):

```
./hudi-sync-tool --partitioned-by datasource --skip-ro-suffix --conditional-sync \
  --base-path s3://bucket/table.all_hudi/ --database test_hudi --table table_all \
  --sync-mode hms --partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor
```

The following sync hangs for my large table (~100TB, 150k partitions). Eventually the process fails.
```
./hudi-sync-tool --partitioned-by datasource,year,month --skip-ro-suffix --conditional-sync \
  --base-path s3://bucket/lake/bigtable.all_hudi/ --database test_hudi --table bigtable_all \
  --sync-mode hms --partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor
```

I attempted to use JDBC mode (which leverages batching), but it fails with an error about partition values containing the reserved value `__HIVE_DEFAULT_PARTITION__`:

```
./hudi-sync-tool --partitioned-by datasource,year,month --skip-ro-suffix --conditional-sync \
  --base-path s3://bucket/lake/bigtable.all_hudi/ --database test_hudi --table bigtable_all \
  --partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor \
  --jdbc-url jdbc:hive2://<master-ip>.ec2.internal:10000 --user <user> --pass <pass> \
  --batch-sync-num 5000
```

Stacktrace: https://gist.github.com/noahtaite/856182fb867f22e85e06dd27bbfb73a0

Any advice for manually syncing large, heavily partitioned tables to Glue? Or can we try running AwsGlueCatalogSyncTool manually instead?

**To Reproduce**

Steps to reproduce the behavior (EMR 6.6.0, Hudi 0.10.1):

1. Create a Hudi table with Hive-style partitioning and a nullable partition field.
2. Write data with the partition field = null; the data is written to a `__HIVE_DEFAULT_PARTITION__` folder.
3. Attempt a Hive sync using the CLI.
4. The sync fails with the error in the stacktrace below.

**Expected behavior**

Hive sync completes in batches with no issues, preferably using the native Glue client/mode.

**Environment Description**

* Hudi version : 0.10.1
* Spark version : 3.2.0
* Hive version : 3.1.2
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
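To make the failure mode concrete, here is an illustrative Python sketch (not Hudi's or Hive's actual code; all helper names are hypothetical). Hudi substitutes a default folder name for null partition values when building the partition path, and Hive's DDL semantic analyzer rejects `ADD PARTITION` statements whose values contain that reserved string, which is why the JDBC-mode sync errors out:

```python
# Sketch (hypothetical helpers) of why a null partition value breaks a JDBC-mode sync.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"  # Hive's reserved value

def partition_path(values):
    """Build a hive-style partition path, substituting the default name for nulls."""
    fields = ["datasource", "year", "month"]
    return "/".join(
        f"{f}={v if v is not None else HIVE_DEFAULT_PARTITION}"
        for f, v in zip(fields, values)
    )

def validate_partition_values(values):
    """Mimic the reserved-substring check that ALTER TABLE ... ADD PARTITION hits."""
    for v in values:
        if HIVE_DEFAULT_PARTITION in v:
            raise ValueError(
                f"SemanticException [Error 10111]: Partition value contains a "
                f"reserved substring (User value: {v})"
            )

path = partition_path(["ds1", None, "07"])  # null 'year' partition value
print(path)  # datasource=ds1/year=__HIVE_DEFAULT_PARTITION__/month=07

values = [seg.split("=", 1)[1] for seg in path.split("/")]
try:
    validate_partition_values(values)  # a JDBC-mode sync would fail at this point
except ValueError as err:
    print(err)
```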
**Stacktrace**

```
at org.apache.hudi.hive.ddl.JDBCExecutor.runSQL(JDBCExecutor.java:70)
at org.apache.hudi.hive.ddl.QueryBasedDDLExecutor.lambda$addPartitionsToTable$0(QueryBasedDDLExecutor.java:124)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.hudi.hive.ddl.QueryBasedDDLExecutor.addPartitionsToTable(QueryBasedDDLExecutor.java:124)
at org.apache.hudi.hive.HoodieHiveSyncClient.addPartitionsToTable(HoodieHiveSyncClient.java:109)
at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:385)
... 4 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10111]: Partition value contains a reserved substring (User value: __HIVE_DEFAULT_PARTITION__ Reserved substring: __HIVE_DEFAULT_PARTITION__)
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:300)
at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:286)
at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:324)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:265)
at org.apache.hudi.hive.ddl.JDBCExecutor.runSQL(JDBCExecutor.java:68)
... 10 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10111]: Partition value contains a reserved substring (User value: __HIVE_DEFAULT_PARTITION__ Reserved substring: __HIVE_DEFAULT_PARTITION__)
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:348)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:198)
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:261)
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:260)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:549)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:535)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy43.executeStatementAsync(Unknown Source)
at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:318)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:576)
at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1550)
at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1530)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:313)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: Partition value contains a reserved substring (User value: __HIVE_DEFAULT_PARTITION__ Reserved substring: __HIVE_DEFAULT_PARTITION__)
at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.validatePartitionValues(DDLSemanticAnalyzer.java:3959)
at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableAddParts(DDLSemanticAnalyzer.java:3500)
at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:326)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:294)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:675)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1872)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1819)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1814)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:126)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:196)
... 27 more
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
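One client-side mitigation that follows from the stacktrace (an illustrative sketch only, with hypothetical helper names; this is not a Hudi API) is to drop the reserved default partition before building `ADD PARTITION` batches, mirroring what `--batch-sync-num` does with the remaining partitions:

```python
# Sketch of a possible pre-filter for JDBC-style batched partition sync.
# Hive rejects any partition value containing this reserved string, so skip
# such partitions before batching; helper names here are hypothetical.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def syncable_batches(partition_paths, batch_size=5000):
    """Drop partitions Hive would reject, then yield batches of the rest."""
    ok = [p for p in partition_paths if HIVE_DEFAULT_PARTITION not in p]
    for i in range(0, len(ok), batch_size):
        yield ok[i:i + batch_size]

paths = [
    "datasource=ds1/year=2023/month=06",
    "datasource=ds1/year=__HIVE_DEFAULT_PARTITION__/month=06",
    "datasource=ds2/year=2023/month=07",
]
for batch in syncable_batches(paths, batch_size=2):
    print(batch)  # the reserved partition is filtered out before batching
```

The trade-off is that rows written under the default partition folder stay invisible to Hive/Glue queries until handled some other way (e.g. rewriting them with a non-null partition value).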
