noahtaite opened a new issue, #9067: URL: https://github.com/apache/hudi/issues/9067
**Describe the problem you faced**

I'm trying to do a Hive sync to the AWS Glue metastore for a table with 3 levels of partitioning and 150k partitions, using hudi-sync-tool from the master node of a long-running EMR cluster, and I'm running into some issues.

**Background**

Very large (~100TB) MOR Hudi table in S3 with 3 levels of partitioning (datasource, year, month). The prod table is bulk-inserted and upserted daily by Spark on EMR 6.10 clusters, and the Glue table is synced during the Spark job, which doesn't seem to add much overhead to the application. The table was copied to a new location using S3 bucket replication, so a new Glue table needs to be created and synced for the new location.

**Problem**

After some research, it appears the supported tool for this is hudi-sync-tool, which I have been running on an EMR 6.6.0 cluster with Hudi 0.10.1 installed.

The following sync works for smaller tables (<10TB, ~2k partitions):

```
./hudi-sync-tool --partitioned-by datasource --skip-ro-suffix --conditional-sync \
  --base-path s3://bucket/table.all_hudi/ --database test_hudi --table table_all \
  --sync-mode hms --partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor
```

The following sync hangs for my large table (~100TB, 150k partitions). Eventually the process fails.
```
./hudi-sync-tool --partitioned-by datasource,year,month --skip-ro-suffix --conditional-sync \
  --base-path s3://bucket/lake/bigtable.all_hudi/ --database test_hudi --table bigtable_all \
  --sync-mode hms --partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor
```

I attempted to use JDBC mode (which leverages batching), but it fails with an error about partition values containing the reserved value `__HIVE_DEFAULT_PARTITION__`:

```
./hudi-sync-tool --partitioned-by datasource,year,month --skip-ro-suffix --conditional-sync \
  --base-path s3://bucket/lake/bigtable.all_hudi/ --database test_hudi --table bigtable_all \
  --partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor \
  --jdbc-url jdbc:hive2://<master-ip>.ec2.internal:10000 --user <user> --pass <pass> \
  --batch-sync-num 5000
```

Stacktrace: https://gist.github.com/noahtaite/856182fb867f22e85e06dd27bbfb73a0

Any advice for manually syncing large, heavily partitioned tables to Glue? Or can we try running AwsGlueCatalogSyncTool manually instead?

**To Reproduce**

Steps to reproduce the behavior (EMR 6.6.0, Hudi 0.10.1):

1. Create a Hudi table with Hive-style partitioning and a nullable partition field.
2. Write data with the partition field = null; the data is written to a `__HIVE_DEFAULT_PARTITION__` folder.
3. Attempt a Hive sync using the CLI.
4. The sync fails with the error in the stacktrace below.

**Expected behavior**

Hive sync completes in batches with no issues, preferably using the native Glue client/mode.

**Environment Description**

* Hudi version : 0.10.1
* Spark version : 3.2.0
* Hive version : 3.1.2
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
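To make the failure mode concrete, here is an illustrative Python sketch (not Hudi's or Hive's actual code; all helper names are hypothetical). Hudi substitutes a default folder name for null partition values when building the partition path, and Hive's DDL semantic analyzer rejects `ADD PARTITION` statements whose values contain that reserved string, which is why the JDBC-mode sync errors out:

```python
# Sketch (hypothetical helpers) of why a null partition value breaks a JDBC-mode sync.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"  # Hive's reserved value

def partition_path(values):
    """Build a hive-style partition path, substituting the default name for nulls."""
    fields = ["datasource", "year", "month"]
    return "/".join(
        f"{f}={v if v is not None else HIVE_DEFAULT_PARTITION}"
        for f, v in zip(fields, values)
    )

def validate_partition_values(values):
    """Mimic the reserved-substring check that ALTER TABLE ... ADD PARTITION hits."""
    for v in values:
        if HIVE_DEFAULT_PARTITION in v:
            raise ValueError(
                f"SemanticException [Error 10111]: Partition value contains a "
                f"reserved substring (User value: {v})"
            )

path = partition_path(["ds1", None, "07"])  # null 'year' partition value
print(path)  # datasource=ds1/year=__HIVE_DEFAULT_PARTITION__/month=07

values = [seg.split("=", 1)[1] for seg in path.split("/")]
try:
    validate_partition_values(values)  # a JDBC-mode sync would fail at this point
except ValueError as err:
    print(err)
```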
**Stacktrace**

```
at org.apache.hudi.hive.ddl.JDBCExecutor.runSQL(JDBCExecutor.java:70)
at org.apache.hudi.hive.ddl.QueryBasedDDLExecutor.lambda$addPartitionsToTable$0(QueryBasedDDLExecutor.java:124)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.hudi.hive.ddl.QueryBasedDDLExecutor.addPartitionsToTable(QueryBasedDDLExecutor.java:124)
at org.apache.hudi.hive.HoodieHiveSyncClient.addPartitionsToTable(HoodieHiveSyncClient.java:109)
at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:385)
... 4 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10111]: Partition value contains a reserved substring (User value: __HIVE_DEFAULT_PARTITION__ Reserved substring: __HIVE_DEFAULT_PARTITION__)
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:300)
at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:286)
at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:324)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:265)
at org.apache.hudi.hive.ddl.JDBCExecutor.runSQL(JDBCExecutor.java:68)
... 10 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10111]: Partition value contains a reserved substring (User value: __HIVE_DEFAULT_PARTITION__ Reserved substring: __HIVE_DEFAULT_PARTITION__)
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:348)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:198)
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:261)
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:260)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:549)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:535)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy43.executeStatementAsync(Unknown Source)
at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:318)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:576)
at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1550)
at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1530)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:313)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: Partition value contains a reserved substring (User value: __HIVE_DEFAULT_PARTITION__ Reserved substring: __HIVE_DEFAULT_PARTITION__)
at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.validatePartitionValues(DDLSemanticAnalyzer.java:3959)
at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableAddParts(DDLSemanticAnalyzer.java:3500)
at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:326)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:294)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:675)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1872)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1819)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1814)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:126)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:196)
... 27 more
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
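One client-side mitigation that follows from the stacktrace (an illustrative sketch only, with hypothetical helper names; this is not a Hudi API) is to drop the reserved default partition before building `ADD PARTITION` batches, mirroring what `--batch-sync-num` does with the remaining partitions:

```python
# Sketch of a possible pre-filter for JDBC-style batched partition sync.
# Hive rejects any partition value containing this reserved string, so skip
# such partitions before batching; helper names here are hypothetical.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def syncable_batches(partition_paths, batch_size=5000):
    """Drop partitions Hive would reject, then yield batches of the rest."""
    ok = [p for p in partition_paths if HIVE_DEFAULT_PARTITION not in p]
    for i in range(0, len(ok), batch_size):
        yield ok[i:i + batch_size]

paths = [
    "datasource=ds1/year=2023/month=06",
    "datasource=ds1/year=__HIVE_DEFAULT_PARTITION__/month=06",
    "datasource=ds2/year=2023/month=07",
]
for batch in syncable_batches(paths, batch_size=2):
    print(batch)  # the reserved partition is filtered out before batching
```

The trade-off is that rows written under the default partition folder stay invisible to Hive/Glue queries until handled some other way (e.g. rewriting them with a non-null partition value).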
