yihua opened a new pull request, #6851:
URL: https://github.com/apache/hudi/pull/6851

   ### Change Logs
   
   This PR fixes the issue reported in #6281.
   
   For Deltastreamer, meta sync fails when `TimestampBasedKeyGenerator` is used with a custom output date format (`hoodie.deltastreamer.keygen.timebased.output.dateformat`) that contains slashes, e.g., `yyyy/MM/dd`, and hive-style partitioning is disabled (the default).  The relevant key generator configs are:
   
   ```
   --hoodie-conf hoodie.datasource.write.partitionpath.field=createdDate
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
   --hoodie-conf hoodie.deltastreamer.keygen.timebased.timezone=GMT
   --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
   --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
   ```
   Hive Sync exception:
   ```
   Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table test_table
   ...
   Caused by: org.apache.hudi.hive.HoodieHiveSyncException: default.test_table add partition failed
   ...
   Caused by: MetaException(message:Invalid partition key & values; keys [createddate, ], values [2022, 10, 02, ])
       at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$add_partitions_req_result$add_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
   ...
   ```
   Glue Sync exception:
   ```
   Exception in thread "main" org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
   ...
   Caused by: org.apache.hudi.aws.sync.HoodieGlueSyncException: Fail to add partitions to default.test_table
        at org.apache.hudi.aws.sync.AWSGlueCatalogSyncClient.addPartitionsToTable(AWSGlueCatalogSyncClient.java:147)
   ...
   Caused by: org.apache.hudi.com.amazonaws.services.glue.model.InvalidInputException: The number of partition keys do not match the number of partition values (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: <>; Proxy: null)
        at org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1819)
   ...
   ```
   
   The exception is thrown because the partition values for meta sync are not extracted correctly.  Currently, `hoodie.datasource.hive_sync.partition_extractor_class` determines the partition extractor to use, and in this case `MultiPartKeysValueExtractor` is inferred.  The root cause is that this extractor splits the partition path on slashes, i.e., `2022/10/02` -> `[2022, 10, 02]`, instead of treating it as a single value, even though there is only one partition column.  In general, extraction fails whenever the user specifies an output date format that contains slashes.
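
To make the mismatch concrete, here is a minimal standalone sketch. It is not Hudi code: the class `SlashSplitSketch` and its methods are hypothetical, and the slash split is only a simplified model of the behavior described above. It formats an epoch-millisecond timestamp with the GMT / `yyyy/MM/dd` settings from the configs, then splits the result on slashes, yielding three values for a single partition key.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Hudi.
class SlashSplitSketch {

    // Formats epoch milliseconds as a partition path in GMT using yyyy/MM/dd,
    // mirroring the key generator configs listed in this description.
    static String formatPartitionPath(long epochMillis) {
        return DateTimeFormatter.ofPattern("yyyy/MM/dd")
                .withZone(ZoneId.of("GMT"))
                .format(Instant.ofEpochMilli(epochMillis));
    }

    // Simplified model of slash-based extraction: every path segment
    // becomes its own partition value.
    static List<String> splitOnSlashes(String partitionPath) {
        return Arrays.asList(partitionPath.split("/"));
    }

    public static void main(String[] args) {
        List<String> partitionKeys = Arrays.asList("createddate");

        // 1664668800000 ms is 2022-10-02T00:00:00Z.
        String path = formatPartitionPath(1664668800000L);
        List<String> values = splitOnSlashes(path);

        System.out.println(path);   // 2022/10/02
        System.out.println(values); // [2022, 10, 02]
        // One key but three values: the metastore rejects the partition.
        System.out.println(partitionKeys.size() + " key(s), " + values.size() + " value(s)");
    }
}
```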
   
   This PR fixes the problem by introducing a new partition extractor, `SinglePartPartitionValueExtractor`, which treats the partition value as a whole when there is only a single partition column, instead of relying on `MultiPartKeysValueExtractor`.  The slash (`/`) is replaced by a dash (`-`), because slashes are encoded by default, which makes the partition values inconvenient to query.
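
The idea can be sketched as follows. This is an assumption-laden illustration, not the code in this PR: the class `SinglePartExtractorSketch` and its method name are hypothetical, and the actual `SinglePartPartitionValueExtractor` may differ in its interface and edge-case handling.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not the extractor added by this PR.
class SinglePartExtractorSketch {

    // Treats the whole partition path as one value and replaces slashes
    // with dashes so the value does not need to be encoded.
    static List<String> extractPartitionValues(String partitionPath) {
        return Collections.singletonList(partitionPath.replace('/', '-'));
    }

    public static void main(String[] args) {
        // A single value now matches the single partition column.
        System.out.println(extractPartitionValues("2022/10/02")); // [2022-10-02]
    }
}
```

With one partition column and one extracted value, the key and value counts agree, which is the condition both the Hive and Glue errors complain about.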
   
   ### Impact
   
   **Risk level: low**
   
   The fix is tested locally with Hive sync and on EMR with Glue sync.  Before this fix, the meta sync fails; after the fix, it succeeds.  The correct partitions are shown by beeline (with `show partitions test_table;`) for Hive and in the Glue web UI for Glue Data Catalog.
   
   ### Documentation Update
   
   We need to improve the docs for meta sync with `TimestampBasedKeyGenerator`.  The docs update is tracked in HUDI-4967.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.