[jira] [Commented] (HUDI-5648) Deltastreamer Transformer output cannot be used in partitioning

David Palmer (Jira) Sun, 29 Jan 2023 18:48:06 -0800


    [ https://issues.apache.org/jira/browse/HUDI-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17681851#comment-17681851 ]


David Palmer commented on HUDI-5648:
------------------------------------

Related to this, is there another way to configure DeltaStreamer to create 
the desired /<year>/<month>/<day>/<hour>/ partitioning scheme from an input 
EPOCHMILLISECONDS column on the data stream? I cannot use a date format of 
"yyyy/MM/dd/hh", because this breaks when the 
hoodie.datasource.write.hive_style_partitioning configuration option is set. 
You end up with a folder structure of:

 
{noformat}
/path/to/table/timestamp=2023/01/30/15{noformat}
which is not correct.
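
To make the difference concrete, here is a minimal sketch in plain Python (not Hudi code) of the two layouts I would expect from a single epoch-milliseconds value. The function name partition_path and its hive_style flag are hypothetical, and UTC is an assumption. Note also that Spark's from_unixtime expects seconds, so a millisecond column would likely need dividing by 1000 first.

```python
from datetime import datetime, timezone


def partition_path(epoch_millis: int, hive_style: bool = False) -> str:
    """Build a year/month/day/hour path from epoch milliseconds.

    Hypothetical helper for illustration only; assumes UTC.
    """
    # Convert milliseconds to whole seconds before building the timestamp.
    ts = datetime.fromtimestamp(epoch_millis // 1000, tz=timezone.utc)
    fields = [
        ("year", f"{ts.year:04d}"),
        ("month", f"{ts.month:02d}"),
        ("day", f"{ts.day:02d}"),
        ("hour", f"{ts.hour:02d}"),  # 24-hour clock
    ]
    if hive_style:
        # Hive style: every path segment carries its own field name.
        return "/".join(f"{name}={value}" for name, value in fields)
    return "/".join(value for _, value in fields)


# 1675090800000 ms is 2023-01-30 15:00:00 UTC.
print(partition_path(1675090800000))
print(partition_path(1675090800000, hive_style=True))
```

With hive-style partitioning I would expect each segment to be named (year=2023/month=01/day=30/hour=15), rather than only the first segment being prefixed as in the output above.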

 

> Deltastreamer Transformer output cannot be used in partitioning 
> ----------------------------------------------------------------
>
>                 Key: HUDI-5648
>                 URL: https://issues.apache.org/jira/browse/HUDI-5648
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: David Palmer
>            Priority: Minor
>
> When using a Deltastreamer Transformer, the output of the Transformer cannot 
> be used as values for partitioning. This is an issue if the user wants to use 
> something like the SqlQueryBasedTransformer or a custom transformer to 
> generate a partition field from another field in the incoming record.
> In a test, I used the following configs:
> {noformat}
> hoodie.deltastreamer.transformer.sql=SELECT a.*, from_unixtime(timestamp, 
> 'yyyy') as year, from_unixtime(timestamp, 'MM') as month, 
> from_unixtime(timestamp, 'dd') as day, from_unixtime(timestamp, 'HH') as hour 
> FROM <SRC> a
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
> hoodie.datasource.write.partitionpath.field=year,month,day,hour
>  {noformat}
> What I expect to happen is that the data files in the output DFS are 
> laid out like this:
> {noformat}
> /path/to/dfs/table/<year>/<month>/<day>/<hour>/
> eg:
> s3://test-bucket/table/2023/01/30/15/{noformat}
> However, I instead get the following structure:
> {noformat}
> /path/to/dfs/table/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/{noformat}
> I would expect the output of Transformers to be available for partitioning 
> just like any other column in the dataset.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
