[
https://issues.apache.org/jira/browse/HUDI-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17681851#comment-17681851
]
David Palmer commented on HUDI-5648:
------------------------------------
Related to this, is there some other way to configure DeltaStreamer to create
the desired /<year>/<month>/<day>/<hour>/ partitioning scheme from an input
EPOCHMILLISECONDS column on the data stream? I cannot use a date format of
"yyyy/MM/dd/HH", because this would break when setting the
hoodie.datasource.write.hive_style_partitioning configuration option. You end
up with a folder structure of:
{noformat}
/path/to/table/timestamp=2023/01/30/15{noformat}
which is not correct.
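To make the difference concrete, here is a minimal Python sketch of the two layouts. It is not Hudi code: the helper names partition_path and single_field_hive_path are hypothetical, timestamps are assumed to be UTC epoch milliseconds, and the "name=value per field" layout is my assumption about what a correct Hive-style result would look like.

```python
from datetime import datetime, timezone

def _to_utc(epoch_ms: int) -> datetime:
    # Epoch milliseconds -> timezone-aware UTC datetime.
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)

def partition_path(epoch_ms: int, hive_style: bool = False) -> str:
    # One path segment per derived field (year, month, day, hour).
    ts = _to_utc(epoch_ms)
    parts = [
        ("year", f"{ts.year:04d}"),
        ("month", f"{ts.month:02d}"),
        ("day", f"{ts.day:02d}"),
        ("hour", f"{ts.hour:02d}"),
    ]
    if hive_style:
        # Assumed Hive-style layout: every segment is prefixed with its field name.
        return "/".join(f"{name}={value}" for name, value in parts)
    return "/".join(value for _, value in parts)

def single_field_hive_path(epoch_ms: int, field: str = "timestamp") -> str:
    # Mimics the layout reported above: a single partition field formatted
    # with a slash-separated date pattern, so only the first segment is prefixed.
    return f"{field}=" + _to_utc(epoch_ms).strftime("%Y/%m/%d/%H")

if __name__ == "__main__":
    sample_ms = 1675090800000  # 2023-01-30 15:00:00 UTC
    print(partition_path(sample_ms))
    print(partition_path(sample_ms, hive_style=True))
    print(single_field_hive_path(sample_ms))
```

The third call reproduces the timestamp=2023/01/30/15 shape shown above, while the first two show the per-field layouts I am trying to get.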
> Deltastreamer Transformer output cannot be used in partitioning
> ----------------------------------------------------------------
>
> Key: HUDI-5648
> URL: https://issues.apache.org/jira/browse/HUDI-5648
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: David Palmer
> Priority: Minor
>
> When using a Deltastreamer Transformer, the output of the Transformer cannot
> be used as values for partitioning. This is an issue if the user wants to use
> something like the SqlQueryBasedTransformer or a custom transformer to
> generate a partition field from another field in the incoming record.
> In a test, I used the following configs:
> {noformat}
> hoodie.deltastreamer.transformer.sql=SELECT a.*, from_unixtime(timestamp,
> 'yyyy') as year, from_unixtime(timestamp, 'MM') as month,
> from_unixtime(timestamp, 'dd') as day, from_unixtime(timestamp, 'HH') as hour
> FROM <SRC> a
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
> hoodie.datasource.write.partitionpath.field=year,month,day,hour
> {noformat}
> What I expect to happen is that the data files in the output DFS are
> formatted like this:
> {noformat}
> /path/to/dfs/table/<year>/<month>/<day>/<hour>/
> eg:
> s3://test-bucket/table/2023/01/30/15/{noformat}
> However instead I get the following structure:
> {noformat}
> /path/to/dfs/table/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/{noformat}
> I would expect the output of Transformers to be available for partitioning
> just like any other column in the dataset.
>