[
https://issues.apache.org/jira/browse/HUDI-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Palmer updated HUDI-5648:
-------------------------------
Description:
When using a Deltastreamer Transformer, columns produced by the Transformer
cannot be used as partition path values. This is a problem when a user wants to
derive a partition field from another field in the incoming record, for example
with the SqlQueryBasedTransformer or a custom transformer.
In a test, I used the following configs:
{noformat}
hoodie.deltastreamer.transformer.sql=SELECT a.*, from_unixtime(timestamp, 'yyyy') as year, from_unixtime(timestamp, 'MM') as month, from_unixtime(timestamp, 'dd') as day, from_unixtime(timestamp, 'HH') as hour FROM <SRC> a
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.datasource.write.partitionpath.field=year,month,day,hour
{noformat}
I expect the data files in the output DFS to be laid out like this:
{noformat}
/path/to/dfs/table/<year>/<month>/<day>/<hour>/
e.g.:
s3://test-bucket/table/2023/01/30/15/
{noformat}
However, I instead get the following structure:
{noformat}
/path/to/dfs/table/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/
{noformat}
I would expect the output of Transformers to be available for partitioning just
like any other column in the dataset.
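To make the expected versus observed layout concrete, here is a minimal Python sketch of the behaviour described above. It is not Hudi code: the function names are hypothetical, the timestamp is interpreted in UTC (Spark's from_unixtime uses the session time zone), and the placeholder substitution for missing values is an assumption inferred from the observed output.

```python
from datetime import datetime, timezone

# Placeholder that appears in the observed output paths.
DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"
PARTITION_FIELDS = ["year", "month", "day", "hour"]


def derive_partition_columns(record):
    """Hypothetical stand-in for the SQL transformer: add year/month/day/hour
    columns derived from the record's unix `timestamp` (UTC assumed)."""
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return {
        **record,
        "year": ts.strftime("%Y"),
        "month": ts.strftime("%m"),
        "day": ts.strftime("%d"),
        "hour": ts.strftime("%H"),
    }


def partition_path(record, fields):
    """Simplified model of a multi-field partition path: join the field values
    with '/', substituting the placeholder when a value is missing or empty
    (assumed behaviour, based on the paths seen in this report)."""
    parts = []
    for field in fields:
        value = record.get(field)
        parts.append(DEFAULT_PARTITION if value in (None, "") else str(value))
    return "/".join(parts)


incoming = {"id": 1, "timestamp": 1675090800}  # 2023-01-30 15:00:00 UTC

# Expected: the key generator sees the transformed record.
print(partition_path(derive_partition_columns(incoming), PARTITION_FIELDS))

# Observed: as if the key generator saw the record without the new columns.
print(partition_path(incoming, PARTITION_FIELDS))
```

In this model, the first call yields the expected 2023/01/30/15 layout, while the second yields four placeholder segments, which matches the structure I am seeing.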
was:
When using a Deltastreamer Transformer, the output of the Transformer cannot be
used as values for partitioning.
In a test, I used the following configs:
{noformat}
hoodie.deltastreamer.transformer.sql=SELECT a.*, from_unixtime(timestamp, 'yyyy') as year, from_unixtime(timestamp, 'MM') as month, from_unixtime(timestamp, 'dd') as day, from_unixtime(timestamp, 'HH') as hour FROM <SRC> a
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.datasource.write.partitionpath.field=year,month,day,hour
{noformat}
What I expect to happen is that the data files in the output DFS are formatted
like this:
{noformat}
/path/to/dfs/table/<year>/<month>/<day>/<hour>/
e.g.:
s3://test-bucket/table/2023/01/30/15/
{noformat}
However instead I get the following structure:
{noformat}
/path/to/dfs/table/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/
{noformat}
I would expect the output of Transformers to be available for partitioning just
like any other column in the dataset.
> Deltastreamer Transformer output cannot be used in partitioning
> ----------------------------------------------------------------
>
> Key: HUDI-5648
> URL: https://issues.apache.org/jira/browse/HUDI-5648
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: David Palmer
> Priority: Minor
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)