[jira] [Updated] (HUDI-5648) Deltastreamer Transformer output cannot be used in partitioning

David Palmer (Jira) Sun, 29 Jan 2023 18:35:05 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


David Palmer updated HUDI-5648:
-------------------------------
    Description: 
When using a Deltastreamer Transformer, the output of the Transformer cannot be 
used as values for partitioning. This is an issue if the user wants to use 
something like the SqlQueryBasedTransformer or a custom transformer to generate 
a partition field from another field in the incoming record.

In a test, I used the following configs:
{noformat}
hoodie.deltastreamer.transformer.sql=SELECT a.*, from_unixtime(timestamp, 
'yyyy') as year, from_unixtime(timestamp, 'MM') as month, 
from_unixtime(timestamp, 'dd') as day, from_unixtime(timestamp, 'HH') as hour 
FROM <SRC> a

hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.datasource.write.partitionpath.field=year,month,day,hour
{noformat}
I expect the data files in the output DFS to be laid out like this:
{noformat}
/path/to/dfs/table/<year>/<month>/<day>/<hour>/
e.g.:
s3://test-bucket/table/2023/01/30/15/
{noformat}
Instead, I get the following structure:
{noformat}
/path/to/dfs/table/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/{noformat}
I would expect the output of Transformers to be available for partitioning just 
like any other column in the dataset.
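
As a rough illustration of the expected versus observed layouts, the Python sketch below mimics the four from_unixtime(...) columns and joins the partition fields into a path. It is not Hudi code: the function names, the sample epoch value, and the UTC time zone are assumptions made for the example, and the placeholder fallback only models the output reported above (a record in which the derived columns are absent), not a confirmed root cause.

```python
from datetime import datetime, timezone

# Placeholder copied from the observed output in this report.
DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"
PARTITION_FIELDS = ["year", "month", "day", "hour"]


def derive_partition_fields(epoch_seconds):
    """Mimic the four from_unixtime(...) columns (assumes a UTC session time zone)."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return {
        "year": ts.strftime("%Y"),
        "month": ts.strftime("%m"),
        "day": ts.strftime("%d"),
        "hour": ts.strftime("%H"),
    }


def partition_path(record, fields=PARTITION_FIELDS):
    """Join partition values with '/', using the placeholder for missing or empty ones."""
    return "/".join(str(record.get(f) or DEFAULT_PARTITION) for f in fields)


# Hypothetical source record; 1675090800 is 2023-01-30 15:00:00 UTC.
source_record = {"timestamp": 1675090800}
transformed = {**source_record, **derive_partition_fields(source_record["timestamp"])}

expected = partition_path(transformed)    # derived columns visible to partitioning
observed = partition_path(source_record)  # derived columns not visible
```

With the derived columns present, the path matches the expected year/month/day/hour layout; without them, every level falls back to the placeholder, which is the structure reported here.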

 

  was:
When using a Deltastreamer Transformer, the output of the Transformer cannot be 
used as values for partitioning.

In a test, I used the following configs:

 
{noformat}
hoodie.deltastreamer.transformer.sql=SELECT a.*, from_unixtime(timestamp, 
'yyyy') as year, from_unixtime(timestamp, 'MM') as month, 
from_unixtime(timestamp, 'dd') as day, from_unixtime(timestamp, 'HH') as hour 
FROM <SRC> a

hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.datasource.write.partitionpath.field=year,month,day,hour
 {noformat}
What I expect to happen is that the data files in the output DFS are formatted 
like this:
{noformat}
/path/to/dfs/table/<year>/<month>/<day>/<hour>/
eg:
s3://test-bucket/table/2023/01/30/15/{noformat}
However instead I get the following structure:
{noformat}
/path/to/dfs/table/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/{noformat}
I would expect the output of Transformers to be available for partitioning just 
like any other column in the dataset.

 


> Deltastreamer Transformer output cannot be used in partitioning 
> ----------------------------------------------------------------
>
>                 Key: HUDI-5648
>                 URL: https://issues.apache.org/jira/browse/HUDI-5648
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: David Palmer
>            Priority: Minor
>
> When using a Deltastreamer Transformer, the output of the Transformer cannot 
> be used as values for partitioning. This is an issue if the user wants to use 
> something like the SqlQueryBasedTransformer or a custom transformer to 
> generate a partition field from another field in the incoming record.
> In a test, I used the following configs:
> {noformat}
> hoodie.deltastreamer.transformer.sql=SELECT a.*, from_unixtime(timestamp, 
> 'yyyy') as year, from_unixtime(timestamp, 'MM') as month, 
> from_unixtime(timestamp, 'dd') as day, from_unixtime(timestamp, 'HH') as hour 
> FROM <SRC> a
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
> hoodie.datasource.write.partitionpath.field=year,month,day,hour
>  {noformat}
> I expect the data files in the output DFS to be laid out like this:
> {noformat}
> /path/to/dfs/table/<year>/<month>/<day>/<hour>/
> eg:
> s3://test-bucket/table/2023/01/30/15/{noformat}
> Instead, I get the following structure:
> {noformat}
> /path/to/dfs/table/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/{noformat}
> I would expect the output of Transformers to be available for partitioning 
> just like any other column in the dataset.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
