hudi-bot opened a new issue, #15733:
URL: https://github.com/apache/hudi/issues/15733
When using a DeltaStreamer Transformer, the output of the Transformer cannot
be used as values for partitioning. This is an issue if the user wants to use
something like the SqlQueryBasedTransformer or a custom transformer to generate
a partition field from another field in the incoming record.
In a test, I used the following configs:
{noformat}
hoodie.deltastreamer.transformer.sql=SELECT a.*, from_unixtime(timestamp,
'yyyy') as year, from_unixtime(timestamp, 'MM') as month,
from_unixtime(timestamp, 'dd') as day, from_unixtime(timestamp, 'HH') as hour
FROM <SRC> a
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.datasource.write.partitionpath.field=year,month,day,hour
{noformat}
I expect the data files on the output DFS to be laid out
like this:
{noformat}
/path/to/dfs/table/<year>/<month>/<day>/<hour>/
e.g.:
s3://test-bucket/table/2023/01/30/15/
{noformat}
However, I instead get the following structure:
{noformat}
/path/to/dfs/table/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/__HIVE_DEFAULT_PARTITION__/
{noformat}
I would expect the output of Transformers to be available for partitioning
just like any other column in the dataset.
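For illustration, the expected mapping from a record timestamp to a partition
path can be sketched in plain Python. This is only a sketch and not Hudi code:
expected_partition_path is a hypothetical helper, and it assumes the timestamp
column holds epoch seconds and that the Spark session time zone is UTC
(from_unixtime formats in the session time zone).

```python
from datetime import datetime, timezone


def expected_partition_path(epoch_seconds: int) -> str:
    """Hypothetical helper (not a Hudi API).

    Mirrors the four from_unixtime(...) columns from the transformer SQL
    ('yyyy', 'MM', 'dd', 'HH') and joins them in the order given by
    hoodie.datasource.write.partitionpath.field=year,month,day,hour.
    Assumes epoch seconds and a UTC session time zone.
    """
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    year = ts.strftime("%Y")   # 'yyyy'
    month = ts.strftime("%m")  # 'MM'
    day = ts.strftime("%d")    # 'dd'
    hour = ts.strftime("%H")   # 'HH' (24-hour clock)
    return "/".join([year, month, day, hour])


# 1675090800 is 2023-01-30 15:00:00 UTC
print(expected_partition_path(1675090800))  # prints 2023/01/30/15
```

With the configs above, this is the relative path I would expect under the
table base path, instead of the four __HIVE_DEFAULT_PARTITION__ segments.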
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-5648
- Type: Bug
---
## Comments
30/Jan/23 02:47;cloventt;Related to this, is there some other way to
configure DeltaStreamer to create the desired /<year>/<month>/<day>/<hour>/
partitioning scheme from an input EPOCHMILLISECONDS column on the data stream?
I cannot use a date format of "yyyy/MM/dd/hh", because this would break when
setting the hoodie.datasource.write.hive_style_partitioning configuration
option. You end up with a folder structure of:
{noformat}
/path/to/table/timestamp=2023/01/30/15
{noformat}
which is not correct.
;;;
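To make the difference in the comment above concrete, here is a small Python
sketch contrasting the two layouts. Both functions are hypothetical helpers,
not Hudi APIs. The sketch assumes the input is epoch milliseconds in UTC, and
it assumes that the layout the commenter wants under hive-style partitioning
is one name=value segment per partition field.

```python
from datetime import datetime, timezone


def _to_utc(epoch_millis: int) -> datetime:
    # Integer division avoids float rounding; sub-second precision is
    # irrelevant for hour-level partitions.
    return datetime.fromtimestamp(epoch_millis // 1000, tz=timezone.utc)


def single_field_hive_path(epoch_millis: int, field: str = "timestamp") -> str:
    """Hypothetical: one partition field formatted as 'yyyy/MM/dd/HH'.

    Only the first path segment carries the field name, which is the
    layout reported as incorrect in the comment.
    """
    return f"{field}=" + _to_utc(epoch_millis).strftime("%Y/%m/%d/%H")


def multi_field_hive_path(epoch_millis: int) -> str:
    """Hypothetical: four derived fields, each as its own name=value segment.

    Assumed to be the desired hive-style layout.
    """
    ts = _to_utc(epoch_millis)
    fields = [
        ("year", ts.strftime("%Y")),
        ("month", ts.strftime("%m")),
        ("day", ts.strftime("%d")),
        ("hour", ts.strftime("%H")),
    ]
    return "/".join(f"{name}={value}" for name, value in fields)


millis = 1675090800000  # 2023-01-30 15:00:00 UTC
print(single_field_hive_path(millis))  # prints timestamp=2023/01/30/15
print(multi_field_hive_path(millis))   # prints year=2023/month=01/day=30/hour=15
```

The second layout needs four separate partition columns, which is why the
transformer output has to be visible to the key generator.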