leobiscassi opened a new issue, #5485:
URL: https://github.com/apache/hudi/issues/5485
**Describe the problem you faced**
_Hudi DeltaStreamer_ doesn't recognize Hive-style date partitions (e.g.
`date=2022-01-01`) in my dataset. I'm wondering whether I'm missing some
configuration or doing something wrong.
**To Reproduce**
Steps to reproduce the behavior:
1. Run the following script to create sample data
```python
from pyspark.sql import SparkSession
from datetime import date
data = [
    {'date': date(2022, 1, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 1', 'email': '[email protected]'},
    {'date': date(2022, 1, 4), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 2', 'email': '[email protected]'},
    {'date': date(2022, 1, 3), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 3', 'email': '[email protected]'},
    {'date': date(2022, 2, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 4', 'email': '[email protected]'},
    {'date': date(2022, 3, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 5', 'email': '[email protected]'},
    {'date': date(2022, 5, 10), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 6', 'email': '[email protected]'},
    {'date': date(2022, 5, 1), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 7', 'email': '[email protected]'},
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df.write.partitionBy('date').parquet('sample-data')
```
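For reference, this is a minimal local check (no Spark required) of the Hive-style partition directory names that `partitionBy('date')` should produce from the sample dates above:

```python
from datetime import date

# The distinct 'date' values from the sample records above.
dates = [
    date(2022, 1, 5), date(2022, 1, 4), date(2022, 1, 3),
    date(2022, 2, 5), date(2022, 3, 5), date(2022, 5, 10),
    date(2022, 5, 1),
]

# Hive-style partitioning names each directory "<column>=<value>".
partitions = sorted(f"date={d.isoformat()}" for d in dates)
for p in partitions:
    print(p)
```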
2. Create an S3 bucket (e.g. `hudi-issue-raw-zone` in this example) with
server-side encryption (e.g. `SSE-S3`) and upload the `sample-data`
directory. Create a second bucket to simulate the standard zone (e.g.
`hudi-issue-standard-zone` in this example).
3. Create an EMR cluster with EMR release 6.5.0 (`hadoop 3.2.1`, `hive
3.1.2`, `spark 3.1.2`); in the `AWS Glue Data Catalog settings` section, check
the options `Use for Hive table metadata` and `Use for Spark table metadata`.
In this case I'm using a simple 3-node cluster, 1 master and 2 core nodes with
the instance type `m5.xlarge`, and a key pair to connect; the remaining
options are left at their default values. The region is `us-west-2`.
4. Run the following Hudi DeltaStreamer job:
```bash
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /usr/lib/spark/external/lib/spark-avro.jar \
  --master yarn \
  --deploy-mode client \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-table sample_data_complex \
  --target-base-path s3://hudi-issue-standard-zone/sample-data-complex/ \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-issue-raw-zone/sample-data/ \
  --hoodie-conf hoodie.datasource.write.recordkey.field=ts,email \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
```
This is the output stored on S3:
```bash
aws s3 ls s3://hudi-issue-standard-zone/sample-data-complex/
PRE .hoodie/
PRE date=default/
2022-05-02 19:56:20 0 .hoodie_$folder$
2022-05-02 19:56:59 0 date=default_$folder$
```
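As I understand it, `default` is the placeholder partition path that Hudi key generators fall back to when the configured partition field resolves to null or empty, which suggests the `date` field is not being read as expected. A hedged sketch of that fallback behavior (illustrative only, not the real Hudi implementation):

```python
# Assumed behavior, for illustration: Hudi substitutes a placeholder
# partition path (commonly "default") when the partition field value
# resolves to null or empty.
DEFAULT_PARTITION_PATH = "default"

def partition_path(value):
    """Mimic the null/empty fallback; not the actual Hudi code."""
    if value is None or value == "":
        return DEFAULT_PARTITION_PATH
    return f"date={value}"

print(partition_path(None))          # default
print(partition_path("2022-01-05"))  # date=2022-01-05
```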
I also tried the `CustomKeyGenerator`, which supports timestamp-based partitions:
```bash
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /usr/lib/spark/external/lib/spark-avro.jar \
  --master yarn \
  --deploy-mode client \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-table sample_data_custom \
  --target-base-path s3://hudi-issue-standard-zone/sample-data-custom/ \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-issue-raw-zone/sample-data/ \
  --hoodie-conf hoodie.datasource.write.recordkey.field=ts,email \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=date:timestamp \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd" \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat="yyyy-MM-dd" \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
```
This is the output stored on S3:
```bash
aws s3 ls s3://hudi-issue-standard-zone/sample-data-custom/
PRE .hoodie/
PRE date=1970-01-01/
2022-05-02 19:58:48 0 .hoodie_$folder$
2022-05-02 19:59:26 0 date=1970-01-01_$folder$
```
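For what it's worth, the `date=1970-01-01` path looks like an epoch artifact. One unconfirmed hypothesis, shown as a purely illustrative sketch: Spark/Parquet encode `DateType` as an int32 count of days since the Unix epoch, so if the key generator sees the raw encoded value rather than a `yyyy-MM-dd` string, a small integer misread on a second- or millisecond-scale timeline collapses to the epoch date:

```python
from datetime import date, datetime, timezone

epoch = date(1970, 1, 1)

# Spark/Parquet store DateType as days since 1970-01-01 (int32).
encoded = (date(2022, 1, 5) - epoch).days
print(encoded)  # 18997

# If that small integer were misread as a second-scale timestamp,
# the resulting date collapses to 1970-01-01 -- matching the
# partition path seen above.
misread = datetime.fromtimestamp(encoded, tz=timezone.utc).date()
print(misread)  # 1970-01-01
```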
**Expected behavior**
I would expect to have the following output:
```bash
aws s3 ls s3://hudi-issue-standard-zone/sample-data-complex/
PRE .hoodie/
PRE date=2022-01-03/
PRE date=2022-01-04/
PRE date=2022-01-05/
PRE date=2022-02-05/
PRE date=2022-03-05/
PRE date=2022-05-01/
PRE date=2022-05-10/
```
Or
```bash
aws s3 ls s3://hudi-issue-standard-zone/sample-data-custom/
PRE .hoodie/
PRE date=2022-01-03/
PRE date=2022-01-04/
PRE date=2022-01-05/
PRE date=2022-02-05/
PRE date=2022-03-05/
PRE date=2022-05-01/
PRE date=2022-05-10/
```
**Environment Description**
* Hudi version : 0.9.0-amzn-1
* Spark version : 3.1.2
* Hive version : 3.1.2
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
**Stacktrace**
There is no stack trace in this case, just unexpected partition paths in the output.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]