leobiscassi opened a new issue, #5485:
URL: https://github.com/apache/hudi/issues/5485
**Describe the problem you faced**
_Hudi DeltaStreamer_ doesn't recognize Hive-style date partitions (e.g.
`date=2022-01-01`) in my dataset. I'm wondering whether I'm missing some
configuration or doing something wrong.
**To Reproduce**
Steps to reproduce the behavior:
1. Run the following script to create sample data
```python
from pyspark.sql import SparkSession
from datetime import date
data = [
    {'date': date(2022, 1, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 1', 'email': '[email protected]'},
    {'date': date(2022, 1, 4), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 2', 'email': '[email protected]'},
    {'date': date(2022, 1, 3), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 3', 'email': '[email protected]'},
    {'date': date(2022, 2, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 4', 'email': '[email protected]'},
    {'date': date(2022, 3, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 5', 'email': '[email protected]'},
    {'date': date(2022, 5, 10), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 6', 'email': '[email protected]'},
    {'date': date(2022, 5, 1), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 7', 'email': '[email protected]'},
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df.write.partitionBy('date').parquet('sample-data')
```
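For reference, this is a minimal local check (no Spark required) of the Hive-style partition directory names that `partitionBy('date')` should produce from the sample dates above:

```python
from datetime import date

# The distinct 'date' values from the sample records above.
dates = [
    date(2022, 1, 5), date(2022, 1, 4), date(2022, 1, 3),
    date(2022, 2, 5), date(2022, 3, 5), date(2022, 5, 10),
    date(2022, 5, 1),
]

# Hive-style partitioning names each directory "<column>=<value>".
partitions = sorted(f"date={d.isoformat()}" for d in dates)
for p in partitions:
    print(p)
```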
2. Create an S3 bucket (e.g. `hudi-issue-raw-zone` in this example) with
server-side encryption (e.g. `SSE-S3`) and upload the `sample-data`
directory. Create a second bucket to simulate the standard zone (e.g.
`hudi-issue-standard-zone` in this example).
3. Create an EMR cluster with EMR release 6.5.0 (`hadoop 3.2.1`, `hive
3.1.2`, `spark 3.1.2`); in the `AWS Glue Data Catalog settings` section, check
the options `Use for Hive table metadata` and `Use for Spark table metadata`.
In this case I'm using a simple 3-node cluster, 1 master and 2 core nodes with
the instance type `m5.xlarge`, and a key pair to connect; the remaining
options are left at their default values. The region is `us-west-2`.
4. Run the following Hudi DeltaStreamer job:
```bash
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /usr/lib/spark/external/lib/spark-avro.jar \
  --master yarn \
  --deploy-mode client \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-table sample_data_complex \
  --target-base-path s3://hudi-issue-standard-zone/sample-data-complex/ \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-issue-raw-zone/sample-data/ \
  --hoodie-conf hoodie.datasource.write.recordkey.field=ts,email \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
```
This is the output stored on S3:
```bash
aws s3 ls s3://hudi-issue-standard-zone/sample-data-complex/
PRE .hoodie/
PRE date=default/
2022-05-02 19:56:20 0 .hoodie_$folder$
2022-05-02 19:56:59 0 date=default_$folder$
```
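As I understand it, `default` is the placeholder partition path that Hudi key generators fall back to when the configured partition field resolves to null or empty, which suggests the `date` field is not being read as expected. A hedged sketch of that fallback behavior (illustrative only, not the real Hudi implementation):

```python
# Assumed behavior, for illustration: Hudi substitutes a placeholder
# partition path (commonly "default") when the partition field value
# resolves to null or empty.
DEFAULT_PARTITION_PATH = "default"

def partition_path(value):
    """Mimic the null/empty fallback; not the actual Hudi code."""
    if value is None or value == "":
        return DEFAULT_PARTITION_PATH
    return f"date={value}"

print(partition_path(None))          # default
print(partition_path("2022-01-05"))  # date=2022-01-05
```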
I also tried the `CustomKeyGenerator`, which supports timestamp-based partitions:
```bash
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /usr/lib/spark/external/lib/spark-avro.jar \
  --master yarn \
  --deploy-mode client \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-table sample_data_custom \
  --target-base-path s3://hudi-issue-standard-zone/sample-data-custom/ \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-issue-raw-zone/sample-data/ \
  --hoodie-conf hoodie.datasource.write.recordkey.field=ts,email \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=date:timestamp \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd" \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat="yyyy-MM-dd" \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
```
This is the output stored on S3:
```bash
aws s3 ls s3://hudi-issue-standard-zone/sample-data-custom/
PRE .hoodie/
PRE date=1970-01-01/
2022-05-02 19:58:48 0 .hoodie_$folder$
2022-05-02 19:59:26 0 date=1970-01-01_$folder$
```
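For what it's worth, the `date=1970-01-01` path looks like an epoch artifact. One unconfirmed hypothesis, shown as a purely illustrative sketch: Spark/Parquet encode `DateType` as an int32 count of days since the Unix epoch, so if the key generator sees the raw encoded value rather than a `yyyy-MM-dd` string, a small integer misread on a second- or millisecond-scale timeline collapses to the epoch date:

```python
from datetime import date, datetime, timezone

epoch = date(1970, 1, 1)

# Spark/Parquet store DateType as days since 1970-01-01 (int32).
encoded = (date(2022, 1, 5) - epoch).days
print(encoded)  # 18997

# If that small integer were misread as a second-scale timestamp,
# the resulting date collapses to 1970-01-01 -- matching the
# partition path seen above.
misread = datetime.fromtimestamp(encoded, tz=timezone.utc).date()
print(misread)  # 1970-01-01
```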
**Expected behavior**
I would expect to have the following output:
```bash
aws s3 ls s3://hudi-issue-standard-zone/sample-data-complex/
PRE .hoodie/
PRE date=2022-01-03/
PRE date=2022-01-04/
PRE date=2022-01-05/
PRE date=2022-02-05/
PRE date=2022-03-05/
PRE date=2022-05-01/
PRE date=2022-05-10/
```
Or
```bash
aws s3 ls s3://hudi-issue-standard-zone/sample-data-custom/
PRE .hoodie/
PRE date=2022-01-03/
PRE date=2022-01-04/
PRE date=2022-01-05/
PRE date=2022-02-05/
PRE date=2022-03-05/
PRE date=2022-05-01/
PRE date=2022-05-10/
```
**Environment Description**
* Hudi version : 0.9.0-amzn-1
* Spark version : 3.1.2
* Hive version : 3.1.2
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
**Stacktrace**
There is no stack trace in this case, just unexpected partition paths in the output.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]