leobiscassi commented on issue #5485:
URL: https://github.com/apache/hudi/issues/5485#issuecomment-1116157734
Hi @yihua, thanks for the answer! About your points:

(1) Thank you, I didn't notice this possibility; these folders are annoying 😓

(2) (3)

> the parquet files you generated do not have the date field in the schema, i.e., when each individual parquet file is directly read, the date field is not there.

For Apache Spark the partitions are meant to be part of the schema, so every time we read `s3://hudi-issue-raw-zone/sample-data/` the schema of the DataFrame is the fields in the parquet files plus the partitions. What you are saying is that, regardless of the datatype/style of the partitions in the source dataset, they won't be considered fields, since the Hudi Delta Streamer just lists all the parquet files under the base path and reads them directly. Is that right to assume?

> Hudi 0.9.0-amzn-1 does not support date-typed partition field. The support is only added recently https://github.com/apache/hudi/pull/5432. However, you can still using String-typed partition field.

By this you mean that `hudi 0.9.0-amzn-1` doesn't support a `date`-typed partition field as the partition on the target, right? But the funny thing is: if I create the same sample data without partitioning, I can write it as a Hudi table without converting `date` to `string`, as in the following snippets:
1. Building the new sample data

```python
from datetime import date

from pyspark.sql import SparkSession

data = [
    {'date': date(2022, 1, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 1', 'email': '[email protected]'},
    {'date': date(2022, 1, 4), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 2', 'email': '[email protected]'},
    {'date': date(2022, 1, 3), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 3', 'email': '[email protected]'},
    {'date': date(2022, 2, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 4', 'email': '[email protected]'},
    {'date': date(2022, 3, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 5', 'email': '[email protected]'},
    {'date': date(2022, 5, 10), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 6', 'email': '[email protected]'},
    {'date': date(2022, 5, 1), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 7', 'email': '[email protected]'},
]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df.write.parquet('sample-data-without-partition')
```
2. Running the Hudi Delta Streamer job

```bash
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /usr/lib/spark/external/lib/spark-avro.jar \
  --master yarn \
  --deploy-mode client \
  --conf spark.sql.hive.convertMetastoreParquet=false /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-table sample_data_complex_partition \
  --target-base-path s3a://hudi-issue-standard-zone/sample-data-complex-partition/ \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://hudi-issue-raw-zone/sample-data-without-partition/ \
  --hoodie-conf hoodie.datasource.write.recordkey.field=ts,email \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
```

3. Resulting dataset on S3

```bash
aws s3 ls s3://hudi-issue-standard-zone/sample-data-complex-partition/
    PRE .hoodie/
    PRE date=2022-01-03/
    PRE date=2022-01-04/
    PRE date=2022-01-05/
    PRE date=2022-02-05/
    PRE date=2022-03-05/
    PRE date=2022-05-01/
    PRE date=2022-05-10/
```

Is there a reason why the Hudi Delta Streamer chooses not to behave like Spark here (considering the partitions as part of the schema)? I'm asking because some of my datasets have certain fields only as partitions, so we effectively lose that data in these scenarios. The way my team and I work around this is with the SQL Transformer plus the `input_file_name()` Spark SQL built-in function, extracting the values with regexes and/or substrings. Would you recommend another approach, or should we avoid the Hudi Delta Streamer altogether in these scenarios?
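For illustration, the extraction in that workaround looks roughly like this. It is only a sketch: the Hive-style path layout, the regex, and the transformer SQL in the comment are simplified assumptions, not the exact query we run.

```python
import re

# Illustrative assumption: the source file paths contain a Hive-style
# partition segment such as "date=2022-01-05".
PARTITION_PATTERN = r"date=(\d{4}-\d{2}-\d{2})"

# The roughly equivalent SQL handed to the transformer would be something like:
#   SELECT *, regexp_extract(input_file_name(), 'date=(\\d{4}-\\d{2}-\\d{2})', 1) AS date
#   FROM <SRC>

def extract_partition_value(path: str) -> str:
    """Recover the partition value that would otherwise be dropped,
    returning an empty string when the path has no partition segment."""
    match = re.search(PARTITION_PATTERN, path)
    return match.group(1) if match else ""

print(extract_partition_value(
    "s3://hudi-issue-raw-zone/sample-data/date=2022-01-05/part-00000.parquet"
))  # 2022-01-05
```

The regex-per-row approach keeps the source reader untouched, at the cost of coupling the pipeline to the bucket's directory naming.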
