leobiscassi commented on issue #5485:
URL: https://github.com/apache/hudi/issues/5485#issuecomment-1116157734
Hi @yihua, thanks for the answer! About your points:

(1) Thank you, I didn't notice this possibility; these folders are annoying 😓

(2) (3)

> the parquet files you generated do not have the date field in the schema, i.e., when each individual parquet file is directly read, the date field is not there.

For Apache Spark the partitions are meant to be part of the schema, so every time we read `s3://hudi-issue-raw-zone/sample-data/` the schema of the DataFrame is the fields in the parquet files plus the partitions. What you are saying is that, regardless of the datatype/style of the partitions in the source dataset, they won't be considered fields, since the Hudi Delta Streamer just lists all the parquet files under the base path and reads them directly. Is that right to assume?

> Hudi 0.9.0-amzn-1 does not support date-typed partition field. The support is only added recently https://github.com/apache/hudi/pull/5432. However, you can still using String-typed partition field.

By this you mean that `hudi 0.9.0-amzn-1` doesn't support a `date`-typed partition field as the partition on the target, right? But the funny thing is: if I create the same sample data without partitioning, I can write it as a Hudi table without converting `date` to `string`, as in the following snippets:
1. Building the new sample data

```python
from datetime import date

from pyspark.sql import SparkSession

data = [
    {'date': date(2022, 1, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 1', 'email': '[email protected]'},
    {'date': date(2022, 1, 4), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 2', 'email': '[email protected]'},
    {'date': date(2022, 1, 3), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 3', 'email': '[email protected]'},
    {'date': date(2022, 2, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 4', 'email': '[email protected]'},
    {'date': date(2022, 3, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 5', 'email': '[email protected]'},
    {'date': date(2022, 5, 10), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 6', 'email': '[email protected]'},
    {'date': date(2022, 5, 1), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 7', 'email': '[email protected]'},
]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df.write.parquet('sample-data-without-partition')
```
2. Running the Hudi Delta Streamer job

```bash
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /usr/lib/spark/external/lib/spark-avro.jar \
  --master yarn \
  --deploy-mode client \
  --conf spark.sql.hive.convertMetastoreParquet=false /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-table sample_data_complex_partition \
  --target-base-path s3a://hudi-issue-standard-zone/sample-data-complex-partition/ \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://hudi-issue-raw-zone/sample-data-without-partition/ \
  --hoodie-conf hoodie.datasource.write.recordkey.field=ts,email \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
```

3. Resulting dataset on S3

```bash
aws s3 ls s3://hudi-issue-standard-zone/sample-data-complex-partition/
    PRE .hoodie/
    PRE date=2022-01-03/
    PRE date=2022-01-04/
    PRE date=2022-01-05/
    PRE date=2022-02-05/
    PRE date=2022-03-05/
    PRE date=2022-05-01/
    PRE date=2022-05-10/
```

Is there a reason why the Hudi Delta Streamer chooses not to behave like Spark here (considering the partitions as part of the schema)? I'm asking because some of my datasets have certain fields only as partitions, so we effectively lose that data in these scenarios. The way my team and I work around this is with the SQL Transformer plus the `input_file_name()` Spark SQL built-in function, extracting the values with regexes and/or substrings. Would you recommend another approach, or should we avoid the Hudi Delta Streamer altogether in these scenarios?
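For illustration, the extraction in that workaround looks roughly like this. It is only a sketch: the Hive-style path layout, the regex, and the transformer SQL in the comment are simplified assumptions, not the exact query we run.

```python
import re

# Illustrative assumption: the source file paths contain a Hive-style
# partition segment such as "date=2022-01-05".
PARTITION_PATTERN = r"date=(\d{4}-\d{2}-\d{2})"

# The roughly equivalent SQL handed to the transformer would be something like:
#   SELECT *, regexp_extract(input_file_name(), 'date=(\\d{4}-\\d{2}-\\d{2})', 1) AS date
#   FROM <SRC>

def extract_partition_value(path: str) -> str:
    """Recover the partition value that would otherwise be dropped,
    returning an empty string when the path has no partition segment."""
    match = re.search(PARTITION_PATTERN, path)
    return match.group(1) if match else ""

print(extract_partition_value(
    "s3://hudi-issue-raw-zone/sample-data/date=2022-01-05/part-00000.parquet"
))  # 2022-01-05
```

The regex-per-row approach keeps the source reader untouched, at the cost of coupling the pipeline to the bucket's directory naming.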
