lucabem commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1356756712
Hi @Virmaline, it is quite strange. I have downloaded a full table on AWS
that gives me 4 parquets (let's call them A, B, C, D). I have tested your
configuration and it works fine with every combination except reading all of
them at the same time.
| Combination | Result |
|-------------|--------|
| A | OK |
| B | OK |
| C | OK |
| D | OK |
| A, B | OK |
| A, C | OK |
| A, D | OK |
| B, C | OK |
| B, D | OK |
| C, D | OK |
| A, B, C | OK |
| A, B, D | OK |
| A, C, D | OK |
| B, C, D | OK |
| A, B, C, D | KO |
But if I read the first three parquets (A, B, C) and then read the last one
(D), it works. It looks like Spark is losing the spark-conf somewhere. This is
my spark-submit command:
```
spark-submit \
--jars jars/hudi-ext-0.12.1.jar,jars/avro-1.11.1.jar \
--conf spark.driver.memory=12g \
--conf spark.driver.maxResultSize=12g \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED \
--conf spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED \
--conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED \
--conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED \
--conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
--conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
--conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
--conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
--conf spark.sql.legacy.avro.datetimeRebaseModeInWrite=CORRECTED \
--driver-cores 8 \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer jars/hudi-utilities-bundle_2.12-0.12.1.jar \
--table-type COPY_ON_WRITE \
--op BULK_INSERT \
--source-ordering-field dms_timestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path /home/luis/parquet/consolidation/gccc_demand_cond/ \
--target-table gccc_demand_cond \
--hoodie-conf hoodie.datasource.write.recordkey.field=id_demand_cond \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
--hoodie-conf hoodie.datasource.write.partitionpath.field= \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=/home/luis/parquet/data/gccc_demand_cond \
--hoodie-conf hoodie.datasource.write.drop.partition.columns=true \
--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
--hoodie-conf hoodie.cleaner.commits.retained=1800 \
--hoodie-conf clean.retain_commits=1800 \
--hoodie-conf archive.min_commits=2000 \
--hoodie-conf archive.max_commits=2010 \
--hoodie-conf hoodie.keep.min.commits=2000 \
--hoodie-conf hoodie.keep.max.commits=2010 \
--transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
--payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
--enable-sync \
--sync-tool-classes org.apache.hudi.hive.HiveSyncTool \
--hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
--hoodie-conf hoodie.datasource.hive_sync.enable=true \
--hoodie-conf hoodie.datasource.hive_sync.database=consolidation \
--hoodie-conf hoodie.datasource.hive_sync.table=gccc_demand_cond \
--hoodie-conf hoodie.datasource.hive_sync.mode=hms \
--hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true
```
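As background on why all those `datetimeRebaseModeInRead`/`Write` flags are in the command: Spark 2.x wrote timestamps using the hybrid Julian+Gregorian calendar, while Spark 3.x uses the proleptic Gregorian calendar, and the two disagree for dates before the 1582 cutover. This is a minimal pure-Python sketch of the discrepancy (standard Julian day number formulas, nothing to do with Spark or Hudi internals), showing the 10-day offset around the year 1500:

```python
# Julian day number (JDN) from a proleptic Gregorian calendar date
def jdn_gregorian(y, m, d):
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - y2 // 100 + y2 // 400 - 32045

# JDN from a Julian calendar date
def jdn_julian(y, m, d):
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - 32083

# Sanity check: Gregorian 2000-01-01 is JDN 2451545
print(jdn_gregorian(2000, 1, 1))  # 2451545

# In 1500, the same calendar date is 10 real days apart between the two calendars,
# which is why pre-1582 timestamps need an explicit rebase mode when read back.
print(jdn_julian(1500, 3, 1) - jdn_gregorian(1500, 3, 1))  # 10
```

Setting the modes to `CORRECTED` tells Spark the files already use the proleptic Gregorian calendar, so no rebase is applied.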
And this is my parquet schema:
```
############ file meta data ############
created_by: AWS
num_columns: 11
num_rows: 1011052
num_row_groups: 2023
format_version: 1.0
serialized_size: 1897645
############ Columns ############
dms_timestamp
create_date
update_date
update_user
update_program
optimist_lock
id_demand_cond
ini_date
end_date
id_sector_supply
cod_demand_type
############ Column(dms_timestamp) ############
name: dms_timestamp
path: dms_timestamp
max_definition_level: 0
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(create_date) ############
name: create_date
path: create_date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds,
is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
############ Column(update_date) ############
name: update_date
path: update_date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds,
is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
############ Column(update_user) ############
name: update_user
path: update_user
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(update_program) ############
name: update_program
path: update_program
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(optimist_lock) ############
name: optimist_lock
path: optimist_lock
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=8, isSigned=true)
converted_type (legacy): INT_8
############ Column(id_demand_cond) ############
name: id_demand_cond
path: id_demand_cond
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=15, scale=0)
converted_type (legacy): DECIMAL
############ Column(ini_date) ############
name: ini_date
path: ini_date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds,
is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
############ Column(end_date) ############
name: end_date
path: end_date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds,
is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
############ Column(id_sector_supply) ############
name: id_sector_supply
path: id_sector_supply
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=15, scale=0)
converted_type (legacy): DECIMAL
############ Column(cod_demand_type) ############
name: cod_demand_type
path: cod_demand_type
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
```
It is quite strange, because I have another table with only two parquets of
the same total size (400 MB) that works fine with your config.