lucabem commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1356756712
Hi @Virmaline, it is quite strange. I have downloaded a full table on AWS
that gives me 4 parquets (let's call them A, B, C, D). I have tested your
configuration and it works fine with every combination except reading all of
them at the same time.
| Combination | Result |
|-------------|--------|
| A | OK |
| B | OK |
| C | OK |
| D | OK |
| A, B | OK |
| A, C | OK |
| A, D | OK |
| B, C | OK |
| B, D | OK |
| C, D | OK |
| A, B, C | OK |
| A, B, D | OK |
| A, C, D | OK |
| B, C, D | OK |
| A, B, C, D | KO |
But if I read the first three parquets (A, B, C) and then read the last one
(D), it works. It looks like Spark is losing the spark-conf somewhere. This is
my spark-submit command:
```
spark-submit \
--jars jars/hudi-ext-0.12.1.jar,jars/avro-1.11.1.jar \
--conf spark.driver.memory=12g \
--conf spark.driver.maxResultSize=12g \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED \
--conf spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED \
--conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED \
--conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED \
--conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
--conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
--conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
--conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
--conf spark.sql.legacy.avro.datetimeRebaseModeInWrite=CORRECTED \
--driver-cores 8 \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer jars/hudi-utilities-bundle_2.12-0.12.1.jar \
--table-type COPY_ON_WRITE \
--op BULK_INSERT \
--source-ordering-field dms_timestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path /home/luis/parquet/consolidation/gccc_demand_cond/ \
--target-table gccc_demand_cond \
--hoodie-conf hoodie.datasource.write.recordkey.field=id_demand_cond \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
--hoodie-conf hoodie.datasource.write.partitionpath.field= \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=/home/luis/parquet/data/gccc_demand_cond \
--hoodie-conf hoodie.datasource.write.drop.partition.columns=true \
--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
--hoodie-conf hoodie.cleaner.commits.retained=1800 \
--hoodie-conf clean.retain_commits=1800 \
--hoodie-conf archive.min_commits=2000 \
--hoodie-conf archive.max_commits=2010 \
--hoodie-conf hoodie.keep.min.commits=2000 \
--hoodie-conf hoodie.keep.max.commits=2010 \
--transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
--payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
--enable-sync \
--sync-tool-classes org.apache.hudi.hive.HiveSyncTool \
--hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
--hoodie-conf hoodie.datasource.hive_sync.enable=true \
--hoodie-conf hoodie.datasource.hive_sync.database=consolidation \
--hoodie-conf hoodie.datasource.hive_sync.table=gccc_demand_cond \
--hoodie-conf hoodie.datasource.hive_sync.mode=hms \
--hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true
```
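As background on why all those `datetimeRebaseModeInRead`/`Write` flags are in the command: Spark 2.x wrote timestamps using the hybrid Julian+Gregorian calendar, while Spark 3.x uses the proleptic Gregorian calendar, and the two disagree for dates before the 1582 cutover. This is a minimal pure-Python sketch of the discrepancy (standard Julian day number formulas, nothing to do with Spark or Hudi internals), showing the 10-day offset around the year 1500:

```python
# Julian day number (JDN) from a proleptic Gregorian calendar date
def jdn_gregorian(y, m, d):
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - y2 // 100 + y2 // 400 - 32045

# JDN from a Julian calendar date
def jdn_julian(y, m, d):
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - 32083

# Sanity check: Gregorian 2000-01-01 is JDN 2451545
print(jdn_gregorian(2000, 1, 1))  # 2451545

# In 1500, the same calendar date is 10 real days apart between the two calendars,
# which is why pre-1582 timestamps need an explicit rebase mode when read back.
print(jdn_julian(1500, 3, 1) - jdn_gregorian(1500, 3, 1))  # 10
```

Setting the modes to `CORRECTED` tells Spark the files already use the proleptic Gregorian calendar, so no rebase is applied.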
And this is my parquet schema:
```
############ file meta data ############
created_by: AWS
num_columns: 11
num_rows: 1011052
num_row_groups: 2023
format_version: 1.0
serialized_size: 1897645
############ Columns ############
dms_timestamp
create_date
update_date
update_user
update_program
optimist_lock
id_demand_cond
ini_date
end_date
id_sector_supply
cod_demand_type
############ Column(dms_timestamp) ############
name: dms_timestamp
path: dms_timestamp
max_definition_level: 0
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(create_date) ############
name: create_date
path: create_date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds,
is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
############ Column(update_date) ############
name: update_date
path: update_date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds,
is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
############ Column(update_user) ############
name: update_user
path: update_user
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(update_program) ############
name: update_program
path: update_program
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(optimist_lock) ############
name: optimist_lock
path: optimist_lock
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=8, isSigned=true)
converted_type (legacy): INT_8
############ Column(id_demand_cond) ############
name: id_demand_cond
path: id_demand_cond
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=15, scale=0)
converted_type (legacy): DECIMAL
############ Column(ini_date) ############
name: ini_date
path: ini_date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds,
is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
############ Column(end_date) ############
name: end_date
path: end_date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds,
is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
############ Column(id_sector_supply) ############
name: id_sector_supply
path: id_sector_supply
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=15, scale=0)
converted_type (legacy): DECIMAL
############ Column(cod_demand_type) ############
name: cod_demand_type
path: cod_demand_type
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
```
It is quite strange, because I have another table with only two parquets of
the same total size (400 MB) that works fine with your config.