BruceKellan commented on issue #8685:
URL: https://github.com/apache/hudi/issues/8685#issuecomment-1548964729
@danny0405
Danny, I have initially located the problem and would like to hear your opinion.
On master, hudi-flink maintains its own logic for writing Parquet, and the resulting schema is inconsistent with the schema of Parquet files written by Spark when complex types are used.
I ran some tests writing the same complex types with Flink and Spark, and the Parquet schemas they produce are different.
The biggest difference is that the map key field is `required` in the Spark schema but `optional` in the Flink schema.
spark_insert.parquet:
```
message hoodie.hudi_trips_cow.hudi_trips_cow_record {
  optional binary _hoodie_commit_time (STRING);
  optional binary _hoodie_commit_seqno (STRING);
  optional binary _hoodie_record_key (STRING);
  optional binary _hoodie_partition_path (STRING);
  optional binary _hoodie_file_name (STRING);
  optional int32 f_int;
  optional group f_array (LIST) {
    repeated binary array (STRING);
  }
  optional group int_array (LIST) {
    repeated int32 array;
  }
  optional group f_map (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required binary key (STRING);
      optional int32 value;
    }
  }
  optional group f_row {
    optional group f_nested_array (LIST) {
      repeated binary array (STRING);
    }
    optional group f_nested_row {
      optional int32 f_row_f0;
      optional binary f_row_f1 (STRING);
    }
  }
}
```
flink_insert.parquet:
```
message flink_schema {
  optional binary _hoodie_commit_time (STRING);
  optional binary _hoodie_commit_seqno (STRING);
  optional binary _hoodie_record_key (STRING);
  optional binary _hoodie_partition_path (STRING);
  optional binary _hoodie_file_name (STRING);
  required int32 f_int;
  optional group f_array (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
  optional group int_array (LIST) {
    repeated group list {
      optional int32 element;
    }
  }
  optional group f_map (MAP) {
    repeated group key_value {
      optional binary key (STRING);
      optional int32 value;
    }
  }
  optional group f_row {
    optional group f_nested_array (LIST) {
      repeated group list {
        optional binary element (STRING);
      }
    }
    optional group f_nested_row {
      optional int32 f_row_f0;
      optional binary f_row_f1 (STRING);
    }
  }
}
```
The reason there was no problem in 0.12.3 is #7345.
That PR seems to work for Spark, but because of the Flink schema inconsistency, an error is reported once the requested projection schema is set.
https://github.com/apache/hudi/blob/d2b411ad192cc5113363398e985cb21647fa8693/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroParquetReader.java#LL161C1-L167C6
IMO, we may need a patch to roll back the change to the clustering operator.
After that, we should unify the Flink and Spark Parquet schemas, but that is a breaking change. WDYT?
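For reference, if unification targets the modern 3-level layout from the Parquet logical-types specification (this is the spec's recommended form, not a statement of what Hudi will actually adopt), the `f_map` field would look like this, with the repeated group named `key_value` and the key `required`, as in Spark's output:

```
optional group f_map (MAP) {
  repeated group key_value {
    required binary key (STRING);
    optional int32 value;
  }
}
```

Flink's current output already uses the `key_value` group name but deviates from the spec by marking the key `optional`, while Spark's output uses the legacy `map (MAP_KEY_VALUE)` naming with the spec-compliant `required` key.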
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]