yihua commented on PR #11450:
URL: https://github.com/apache/hudi/pull/11450#issuecomment-2168840785
Pasting Yauhen's comment here so we have the context:
To elaborate a bit more on a possible silent data loss scenario we
discovered, but in a general way. In parquet-hadoop
`ParquetFileReader#internalReadRowGroup` there is a code
```
for (ColumnChunkMetaData mc : block.getColumns()) {
ColumnPath pathKey = mc.getPath();
ColumnDescriptor columnDescriptor = paths.get(pathKey);
if (columnDescriptor != null) {
```
Here `block` is constructed from the parquet's footer and if read from the
legacy list format file will have entry similar to this
```
{IntColumnChunkMetaData@22604} "ColumnMetaData{GZIP [internal_list, array]
repeated int64 array [PLAIN_DICTIONARY, RLE], 643}"
```
However, `paths` which initially are correctly built from the parquet's
footer will be mutated with writer's schema in `HoodieAvroReadSupport` during
merge logic in `HoodieMergeHelper`. So the map for `paths` will now contain a
key/value pair like
```
{ColumnPath@24497} "[internal_list, list, element]" ->
{ColumnDescriptor@24158} "[internal_list, list, element] required int64 element"
```
Therefore, `columnDescriptor = paths.get(pathKey)` will return null from
the java map because pathKey [internal_list, array] could not be found. So
`internalReadRowGroup` will return `rowGroup` with all values populated except
`internal_list` which will be completely missed out and end up having the null
value in the hoodie record which then will be written to a new parquet file (if
the writer schema allows optional value)
This behaviour is perfectly acceptable for the new nullable columns but for
the columns that should hold the same compatibility semantics while changing
the path in the context of `ParquetFileReader` can lead to a subtle data loss
issues like we see in that two/three level list case
I cannot think of the general robust way to handle it outside of
parquet-hadoop code but there might be a room for an additional validation
logic when `requestedSchema` is constructed in `HoodieAvroReadSupport` and
`super.init` is called
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]