[
https://issues.apache.org/jira/browse/HUDI-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
voon updated HUDI-6211:
-----------------------
Description:
If a Hudi table's struct-typed column is evolved with newly added or reordered
fields, Hudi-on-Flink cannot read the Parquet files that were written with the
old schema.
Different errors are thrown depending on the schema-evolution operation. If a
new field is appended to the end of the struct column, the error below is
thrown:
{code:java}
Caused by: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
at java.util.ArrayList.rangeCheck(ArrayList.java:659)
at java.util.ArrayList.get(ArrayList.java:435)
at org.apache.parquet.schema.GroupType.getType(GroupType.java:216)
at
org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:528)
at
org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:416)
at
org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.createWritableVectors(ParquetColumnarRowSplitReader.java:217)
at
org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.<init>(ParquetColumnarRowSplitReader.java:157)
at
org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.genPartColumnarRowReader(ParquetSplitReaderUtil.java:148)
at
org.apache.hudi.table.format.RecordIterators.getParquetRecordIterator(RecordIterators.java:72)
at
org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:132)
at
org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:66)
at
org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:84)
at
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
at
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
at
org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:332)
{code}
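The IndexOutOfBoundsException above is consistent with a positional lookup: the reader builds a writable column vector for every field of the evolved (larger) table schema, but indexes into the old file's GroupType by position, so the appended field's index falls past the end of the file schema. A minimal, self-contained Java sketch of that failure mode (the field lists below are hypothetical stand-ins for the Parquet schemas, not Hudi code):

{code:java}
import java.util.Arrays;
import java.util.List;

public class PositionalLookupDemo {
    // Old file schema: the struct had two fields when the file was written.
    static final List<String> FILE_FIELDS = Arrays.asList("f0", "f1");
    // Evolved table schema: a third field was appended to the struct.
    static final List<String> TABLE_FIELDS = Arrays.asList("f0", "f1", "f2");

    /** Positional resolution, as the failing reader effectively does. */
    static String resolveByPosition(int tableFieldIndex) {
        // Throws IndexOutOfBoundsException when tableFieldIndex >= FILE_FIELDS.size(),
        // mirroring GroupType.getType(index) being called on the old file schema.
        return FILE_FIELDS.get(tableFieldIndex);
    }

    public static void main(String[] args) {
        for (int i = 0; i < TABLE_FIELDS.size(); i++) {
            try {
                System.out.println(TABLE_FIELDS.get(i) + " -> " + resolveByPosition(i));
            } catch (IndexOutOfBoundsException e) {
                // The appended field "f2" has no positional match in the old file.
                System.out.println(TABLE_FIELDS.get(i) + " -> out of bounds");
            }
        }
    }
}
{code}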
If a field is added in the middle of the struct column, an error such as the
following is thrown:
{code:java}
java.lang.IllegalArgumentException: Unexpected type: INT32
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:77)
at
org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:456)
at
org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:413)
at
org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.createWritableVectors(ParquetColumnarRowSplitReader.java:216)
at
org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.<init>(ParquetColumnarRowSplitReader.java:156)
at
org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.genPartColumnarRowReader(ParquetSplitReaderUtil.java:151)
at
org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:151)
at
org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:68)
at
org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:116)
at
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:128)
at
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
at
org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
{code}
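The "Unexpected type: INT32" case fits the same root cause: inserting a field mid-struct shifts the positions of the fields behind it, so positional matching pairs a table field with a physical column of the wrong type. One common way to make old files readable under the evolved schema is to resolve struct fields by name instead of by position, returning null (or a default) for fields absent from the file. The sketch below is only an illustration of that idea in plain Java, not the actual Hudi fix; all names are hypothetical:

{code:java}
import java.util.Arrays;
import java.util.List;

public class NameBasedLookupDemo {
    // Old file schema for the struct column.
    static final List<String> FILE_FIELDS = Arrays.asList("f0", "f1");

    /** Resolve a table field against the file schema by name; null if absent. */
    static String resolveByName(String tableFieldName) {
        // Handles appended fields (missing from the file -> null/default) and
        // reordered or shifted fields (position no longer matters) alike.
        int idx = FILE_FIELDS.indexOf(tableFieldName);
        return idx < 0 ? null : FILE_FIELDS.get(idx);
    }

    public static void main(String[] args) {
        // Evolved schema: existing fields reordered, plus a newly added "f2".
        for (String f : Arrays.asList("f1", "f0", "f2")) {
            System.out.println(f + " -> " + resolveByName(f));
        }
    }
}
{code}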
> Fix reading of schema-evolved complex columns for Flink
> -------------------------------------------------------
>
> Key: HUDI-6211
> URL: https://issues.apache.org/jira/browse/HUDI-6211
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)