voon created HUDI-6103:
--------------------------
Summary: Validate that fieldNames are valid for streaming reads
Key: HUDI-6103
URL: https://issues.apache.org/jira/browse/HUDI-6103
Project: Apache Hudi
Issue Type: Improvement
Components: flink, flink-sql
Reporter: voon
Assignee: voon
The current error message that is thrown when an invalid fieldName is provided
in the FlinkSQL table source DDL is ambiguous and not helpful.
Example:
{code:java}
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at
org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.lambda$genPartColumnarRowReader$0(ParquetSplitReaderUtil.java:119)
at java.util.stream.IntPipeline$4$1.accept(IntPipeline.java:250)
at
java.util.Spliterators$IntArraySpliterator.forEachRemaining(Spliterators.java:1032)
at java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:693)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at
org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.genPartColumnarRowReader(ParquetSplitReaderUtil.java:121)
at
org.apache.hudi.table.format.RecordIterators.getParquetRecordIterator(RecordIterators.java:56)
at
org.apache.hudi.table.format.mor.MergeOnReadInputFormat.getBaseFileIterator(MergeOnReadInputFormat.java:341)
at
org.apache.hudi.table.format.mor.MergeOnReadInputFormat.getBaseFileIterator(MergeOnReadInputFormat.java:316)
at
org.apache.hudi.table.format.mor.MergeOnReadInputFormat.initIterator(MergeOnReadInputFormat.java:200)
at
org.apache.hudi.table.format.mor.MergeOnReadInputFormat.open(MergeOnReadInputFormat.java:185)
at
org.apache.hudi.table.format.mor.MergeOnReadInputFormat.open(MergeOnReadInputFormat.java:91)
at
org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:84)
at
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
at
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
at
org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
{code}
Such user-errors can be easily fixed, but the error message that is thrown does
not make such an error obvious.
This Jira ticket aims to add better error message so to make such errors more
obvious.
*How to trigger the error*
For a source table with a schema as such:
{code:java}
CREATE TABLE `table_with_correct_schema` (
`id` INT,
`user_id` INT,
`name` STRING,
`partition_col` STRING
) PARTITIONED BY (`partition_col`)
WITH (
'connector' = 'hudi',
...
){code}
Change a column to an incorrect name as such when reading:
{code:java}
CREATE TABLE `table_with_correct_schema` (
`id` INT,
`user_id_with_typo123` INT,
`name` STRING,
`partition_col` STRING
) PARTITIONED BY (`partition_col`)
WITH (
'connector' = 'hudi',
...
){code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)