Re: Iceberg-arrow vectorized read bug

Lessard, Steve Mon, 12 Aug 2024 12:00:27 -0700

Iceberg-arrow is a non-trivial chunk of code that appears to me as though it 
has no current owner. Based on the history I find in GitHub it appears the 
original author of the org.apache.iceberg.arrow.vectorized package, Mayur 
Srivastava, wrote this package as his one and only contribution to any public 
repository in GitHub.com and hasn’t been seen since.  That was back in April of 
2021<https://github.com/apache/iceberg/commit/657de686a1e74e9b5b77fb81e23192cfeecab057>.
 Now I need to fix a bug in this package and cannot find anyone to mentor me 
through changing this complex package.




I’ve taken yet another look at this bug in the vectorized Arrow reader code, 
issue #10275<https://github.com/apache/iceberg/issues/10275>. As a summary the 
issue is that vectorized reads fail when the parquet file does not contain a 
column that exists within the Iceberg schema. This is a schema evolution issue. 
I spent several hours studying the code, learning where and how readers for 
each column are created, how the vector holder for each column is produced, and 
how the value for each column is consumed via an ArrowVectorAccessor instance. 
I have concluded that this isn’t really a bug so much as a never implemented 
feature. There are two main reasons for my conclusion:

1.      The code currently thows a NullPointerException when trying to read a 
column that exists in the Iceberg schema but does not exist in the parquet 
schema

2.      This comment in the source code for ArrowBatchReader.java line 
53<https://github.com/apache/iceberg/blob/8bc1dde5cb587a840476004477e9f5e827ae5d61/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowBatchReader.java#L53>,
 “// Handle null vector for constant case.” I believe this is a TODO comment.



So, what’s the path forward? At a minimum, and as a first step, I believe my 
original PR for this issue should be approved and merged. That PR is 
https://github.com/apache/iceberg/pull/10284. That PR simply prevents throwing 
a NullPointerException and instead throws an UnsupportedOperationException 
(which is the same exception thrown as the default case for trying to read any 
unsupported column type.)



As I wrote earlier that could be just a first step. What could be the next 
step? Well, I drafted two hacks of the source code to help me explore and 
hopefully inform a potential solution. I want to emphasize that the pull 
requests are hacks and not intended nor proposed to be merged into Apache 
Iceberg.



First, I wrote this hack which explores adding a reader for the missing column. 
https://github.com/apache/iceberg/pull/10922 Ultimately this solution failed 
because, though I hacked together a reader, there was nothing to read inside 
the parquet file. The meat of this change is in 
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java.



Second, I write this hack which changes the dummy holder currently used for 
null values. https://github.com/apache/iceberg/pull/10923 This hack ultimately 
fails for two reasons: 1) The VectorHolder instance I created is hacked 
together with hard coded information that just isn’t available at this point in 
the call stack. 2) I don’t know how to make anything consume the NullAccessor 
class I created to access the null value from the NullVector instance.



I am hoping two things will come out of this email message:

1.      My original pull request will be approved and merged. 
https://github.com/apache/iceberg/pull/10284

2.      One or both of my hacks will inform / inspire someone on this list to 
dream up a better solution and mentor me so that I may code up that solution 
and further explore it.











From: Lessard, Steve <[email protected]>
Date: Thursday, August 1, 2024 at 4:38 PM
To: [email protected] <[email protected]>, Amogh Jahagirdar 
<[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
Hi Amogh, Do you think you could have another look at this issue or point me to 
someone who might be able to help me identify to the root cause and the correct 
fix?



From: Lessard, Steve <[email protected]>
Date: Monday, July 29, 2024 at 5:46 PM
To: [email protected] <[email protected]>, Amogh Jahagirdar 
<[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
Adding Amogh Jahagirdar to the To: line…


From: Lessard, Steve <[email protected]>
Date: Monday, July 29, 2024 at 1:12 PM
To: [email protected] <[email protected]>, 
[email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
Hi Amog,

Did you get a chance to look at this issue? I did some additional investigating 
at your suggested starting point, how we're getting to a state where the reader 
for the new column is a NullVectorReader. Here’s my understanding of what the 
code is doing…

The code creates an ArrowBatchReader instance by calling 
ArrowReader.buildReader. ArrowReader.buildReader is hard coded to call 
TypeWithSchemaVisitor.visit. TypeWIthSchemaVisitor is documented as being a 
“Visitor for traversing a Parquet type with a companion Iceberg type.” To me 
this means the only thing that can be built are readers for columns in the 
parquet file with some guidance from the table’s schema. Because the table’s 
schema was changed after the one and only row was written to the table the one 
and only parquet file does not know about the new column. It is not possible to 
build any typed reader because there is no type information in the parquet file 
for the new column.

Further, I cannot find a class named IntVectorReader, nor can I find any 
type-specific reader classes. The only VectorizedReader implementations I can 
find are

  1.  ArrowBatchReader
  2.  BaseBatchReader
  3.  ColumnarBatchReader
  4.  ConstantVectorReader
  5.  DeletedVectorReader
  6.  NullVectorReader
  7.  PositionVectorReader
  8.  VectorizedArrowReader

The unit test I wrote starts with three columns and then adds a fourth column. 
The three original column call all being read via a VectorizedArrowReader 
instance.

I am very new to the Iceberg codebase; there is much I do not know about it. As 
far as I can tell it makes sense that a NullVectorReader instance is being used 
here because there is no data within the one and only parquet file for the new 
column; a null value MUST be read.

Is there some solution I am missing?

-Steve Lessard, Teradata



From: Amogh Jahagirdar <[email protected]>
Date: Wednesday, June 26, 2024 at 10:59 PM
To: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Subject: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
You don't often get email from [email protected]. Learn why this is 
important<https://aka.ms/LearnAboutSenderIdentification>
[CAUTION: External Email]

`Hey Steve,

Thanks for the clear reproduction test case, I think that's very helpful. I did 
some debugging locally, and my suspicion is that it's incorrect/unexpected that 
NullVectorReader being used for reading the new optional column. I could be 
wrong but it seems like we should be allocating a specific typed reader (so for 
the example in the test case an IntVectorReader) . I'll try and look into this 
further sometime this week but at least from my understanding, I'd debug how 
we're getting to a state where the reader for the new column is a 
NullVectorReader and confirm if that's expected or not.

Thanks,

Amogh Jahagirdar

On Wed, Jun 26, 2024 at 6:05 PM Lessard, Steve 
<[email protected]> wrote:
I have found unexpected behavior in iceberg-arrow’s vectorized read support. 
After quite a bit of digging and collaboration with Eduard Tudenhoefner we have 
determined that there is a bug in iceberg-arrow, but we have not been able to 
determine exactly what the bug is. Can you please help identify the root cause 
of the issue I originally reported as issue 
10275<https://github.com/apache/iceberg/issues/10275>?

Since I opened that issue I’ve learned a bit more about the issue and now have 
a clear reproduction case. The steps to reproduce the bug are:

  1.  Create a table
  2.  Add one row to the table
  3.  Alter the table’s schema by adding a new, optional column with no default 
value
  4.  Read all rows, all columns from the table
  5.  Blamo! The code currently in apache/iceberg will throw a 
NullPointerException

I have written a unit test that reproduces this bug. You can view the test at 
https://github.com/apache/iceberg/pull/10284/files#diff-c3da34dcdb02c2db690c86a2b8356a405c899dec410bdb0b9bcee79fd8c63dc7

Initially I tried to fix the bug by preventing the NullPointerException, but 
all the while I suspected that the NPE is just a symptom of a larger bug. When 
I submitted a pull request containing my fix for the NPE Eduard Tudenhoefner 
reviewed the PR and came to the same conclusion, the NPE is a symptom of a 
larger bug within iceberg-arrow. The problem is neither of us can identify the 
actual bug.

Again, I ask, can you please help identify the root cause of the issue I 
originally reported as issue 
10275<https://github.com/apache/iceberg/issues/10275>?

-Steve Lessard, Teradata

Re: Iceberg-arrow vectorized read bug

Reply via email to