Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug

Lessard, Steve Thu, 01 Aug 2024 16:38:06 -0700

Hi Amogh, Do you think you could have another look at this issue or point me to 
someone who might be able to help me identify the root cause and the correct 
fix?

From: Lessard, Steve <[email protected]>
Date: Monday, July 29, 2024 at 5:46 PM
To: [email protected] <[email protected]>, Amogh Jahagirdar 
<[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
Adding Amogh Jahagirdar to the To: line…

From: Lessard, Steve <[email protected]>
Date: Monday, July 29, 2024 at 1:12 PM
To: [email protected] <[email protected]>, 
[email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
Hi Amogh,

Did you get a chance to look at this issue? I did some additional investigating 
at your suggested starting point: how we're getting to a state where the reader 
for the new column is a NullVectorReader. Here’s my understanding of what the 
code is doing…

The code creates an ArrowBatchReader instance by calling 
ArrowReader.buildReader. ArrowReader.buildReader is hard-coded to call 
TypeWithSchemaVisitor.visit. TypeWithSchemaVisitor is documented as being a 
“Visitor for traversing a Parquet type with a companion Iceberg type.” To me 
this means the only readers that can be built are readers for columns in the 
parquet file, with some guidance from the table’s schema. Because the table’s 
schema was changed after the one and only row was written to the table, the one 
and only parquet file does not know about the new column. It is not possible to 
build any typed reader because there is no type information in the parquet file 
for the new column.
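
The reasoning above can be sketched as a simplified model. This is not Iceberg's actual code: the class name, the method chooseReaders, and the reader labels are hypothetical, and the sketch assumes that table columns are matched to file columns by field ID. It only shows why a column added after the file was written ends up with a null reader.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; none of these names are Iceberg classes.
class ReaderSelectionSketch {

  // For each column in the table schema (field ID -> column name), pick a
  // reader kind. A column whose field ID is absent from the data file cannot
  // get a typed reader in this model, so it is assigned a null reader.
  static Map<String, String> chooseReaders(
      Map<Integer, String> tableSchema, Set<Integer> fileFieldIds) {
    Map<String, String> readers = new LinkedHashMap<>();
    for (Map.Entry<Integer, String> column : tableSchema.entrySet()) {
      boolean inFile = fileFieldIds.contains(column.getKey());
      readers.put(column.getValue(), inFile ? "typed reader" : "null reader");
    }
    return readers;
  }

  public static void main(String[] args) {
    // The data file was written while the table had three columns.
    Set<Integer> fileFieldIds = new HashSet<>(Arrays.asList(1, 2, 3));

    // The table schema now also contains a fourth, optional column.
    Map<Integer, String> tableSchema = new LinkedHashMap<>();
    tableSchema.put(1, "a");
    tableSchema.put(2, "b");
    tableSchema.put(3, "c");
    tableSchema.put(4, "d");

    System.out.println(chooseReaders(tableSchema, fileFieldIds));
  }
}
```

In this model, columns a, b, and c map to a typed reader and column d maps to a null reader, which matches what the debugger shows for the unit test.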

Further, I cannot find a class named IntVectorReader, nor can I find any 
type-specific reader classes. The only VectorizedReader implementations I can 
find are

  1.  ArrowBatchReader
  2.  BaseBatchReader
  3.  ColumnarBatchReader
  4.  ConstantVectorReader
  5.  DeletedVectorReader
  6.  NullVectorReader
  7.  PositionVectorReader
  8.  VectorizedArrowReader

The unit test I wrote starts with three columns and then adds a fourth column. 
The three original columns are all being read via a VectorizedArrowReader 
instance.

I am very new to the Iceberg codebase; there is much I do not know about it. As 
far as I can tell it makes sense that a NullVectorReader instance is being used 
here because there is no data within the one and only parquet file for the new 
column; a null value MUST be read.
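
As a minimal illustration of that expectation, the sketch below models the result of reading a column that has no data in the file: one null per row in the batch. The class and method names are hypothetical and a plain Java list stands in for an Arrow vector; this is not how NullVectorReader is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for reading a column that is missing from the file.
class NullColumnSketch {

  // Returns numRows null values, one for each row in the batch.
  static List<Integer> readMissingColumn(int numRows) {
    return new ArrayList<>(Collections.<Integer>nCopies(numRows, null));
  }

  public static void main(String[] args) {
    // The table in the reproduction case holds exactly one row.
    List<Integer> newColumn = readMissingColumn(1);
    System.out.println(newColumn);
  }
}
```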

Is there some solution I am missing?

-Steve Lessard, Teradata

From: Amogh Jahagirdar <[email protected]>
Date: Wednesday, June 26, 2024 at 10:59 PM
To: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Subject: [EXTERNAL] Re: Iceberg-arrow vectorized read bug

Hey Steve,

Thanks for the clear reproduction test case, I think that's very helpful. I did 
some debugging locally, and my suspicion is that it's incorrect/unexpected that 
NullVectorReader is being used for reading the new optional column. I could be 
wrong, but it seems like we should be allocating a specific typed reader (so for 
the example in the test case, an IntVectorReader). I'll try and look into this 
further sometime this week, but at least from my understanding, I'd debug how 
we're getting to a state where the reader for the new column is a 
NullVectorReader and confirm if that's expected or not.

Thanks,

Amogh Jahagirdar

On Wed, Jun 26, 2024 at 6:05 PM Lessard, Steve 
<[email protected]> wrote:
I have found unexpected behavior in iceberg-arrow’s vectorized read support. 
After quite a bit of digging and collaboration with Eduard Tudenhoefner we have 
determined that there is a bug in iceberg-arrow, but we have not been able to 
determine exactly what the bug is. Can you please help identify the root cause 
of the issue I originally reported as issue 
10275<https://github.com/apache/iceberg/issues/10275>?

Since I opened that issue I’ve learned a bit more about the issue and now have 
a clear reproduction case. The steps to reproduce the bug are:

  1.  Create a table
  2.  Add one row to the table
  3.  Alter the table’s schema by adding a new, optional column with no default 
value
  4.  Read all rows, all columns from the table
  5.  Blamo! The code currently in apache/iceberg will throw a 
NullPointerException

I have written a unit test that reproduces this bug. You can view the test at 
https://github.com/apache/iceberg/pull/10284/files#diff-c3da34dcdb02c2db690c86a2b8356a405c899dec410bdb0b9bcee79fd8c63dc7

Initially I tried to fix the bug by preventing the NullPointerException, but 
all the while I suspected that the NPE is just a symptom of a larger bug. When 
I submitted a pull request containing my fix for the NPE, Eduard Tudenhoefner 
reviewed the PR and came to the same conclusion: the NPE is a symptom of a 
larger bug within iceberg-arrow. The problem is that neither of us can identify 
the actual bug.

Again, I ask, can you please help identify the root cause of the issue I 
originally reported as issue 
10275<https://github.com/apache/iceberg/issues/10275>?

-Steve Lessard, Teradata
