Re: Iceberg-arrow vectorized read bug

Eduard Tudenhöfner Fri, 16 Aug 2024 02:45:45 -0700

Hey Steve,

It's been a long time since I did some work on the iceberg-arrow module but
I will try to find some time next week to analyze the problem in detail and
see what options we have for fixing it.


Thanks for your patience here.

Eduard

On Mon, Aug 12, 2024 at 9:00 PM Lessard, Steve
<[email protected]> wrote:

> Iceberg-arrow is a non-trivial chunk of code that appears to me as though it 
> has no current owner. Based on the history I find in GitHub it appears the 
> original author of the org.apache.iceberg.arrow.vectorized package, Mayur 
> Srivastava, wrote this package as his one and only contribution to any public 
> repository in GitHub.com and hasn’t been seen since.  That was back in April 
> of 2021 
> <https://github.com/apache/iceberg/commit/657de686a1e74e9b5b77fb81e23192cfeecab057>.
>  Now I need to fix a bug in this package and cannot find anyone to mentor me 
> through changing this complex package.
>
>
>
> I’ve taken yet another look at this bug in the vectorized Arrow reader code, 
> issue #10275 <https://github.com/apache/iceberg/issues/10275>. As a summary 
> the issue is that vectorized reads fail when the parquet file does not 
> contain a column that exists within the Iceberg schema. This is a schema 
> evolution issue. I spent several hours studying the code, learning where and 
> how readers for each column are created, how the vector holder for each 
> column is produced, and how the value for each column is consumed via an 
> ArrowVectorAccessor instance. I have concluded that this isn’t really a bug 
> so much as a never implemented feature. There are two main reasons for my 
> conclusion:
>
> 1.      The code currently thows a NullPointerException when trying to read a 
> column that exists in the Iceberg schema but does not exist in the parquet 
> schema
>
> 2.      This comment in the source code for ArrowBatchReader.java line 53 
> <https://github.com/apache/iceberg/blob/8bc1dde5cb587a840476004477e9f5e827ae5d61/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowBatchReader.java#L53>,
>  “// Handle null vector for constant case.” I believe this is a TODO comment.
>
>
>
> So, what’s the path forward? At a minimum, and as a first step, I believe my 
> original PR for this issue should be approved and merged. That PR is 
> https://github.com/apache/iceberg/pull/10284. That PR simply prevents 
> throwing a NullPointerException and instead throws an 
> UnsupportedOperationException (which is the same exception thrown as the 
> default case for trying to read any unsupported column type.)
>
>
>
> As I wrote earlier that could be just a first step. What could be the next 
> step? Well, I drafted two hacks of the source code to help me explore and 
> hopefully inform a potential solution. I want to emphasize that the pull 
> requests are hacks and not intended nor proposed to be merged into Apache 
> Iceberg.
>
>
>
> First, I wrote this hack which explores adding a reader for the missing 
> column. https://github.com/apache/iceberg/pull/10922 Ultimately this solution 
> failed because, though I hacked together a reader, there was nothing to read 
> inside the parquet file. The meat of this change is in 
> arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java.
>
>
>
> Second, I write this hack which changes the dummy holder currently used for 
> null values. https://github.com/apache/iceberg/pull/10923 This hack 
> ultimately fails for two reasons: 1) The VectorHolder instance I created is 
> hacked together with hard coded information that just isn’t available at this 
> point in the call stack. 2) I don’t know how to make anything consume the 
> NullAccessor class I created to access the null value from the NullVector 
> instance.
>
>
>
> I am hoping two things will come out of this email message:
>
> 1.      My original pull request will be approved and merged. 
> https://github.com/apache/iceberg/pull/10284
>
> 2.      One or both of my hacks will inform / inspire someone on this list to 
> dream up a better solution and mentor me so that I may code up that solution 
> and further explore it.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From: *Lessard, Steve <[email protected]>
> *Date: *Thursday, August 1, 2024 at 4:38 PM
> *To: *[email protected] <[email protected]>, Amogh Jahagirdar <
> [email protected]>
> *Cc: *[email protected] <[email protected]>
> *Subject: *Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
>
> Hi Amogh, Do you think you could have another look at this issue or point
> me to someone who might be able to help me identify to the root cause and
> the correct fix?
>
>
>
>
>
>
>
> *From: *Lessard, Steve <[email protected]>
> *Date: *Monday, July 29, 2024 at 5:46 PM
> *To: *[email protected] <[email protected]>, Amogh Jahagirdar <
> [email protected]>
> *Cc: *[email protected] <[email protected]>
> *Subject: *Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
>
> Adding Amogh Jahagirdar to the To: line…
>
>
>
>
>
> *From: *Lessard, Steve <[email protected]>
> *Date: *Monday, July 29, 2024 at 1:12 PM
> *To: *[email protected] <[email protected]>,
> [email protected] <[email protected]>
> *Cc: *[email protected] <[email protected]>
> *Subject: *Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
>
> Hi Amog,
>
>
>
> Did you get a chance to look at this issue? I did some additional
> investigating at your suggested starting point, how we're getting to a
> state where the reader for the new column is a NullVectorReader. Here’s my
> understanding of what the code is doing…
>
>
>
> The code creates an ArrowBatchReader instance by calling
> ArrowReader.buildReader. ArrowReader.buildReader is hard coded to call
> TypeWithSchemaVisitor.visit. TypeWIthSchemaVisitor is documented as being a
> “*Visitor for traversing a Parquet type with a companion Iceberg type.*”
> To me this means the only thing that can be built are readers for columns
> in the parquet file with some guidance from the table’s schema. Because the
> table’s schema was changed after the one and only row was written to the
> table the one and only parquet file does not know about the new column. It
> is not possible to build any typed reader because there is no type
> information in the parquet file for the new column.
>
>
>
> Further, I cannot find a class named IntVectorReader, nor can I find any
> type-specific reader classes. The only VectorizedReader implementations I
> can find are
>
>    1. ArrowBatchReader
>    2. BaseBatchReader
>    3. ColumnarBatchReader
>    4. ConstantVectorReader
>    5. DeletedVectorReader
>    6. NullVectorReader
>    7. PositionVectorReader
>    8. VectorizedArrowReader
>
>
>
> The unit test I wrote starts with three columns and then adds a fourth
> column. The three original column call all being read via a
> VectorizedArrowReader instance.
>
>
>
> I am very new to the Iceberg codebase; there is much I do not know about
> it. As far as I can tell it makes sense that a NullVectorReader instance is
> being used here because there is no data within the one and only parquet
> file for the new column; a null value MUST be read.
>
>
>
> Is there some solution I am missing?
>
>
>
> -Steve Lessard, Teradata
>
>
>
>
>
>
>
> *From: *Amogh Jahagirdar <[email protected]>
> *Date: *Wednesday, June 26, 2024 at 10:59 PM
> *To: *[email protected]
> <[email protected]>
> *Cc: *[email protected] <[email protected]>
> *Subject: *[EXTERNAL] Re: Iceberg-arrow vectorized read bug
>
> You don't often get email from [email protected]. Learn why this is
> important <https://aka.ms/LearnAboutSenderIdentification>
>
> [CAUTION: External Email]
>
>
>
> `Hey Steve,
>
>
>
> Thanks for the clear reproduction test case, I think that's very helpful.
> I did some debugging locally, and my suspicion is that it's
> incorrect/unexpected that NullVectorReader being used for reading the new
> optional column. I could be wrong but it seems like we should be allocating
> a specific typed reader (so for the example in the test case an
> IntVectorReader) . I'll try and look into this further sometime this week
> but at least from my understanding, I'd debug how we're getting to a state
> where the reader for the new column is a NullVectorReader and confirm if
> that's expected or not.
>
>
>
> Thanks,
>
>
>
> Amogh Jahagirdar
>
>
>
> On Wed, Jun 26, 2024 at 6:05 PM Lessard, Steve
> <[email protected]> wrote:
>
> I have found unexpected behavior in iceberg-arrow’s vectorized read
> support. After quite a bit of digging and collaboration with Eduard
> Tudenhoefner we have determined that there is a bug in iceberg-arrow, but
> we have not been able to determine exactly what the bug is. Can you please
> help identify the root cause of the issue I originally reported as issue
> 10275 <https://github.com/apache/iceberg/issues/10275>?
>
>
>
> Since I opened that issue I’ve learned a bit more about the issue and now
> have a clear reproduction case. The steps to reproduce the bug are:
>
>    1. Create a table
>    2. Add one row to the table
>    3. Alter the table’s schema by adding a new, optional column with no
>    default value
>    4. Read all rows, all columns from the table
>    5. Blamo! The code currently in apache/iceberg will throw a
>    NullPointerException
>
>
>
> I have written a unit test that reproduces this bug. You can view the test
> at
> https://github.com/apache/iceberg/pull/10284/files#diff-c3da34dcdb02c2db690c86a2b8356a405c899dec410bdb0b9bcee79fd8c63dc7
>
>
>
> Initially I tried to fix the bug by preventing the NullPointerException,
> but all the while I suspected that the NPE is just a symptom of a larger
> bug. When I submitted a pull request containing my fix for the NPE Eduard
> Tudenhoefner reviewed the PR and came to the same conclusion, the NPE is a
> symptom of a larger bug within iceberg-arrow. The problem is neither of us
> can identify the actual bug.
>
>
>
> Again, I ask, can you please help identify the root cause of the issue I
> originally reported as issue 10275
> <https://github.com/apache/iceberg/issues/10275>?
>
>
>
> -Steve Lessard, Teradata
>
>
>
>
>
>

Re: Iceberg-arrow vectorized read bug

Reply via email to