Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug

Eduard Tudenhöfner Thu, 22 Aug 2024 03:51:37 -0700

Hey Steve,

I spent some time today and investigated the issue. I think you're on
the right track with #10953 <https://github.com/apache/iceberg/pull/10953> and
I left a few comments but overall I think it's close.


Thanks again for your patience on this one.

Eduard

On Fri, Aug 16, 2024 at 10:14 PM Lessard, Steve
<steve.less...@teradata.com.invalid> wrote:

> Hi Eduard,
>
> Thank you for offering to help with this issue. If you are able to find
> some time to look at this issue I’d be sure to make some time to
> collaborate with you.
>
>
>
> I have been continuing to investigate the issue. I still do not have a
> correct solution, but I believe I have something closer to a workable
> solution. I built upon the two pull requests I shared in my previous email
> and came up with the solution in
> https://github.com/apache/iceberg/pull/10953. I annotated that pull
> request to point out areas I know are problematic.
>
>
>
> -Steve Lessard, Teradata.
>
>
>
>
>
>
>
>
>
> *From: *Eduard Tudenhöfner <etudenhoef...@apache.org>
> *Date: *Friday, August 16, 2024 at 2:45 AM
> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
> *Subject: *[EXTERNAL] Re: Iceberg-arrow vectorized read bug
>
> [CAUTION: External Email]
>
>
>
> Hey Steve,
>
>
>
> It's been a long time since I did some work on the iceberg-arrow module
> but I will try to find some time next week to analyze the problem in detail
> and see what options we have for fixing it.
>
>
>
> Thanks for your patience here.
>
>
>
> Eduard
>
>
>
> On Mon, Aug 12, 2024 at 9:00 PM Lessard, Steve
> <steve.less...@teradata.com.invalid> wrote:
>
> Iceberg-arrow is a non-trivial chunk of code that appears to me as though it 
> has no current owner. Based on the history I find in GitHub it appears the 
> original author of the org.apache.iceberg.arrow.vectorized package, Mayur 
> Srivastava, wrote this package as his one and only contribution to any public 
> repository in GitHub.com and hasn’t been seen since.  That was back in April 
> of 2021 
> <https://github.com/apache/iceberg/commit/657de686a1e74e9b5b77fb81e23192cfeecab057>.
>  Now I need to fix a bug in this package and cannot find anyone to mentor me 
> through changing this complex package.
>
>
>
> I’ve taken yet another look at this bug in the vectorized Arrow reader code, 
> issue #10275 <https://github.com/apache/iceberg/issues/10275>. As a summary 
> the issue is that vectorized reads fail when the parquet file does not 
> contain a column that exists within the Iceberg schema. This is a schema 
> evolution issue. I spent several hours studying the code, learning where and 
> how readers for each column are created, how the vector holder for each 
> column is produced, and how the value for each column is consumed via an 
> ArrowVectorAccessor instance. I have concluded that this isn’t really a bug 
> so much as a never implemented feature. There are two main reasons for my 
> conclusion:
>
> 1.      The code currently thows a NullPointerException when trying to read a 
> column that exists in the Iceberg schema but does not exist in the parquet 
> schema
>
> 2.      This comment in the source code for ArrowBatchReader.java line 53 
> <https://github.com/apache/iceberg/blob/8bc1dde5cb587a840476004477e9f5e827ae5d61/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowBatchReader.java#L53>,
>  “// Handle null vector for constant case.” I believe this is a TODO comment.
>
>
>
> So, what’s the path forward? At a minimum, and as a first step, I believe my 
> original PR for this issue should be approved and merged. That PR is 
> https://github.com/apache/iceberg/pull/10284. That PR simply prevents 
> throwing a NullPointerException and instead throws an 
> UnsupportedOperationException (which is the same exception thrown as the 
> default case for trying to read any unsupported column type.)
>
>
>
> As I wrote earlier that could be just a first step. What could be the next 
> step? Well, I drafted two hacks of the source code to help me explore and 
> hopefully inform a potential solution. I want to emphasize that the pull 
> requests are hacks and not intended nor proposed to be merged into Apache 
> Iceberg.
>
>
>
> First, I wrote this hack which explores adding a reader for the missing 
> column. https://github.com/apache/iceberg/pull/10922 Ultimately this solution 
> failed because, though I hacked together a reader, there was nothing to read 
> inside the parquet file. The meat of this change is in 
> arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java.
>
>
>
> Second, I write this hack which changes the dummy holder currently used for 
> null values. https://github.com/apache/iceberg/pull/10923 This hack 
> ultimately fails for two reasons: 1) The VectorHolder instance I created is 
> hacked together with hard coded information that just isn’t available at this 
> point in the call stack. 2) I don’t know how to make anything consume the 
> NullAccessor class I created to access the null value from the NullVector 
> instance.
>
>
>
> I am hoping two things will come out of this email message:
>
> 1.      My original pull request will be approved and merged. 
> https://github.com/apache/iceberg/pull/10284
>
> 2.      One or both of my hacks will inform / inspire someone on this list to 
> dream up a better solution and mentor me so that I may code up that solution 
> and further explore it.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From: *Lessard, Steve <steve.less...@teradata.com.INVALID>
> *Date: *Thursday, August 1, 2024 at 4:38 PM
> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>, Amogh Jahagirdar <
> 2am...@gmail.com>
> *Cc: *dev@iceberg.apache.org <dev@iceberg.apache.org>
> *Subject: *Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
>
> Hi Amogh, Do you think you could have another look at this issue or point
> me to someone who might be able to help me identify to the root cause and
> the correct fix?
>
>
>
>
>
>
>
> *From: *Lessard, Steve <steve.less...@teradata.com>
> *Date: *Monday, July 29, 2024 at 5:46 PM
> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>, Amogh Jahagirdar <
> 2am...@gmail.com>
> *Cc: *dev@iceberg.apache.org <dev@iceberg.apache.org>
> *Subject: *Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
>
> Adding Amogh Jahagirdar to the To: line…
>
>
>
>
>
> *From: *Lessard, Steve <steve.less...@teradata.com>
> *Date: *Monday, July 29, 2024 at 1:12 PM
> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>,
> steve.less...@teradata.com.invalid <steve.less...@teradata.com.invalid>
> *Cc: *dev@iceberg.apache.org <dev@iceberg.apache.org>
> *Subject: *Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
>
> Hi Amog,
>
>
>
> Did you get a chance to look at this issue? I did some additional
> investigating at your suggested starting point, how we're getting to a
> state where the reader for the new column is a NullVectorReader. Here’s my
> understanding of what the code is doing…
>
>
>
> The code creates an ArrowBatchReader instance by calling
> ArrowReader.buildReader. ArrowReader.buildReader is hard coded to call
> TypeWithSchemaVisitor.visit. TypeWIthSchemaVisitor is documented as being a
> “*Visitor for traversing a Parquet type with a companion Iceberg type.*”
> To me this means the only thing that can be built are readers for columns
> in the parquet file with some guidance from the table’s schema. Because the
> table’s schema was changed after the one and only row was written to the
> table the one and only parquet file does not know about the new column. It
> is not possible to build any typed reader because there is no type
> information in the parquet file for the new column.
>
>
>
> Further, I cannot find a class named IntVectorReader, nor can I find any
> type-specific reader classes. The only VectorizedReader implementations I
> can find are
>
>    1. ArrowBatchReader
>    2. BaseBatchReader
>    3. ColumnarBatchReader
>    4. ConstantVectorReader
>    5. DeletedVectorReader
>    6. NullVectorReader
>    7. PositionVectorReader
>    8. VectorizedArrowReader
>
>
>
> The unit test I wrote starts with three columns and then adds a fourth
> column. The three original column call all being read via a
> VectorizedArrowReader instance.
>
>
>
> I am very new to the Iceberg codebase; there is much I do not know about
> it. As far as I can tell it makes sense that a NullVectorReader instance is
> being used here because there is no data within the one and only parquet
> file for the new column; a null value MUST be read.
>
>
>
> Is there some solution I am missing?
>
>
>
> -Steve Lessard, Teradata
>
>
>
>
>
>
>
> *From: *Amogh Jahagirdar <2am...@gmail.com>
> *Date: *Wednesday, June 26, 2024 at 10:59 PM
> *To: *steve.less...@teradata.com.invalid
> <steve.less...@teradata.com.invalid>
> *Cc: *dev@iceberg.apache.org <dev@iceberg.apache.org>
> *Subject: *[EXTERNAL] Re: Iceberg-arrow vectorized read bug
>
> You don't often get email from 2am...@gmail.com. Learn why this is
> important <https://aka.ms/LearnAboutSenderIdentification>
>
> [CAUTION: External Email]
>
>
>
> `Hey Steve,
>
>
>
> Thanks for the clear reproduction test case, I think that's very helpful.
> I did some debugging locally, and my suspicion is that it's
> incorrect/unexpected that NullVectorReader being used for reading the new
> optional column. I could be wrong but it seems like we should be allocating
> a specific typed reader (so for the example in the test case an
> IntVectorReader) . I'll try and look into this further sometime this week
> but at least from my understanding, I'd debug how we're getting to a state
> where the reader for the new column is a NullVectorReader and confirm if
> that's expected or not.
>
>
>
> Thanks,
>
>
>
> Amogh Jahagirdar
>
>
>
> On Wed, Jun 26, 2024 at 6:05 PM Lessard, Steve
> <steve.less...@teradata.com.invalid> wrote:
>
> I have found unexpected behavior in iceberg-arrow’s vectorized read
> support. After quite a bit of digging and collaboration with Eduard
> Tudenhoefner we have determined that there is a bug in iceberg-arrow, but
> we have not been able to determine exactly what the bug is. Can you please
> help identify the root cause of the issue I originally reported as issue
> 10275 <https://github.com/apache/iceberg/issues/10275>?
>
>
>
> Since I opened that issue I’ve learned a bit more about the issue and now
> have a clear reproduction case. The steps to reproduce the bug are:
>
>    1. Create a table
>    2. Add one row to the table
>    3. Alter the table’s schema by adding a new, optional column with no
>    default value
>    4. Read all rows, all columns from the table
>    5. Blamo! The code currently in apache/iceberg will throw a
>    NullPointerException
>
>
>
> I have written a unit test that reproduces this bug. You can view the test
> at
> https://github.com/apache/iceberg/pull/10284/files#diff-c3da34dcdb02c2db690c86a2b8356a405c899dec410bdb0b9bcee79fd8c63dc7
>
>
>
> Initially I tried to fix the bug by preventing the NullPointerException,
> but all the while I suspected that the NPE is just a symptom of a larger
> bug. When I submitted a pull request containing my fix for the NPE Eduard
> Tudenhoefner reviewed the PR and came to the same conclusion, the NPE is a
> symptom of a larger bug within iceberg-arrow. The problem is neither of us
> can identify the actual bug.
>
>
>
> Again, I ask, can you please help identify the root cause of the issue I
> originally reported as issue 10275
> <https://github.com/apache/iceberg/issues/10275>?
>
>
>
> -Steve Lessard, Teradata
>
>
>
>
>
>

Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug

Reply via email to