Hey Steve, I spent some time today and investigated the issue. I think you're on the right track with #10953 <https://github.com/apache/iceberg/pull/10953> and I left a few comments but overall I think it's close.
Thanks again for your patience on this one. Eduard On Fri, Aug 16, 2024 at 10:14 PM Lessard, Steve <steve.less...@teradata.com.invalid> wrote: > Hi Eduard, > > Thank you for offering to help with this issue. If you are able to find > some time to look at this issue I’d be sure to make some time to > collaborate with you. > > > > I have been continuing to investigate the issue. I still do not have a > correct solution, but I believe I have something closer to a workable > solution. I built upon the two pull requests I shared in my previous email > and came up with the solution in > https://github.com/apache/iceberg/pull/10953. I annotated that pull > request to point out areas I know are problematic. > > > > -Steve Lessard, Teradata. > > > > > > > > > > *From: *Eduard Tudenhöfner <etudenhoef...@apache.org> > *Date: *Friday, August 16, 2024 at 2:45 AM > *To: *dev@iceberg.apache.org <dev@iceberg.apache.org> > *Subject: *[EXTERNAL] Re: Iceberg-arrow vectorized read bug > > [CAUTION: External Email] > > > > Hey Steve, > > > > It's been a long time since I did some work on the iceberg-arrow module > but I will try to find some time next week to analyze the problem in detail > and see what options we have for fixing it. > > > > Thanks for your patience here. > > > > Eduard > > > > On Mon, Aug 12, 2024 at 9:00 PM Lessard, Steve > <steve.less...@teradata.com.invalid> wrote: > > Iceberg-arrow is a non-trivial chunk of code that appears to me as though it > has no current owner. Based on the history I find in GitHub it appears the > original author of the org.apache.iceberg.arrow.vectorized package, Mayur > Srivastava, wrote this package as his one and only contribution to any public > repository in GitHub.com and hasn’t been seen since. That was back in April > of 2021 > <https://github.com/apache/iceberg/commit/657de686a1e74e9b5b77fb81e23192cfeecab057>. > Now I need to fix a bug in this package and cannot find anyone to mentor me > through changing this complex package. > > > > I’ve taken yet another look at this bug in the vectorized Arrow reader code, > issue #10275 <https://github.com/apache/iceberg/issues/10275>. As a summary > the issue is that vectorized reads fail when the parquet file does not > contain a column that exists within the Iceberg schema. This is a schema > evolution issue. I spent several hours studying the code, learning where and > how readers for each column are created, how the vector holder for each > column is produced, and how the value for each column is consumed via an > ArrowVectorAccessor instance. I have concluded that this isn’t really a bug > so much as a never implemented feature. There are two main reasons for my > conclusion: > > 1. The code currently thows a NullPointerException when trying to read a > column that exists in the Iceberg schema but does not exist in the parquet > schema > > 2. This comment in the source code for ArrowBatchReader.java line 53 > <https://github.com/apache/iceberg/blob/8bc1dde5cb587a840476004477e9f5e827ae5d61/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowBatchReader.java#L53>, > “// Handle null vector for constant case.” I believe this is a TODO comment. > > > > So, what’s the path forward? At a minimum, and as a first step, I believe my > original PR for this issue should be approved and merged. That PR is > https://github.com/apache/iceberg/pull/10284. That PR simply prevents > throwing a NullPointerException and instead throws an > UnsupportedOperationException (which is the same exception thrown as the > default case for trying to read any unsupported column type.) > > > > As I wrote earlier that could be just a first step. What could be the next > step? Well, I drafted two hacks of the source code to help me explore and > hopefully inform a potential solution. I want to emphasize that the pull > requests are hacks and not intended nor proposed to be merged into Apache > Iceberg. > > > > First, I wrote this hack which explores adding a reader for the missing > column. https://github.com/apache/iceberg/pull/10922 Ultimately this solution > failed because, though I hacked together a reader, there was nothing to read > inside the parquet file. The meat of this change is in > arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java. > > > > Second, I write this hack which changes the dummy holder currently used for > null values. https://github.com/apache/iceberg/pull/10923 This hack > ultimately fails for two reasons: 1) The VectorHolder instance I created is > hacked together with hard coded information that just isn’t available at this > point in the call stack. 2) I don’t know how to make anything consume the > NullAccessor class I created to access the null value from the NullVector > instance. > > > > I am hoping two things will come out of this email message: > > 1. My original pull request will be approved and merged. > https://github.com/apache/iceberg/pull/10284 > > 2. One or both of my hacks will inform / inspire someone on this list to > dream up a better solution and mentor me so that I may code up that solution > and further explore it. > > > > > > > > > > > > > > > > *From: *Lessard, Steve <steve.less...@teradata.com.INVALID> > *Date: *Thursday, August 1, 2024 at 4:38 PM > *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>, Amogh Jahagirdar < > 2am...@gmail.com> > *Cc: *dev@iceberg.apache.org <dev@iceberg.apache.org> > *Subject: *Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug > > Hi Amogh, Do you think you could have another look at this issue or point > me to someone who might be able to help me identify to the root cause and > the correct fix? > > > > > > > > *From: *Lessard, Steve <steve.less...@teradata.com> > *Date: *Monday, July 29, 2024 at 5:46 PM > *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>, Amogh Jahagirdar < > 2am...@gmail.com> > *Cc: *dev@iceberg.apache.org <dev@iceberg.apache.org> > *Subject: *Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug > > Adding Amogh Jahagirdar to the To: line… > > > > > > *From: *Lessard, Steve <steve.less...@teradata.com> > *Date: *Monday, July 29, 2024 at 1:12 PM > *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>, > steve.less...@teradata.com.invalid <steve.less...@teradata.com.invalid> > *Cc: *dev@iceberg.apache.org <dev@iceberg.apache.org> > *Subject: *Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug > > Hi Amog, > > > > Did you get a chance to look at this issue? I did some additional > investigating at your suggested starting point, how we're getting to a > state where the reader for the new column is a NullVectorReader. Here’s my > understanding of what the code is doing… > > > > The code creates an ArrowBatchReader instance by calling > ArrowReader.buildReader. ArrowReader.buildReader is hard coded to call > TypeWithSchemaVisitor.visit. TypeWIthSchemaVisitor is documented as being a > “*Visitor for traversing a Parquet type with a companion Iceberg type.*” > To me this means the only thing that can be built are readers for columns > in the parquet file with some guidance from the table’s schema. Because the > table’s schema was changed after the one and only row was written to the > table the one and only parquet file does not know about the new column. It > is not possible to build any typed reader because there is no type > information in the parquet file for the new column. > > > > Further, I cannot find a class named IntVectorReader, nor can I find any > type-specific reader classes. The only VectorizedReader implementations I > can find are > > 1. ArrowBatchReader > 2. BaseBatchReader > 3. ColumnarBatchReader > 4. ConstantVectorReader > 5. DeletedVectorReader > 6. NullVectorReader > 7. PositionVectorReader > 8. VectorizedArrowReader > > > > The unit test I wrote starts with three columns and then adds a fourth > column. The three original column call all being read via a > VectorizedArrowReader instance. > > > > I am very new to the Iceberg codebase; there is much I do not know about > it. As far as I can tell it makes sense that a NullVectorReader instance is > being used here because there is no data within the one and only parquet > file for the new column; a null value MUST be read. > > > > Is there some solution I am missing? > > > > -Steve Lessard, Teradata > > > > > > > > *From: *Amogh Jahagirdar <2am...@gmail.com> > *Date: *Wednesday, June 26, 2024 at 10:59 PM > *To: *steve.less...@teradata.com.invalid > <steve.less...@teradata.com.invalid> > *Cc: *dev@iceberg.apache.org <dev@iceberg.apache.org> > *Subject: *[EXTERNAL] Re: Iceberg-arrow vectorized read bug > > You don't often get email from 2am...@gmail.com. Learn why this is > important <https://aka.ms/LearnAboutSenderIdentification> > > [CAUTION: External Email] > > > > `Hey Steve, > > > > Thanks for the clear reproduction test case, I think that's very helpful. > I did some debugging locally, and my suspicion is that it's > incorrect/unexpected that NullVectorReader being used for reading the new > optional column. I could be wrong but it seems like we should be allocating > a specific typed reader (so for the example in the test case an > IntVectorReader) . I'll try and look into this further sometime this week > but at least from my understanding, I'd debug how we're getting to a state > where the reader for the new column is a NullVectorReader and confirm if > that's expected or not. > > > > Thanks, > > > > Amogh Jahagirdar > > > > On Wed, Jun 26, 2024 at 6:05 PM Lessard, Steve > <steve.less...@teradata.com.invalid> wrote: > > I have found unexpected behavior in iceberg-arrow’s vectorized read > support. After quite a bit of digging and collaboration with Eduard > Tudenhoefner we have determined that there is a bug in iceberg-arrow, but > we have not been able to determine exactly what the bug is. Can you please > help identify the root cause of the issue I originally reported as issue > 10275 <https://github.com/apache/iceberg/issues/10275>? > > > > Since I opened that issue I’ve learned a bit more about the issue and now > have a clear reproduction case. The steps to reproduce the bug are: > > 1. Create a table > 2. Add one row to the table > 3. Alter the table’s schema by adding a new, optional column with no > default value > 4. Read all rows, all columns from the table > 5. Blamo! The code currently in apache/iceberg will throw a > NullPointerException > > > > I have written a unit test that reproduces this bug. You can view the test > at > https://github.com/apache/iceberg/pull/10284/files#diff-c3da34dcdb02c2db690c86a2b8356a405c899dec410bdb0b9bcee79fd8c63dc7 > > > > Initially I tried to fix the bug by preventing the NullPointerException, > but all the while I suspected that the NPE is just a symptom of a larger > bug. When I submitted a pull request containing my fix for the NPE Eduard > Tudenhoefner reviewed the PR and came to the same conclusion, the NPE is a > symptom of a larger bug within iceberg-arrow. The problem is neither of us > can identify the actual bug. > > > > Again, I ask, can you please help identify the root cause of the issue I > originally reported as issue 10275 > <https://github.com/apache/iceberg/issues/10275>? > > > > -Steve Lessard, Teradata > > > > > >