Iceberg-arrow is a non-trivial chunk of code that appears to me as though it has no current owner. Based on the history I find in GitHub it appears the original author of the org.apache.iceberg.arrow.vectorized package, Mayur Srivastava, wrote this package as his one and only contribution to any public repository in GitHub.com and hasn’t been seen since. That was back in April of 2021<https://github.com/apache/iceberg/commit/657de686a1e74e9b5b77fb81e23192cfeecab057>. Now I need to fix a bug in this package and cannot find anyone to mentor me through changing this complex package.
I’ve taken yet another look at this bug in the vectorized Arrow reader code, issue #10275<https://github.com/apache/iceberg/issues/10275>. As a summary the issue is that vectorized reads fail when the parquet file does not contain a column that exists within the Iceberg schema. This is a schema evolution issue. I spent several hours studying the code, learning where and how readers for each column are created, how the vector holder for each column is produced, and how the value for each column is consumed via an ArrowVectorAccessor instance. I have concluded that this isn’t really a bug so much as a never implemented feature. There are two main reasons for my conclusion: 1. The code currently thows a NullPointerException when trying to read a column that exists in the Iceberg schema but does not exist in the parquet schema 2. This comment in the source code for ArrowBatchReader.java line 53<https://github.com/apache/iceberg/blob/8bc1dde5cb587a840476004477e9f5e827ae5d61/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowBatchReader.java#L53>, “// Handle null vector for constant case.” I believe this is a TODO comment. So, what’s the path forward? At a minimum, and as a first step, I believe my original PR for this issue should be approved and merged. That PR is https://github.com/apache/iceberg/pull/10284. That PR simply prevents throwing a NullPointerException and instead throws an UnsupportedOperationException (which is the same exception thrown as the default case for trying to read any unsupported column type.) As I wrote earlier that could be just a first step. What could be the next step? Well, I drafted two hacks of the source code to help me explore and hopefully inform a potential solution. I want to emphasize that the pull requests are hacks and not intended nor proposed to be merged into Apache Iceberg. First, I wrote this hack which explores adding a reader for the missing column. https://github.com/apache/iceberg/pull/10922 Ultimately this solution failed because, though I hacked together a reader, there was nothing to read inside the parquet file. The meat of this change is in arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java. Second, I write this hack which changes the dummy holder currently used for null values. https://github.com/apache/iceberg/pull/10923 This hack ultimately fails for two reasons: 1) The VectorHolder instance I created is hacked together with hard coded information that just isn’t available at this point in the call stack. 2) I don’t know how to make anything consume the NullAccessor class I created to access the null value from the NullVector instance. I am hoping two things will come out of this email message: 1. My original pull request will be approved and merged. https://github.com/apache/iceberg/pull/10284 2. One or both of my hacks will inform / inspire someone on this list to dream up a better solution and mentor me so that I may code up that solution and further explore it. From: Lessard, Steve <steve.less...@teradata.com.INVALID> Date: Thursday, August 1, 2024 at 4:38 PM To: dev@iceberg.apache.org <dev@iceberg.apache.org>, Amogh Jahagirdar <2am...@gmail.com> Cc: dev@iceberg.apache.org <dev@iceberg.apache.org> Subject: Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug Hi Amogh, Do you think you could have another look at this issue or point me to someone who might be able to help me identify to the root cause and the correct fix? From: Lessard, Steve <steve.less...@teradata.com> Date: Monday, July 29, 2024 at 5:46 PM To: dev@iceberg.apache.org <dev@iceberg.apache.org>, Amogh Jahagirdar <2am...@gmail.com> Cc: dev@iceberg.apache.org <dev@iceberg.apache.org> Subject: Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug Adding Amogh Jahagirdar to the To: line… From: Lessard, Steve <steve.less...@teradata.com> Date: Monday, July 29, 2024 at 1:12 PM To: dev@iceberg.apache.org <dev@iceberg.apache.org>, steve.less...@teradata.com.invalid <steve.less...@teradata.com.invalid> Cc: dev@iceberg.apache.org <dev@iceberg.apache.org> Subject: Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug Hi Amog, Did you get a chance to look at this issue? I did some additional investigating at your suggested starting point, how we're getting to a state where the reader for the new column is a NullVectorReader. Here’s my understanding of what the code is doing… The code creates an ArrowBatchReader instance by calling ArrowReader.buildReader. ArrowReader.buildReader is hard coded to call TypeWithSchemaVisitor.visit. TypeWIthSchemaVisitor is documented as being a “Visitor for traversing a Parquet type with a companion Iceberg type.” To me this means the only thing that can be built are readers for columns in the parquet file with some guidance from the table’s schema. Because the table’s schema was changed after the one and only row was written to the table the one and only parquet file does not know about the new column. It is not possible to build any typed reader because there is no type information in the parquet file for the new column. Further, I cannot find a class named IntVectorReader, nor can I find any type-specific reader classes. The only VectorizedReader implementations I can find are 1. ArrowBatchReader 2. BaseBatchReader 3. ColumnarBatchReader 4. ConstantVectorReader 5. DeletedVectorReader 6. NullVectorReader 7. PositionVectorReader 8. VectorizedArrowReader The unit test I wrote starts with three columns and then adds a fourth column. The three original column call all being read via a VectorizedArrowReader instance. I am very new to the Iceberg codebase; there is much I do not know about it. As far as I can tell it makes sense that a NullVectorReader instance is being used here because there is no data within the one and only parquet file for the new column; a null value MUST be read. Is there some solution I am missing? -Steve Lessard, Teradata From: Amogh Jahagirdar <2am...@gmail.com> Date: Wednesday, June 26, 2024 at 10:59 PM To: steve.less...@teradata.com.invalid <steve.less...@teradata.com.invalid> Cc: dev@iceberg.apache.org <dev@iceberg.apache.org> Subject: [EXTERNAL] Re: Iceberg-arrow vectorized read bug You don't often get email from 2am...@gmail.com. Learn why this is important<https://aka.ms/LearnAboutSenderIdentification> [CAUTION: External Email] `Hey Steve, Thanks for the clear reproduction test case, I think that's very helpful. I did some debugging locally, and my suspicion is that it's incorrect/unexpected that NullVectorReader being used for reading the new optional column. I could be wrong but it seems like we should be allocating a specific typed reader (so for the example in the test case an IntVectorReader) . I'll try and look into this further sometime this week but at least from my understanding, I'd debug how we're getting to a state where the reader for the new column is a NullVectorReader and confirm if that's expected or not. Thanks, Amogh Jahagirdar On Wed, Jun 26, 2024 at 6:05 PM Lessard, Steve <steve.less...@teradata.com.invalid> wrote: I have found unexpected behavior in iceberg-arrow’s vectorized read support. After quite a bit of digging and collaboration with Eduard Tudenhoefner we have determined that there is a bug in iceberg-arrow, but we have not been able to determine exactly what the bug is. Can you please help identify the root cause of the issue I originally reported as issue 10275<https://github.com/apache/iceberg/issues/10275>? Since I opened that issue I’ve learned a bit more about the issue and now have a clear reproduction case. The steps to reproduce the bug are: 1. Create a table 2. Add one row to the table 3. Alter the table’s schema by adding a new, optional column with no default value 4. Read all rows, all columns from the table 5. Blamo! The code currently in apache/iceberg will throw a NullPointerException I have written a unit test that reproduces this bug. You can view the test at https://github.com/apache/iceberg/pull/10284/files#diff-c3da34dcdb02c2db690c86a2b8356a405c899dec410bdb0b9bcee79fd8c63dc7 Initially I tried to fix the bug by preventing the NullPointerException, but all the while I suspected that the NPE is just a symptom of a larger bug. When I submitted a pull request containing my fix for the NPE Eduard Tudenhoefner reviewed the PR and came to the same conclusion, the NPE is a symptom of a larger bug within iceberg-arrow. The problem is neither of us can identify the actual bug. Again, I ask, can you please help identify the root cause of the issue I originally reported as issue 10275<https://github.com/apache/iceberg/issues/10275>? -Steve Lessard, Teradata