Hi Amogh,

Did you get a chance to look at this issue? I did some additional investigating at your suggested starting point: how we're getting to a state where the reader for the new column is a NullVectorReader. Here’s my understanding of what the code is doing…

The code creates an ArrowBatchReader instance by calling ArrowReader.buildReader. ArrowReader.buildReader is hard-coded to call TypeWithSchemaVisitor.visit. TypeWithSchemaVisitor is documented as being a “Visitor for traversing a Parquet type with a companion Iceberg type.” To me this means the only readers that can be built are readers for columns in the Parquet file, with some guidance from the table’s schema. Because the table’s schema was changed after the one and only row was written to the table, the one and only Parquet file does not know about the new column. It is not possible to build any typed reader because there is no type information in the Parquet file for the new column.

Further, I cannot find a class named IntVectorReader, nor can I find any type-specific reader classes. The only VectorizedReader implementations I can find are:

1. ArrowBatchReader
2. BaseBatchReader
3. ColumnarBatchReader
4. ConstantVectorReader
5. DeletedVectorReader
6. NullVectorReader
7. PositionVectorReader
8. VectorizedArrowReader

The unit test I wrote starts with three columns and then adds a fourth column. The three original columns are all being read via a VectorizedArrowReader instance.

I am very new to the Iceberg codebase; there is much I do not know about it. As far as I can tell, it makes sense that a NullVectorReader instance is being used here because there is no data within the one and only Parquet file for the new column; a null value MUST be read. Is there some solution I am missing?

-Steve Lessard, Teradata

From: Amogh Jahagirdar <2am...@gmail.com>
Date: Wednesday, June 26, 2024 at 10:59 PM
To: steve.less...@teradata.com.invalid <steve.less...@teradata.com.invalid>
Cc: dev@iceberg.apache.org <dev@iceberg.apache.org>
Subject: [EXTERNAL] Re: Iceberg-arrow vectorized read bug

Hey Steve,

Thanks for the clear reproduction test case, I think that's very helpful. I did some debugging locally, and my suspicion is that it's incorrect/unexpected that NullVectorReader is being used for reading the new optional column. I could be wrong, but it seems like we should be allocating a specific typed reader (so for the example in the test case, an IntVectorReader). I'll try and look into this further sometime this week, but at least from my understanding, I'd debug how we're getting to a state where the reader for the new column is a NullVectorReader and confirm if that's expected or not.

Thanks,
Amogh Jahagirdar

On Wed, Jun 26, 2024 at 6:05 PM Lessard, Steve <steve.less...@teradata.com.invalid> wrote:

I have found unexpected behavior in iceberg-arrow’s vectorized read support. After quite a bit of digging and collaboration with Eduard Tudenhoefner, we have determined that there is a bug in iceberg-arrow, but we have not been able to determine exactly what the bug is. Can you please help identify the root cause of the issue I originally reported as issue 10275<https://github.com/apache/iceberg/issues/10275>?

Since I opened that issue I’ve learned a bit more about it and now have a clear reproduction case. The steps to reproduce the bug are:

1. Create a table
2. Add one row to the table
3. Alter the table’s schema by adding a new, optional column with no default value
4. Read all rows, all columns from the table
5. Blamo! The code currently in apache/iceberg will throw a NullPointerException

I have written a unit test that reproduces this bug. You can view the test at https://github.com/apache/iceberg/pull/10284/files#diff-c3da34dcdb02c2db690c86a2b8356a405c899dec410bdb0b9bcee79fd8c63dc7

Initially I tried to fix the bug by preventing the NullPointerException, but all the while I suspected that the NPE is just a symptom of a larger bug.
When I submitted a pull request containing my fix for the NPE, Eduard Tudenhoefner reviewed the PR and came to the same conclusion: the NPE is a symptom of a larger bug within iceberg-arrow. The problem is that neither of us can identify the actual bug. Again, I ask, can you please help identify the root cause of the issue I originally reported as issue 10275<https://github.com/apache/iceberg/issues/10275>?

-Steve Lessard, Teradata
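
For readers following the thread, here is a toy model of the situation discussed above: a data file written before a column was added has no data for that column, so a reader built from the file's columns can only hand back null for it. This is only a sketch under stated assumptions, not Iceberg's code. Every name in it (MissingColumnSketch, ColumnReader, FileColumnReader, NullColumnReader, buildReaders, readRow) is hypothetical and invented for illustration, and the column names and values are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types exist in Iceberg.
public class MissingColumnSketch {

    /** Hypothetical stand-in for a per-column reader. */
    interface ColumnReader {
        Object read(int rowIndex);
    }

    /** Reads values that are physically present in the (modeled) data file. */
    static final class FileColumnReader implements ColumnReader {
        private final List<Object> values;

        FileColumnReader(List<Object> values) {
            this.values = values;
        }

        @Override
        public Object read(int rowIndex) {
            return values.get(rowIndex);
        }
    }

    /** Returns null for every row, modeling a column the file never contained. */
    static final class NullColumnReader implements ColumnReader {
        @Override
        public Object read(int rowIndex) {
            return null;
        }
    }

    /**
     * Builds one reader per column of the current table schema. Columns the
     * file knows about get a file-backed reader; columns added after the file
     * was written get a null reader.
     */
    static Map<String, ColumnReader> buildReaders(
            List<String> tableColumns, Map<String, List<Object>> fileColumns) {
        Map<String, ColumnReader> readers = new LinkedHashMap<>();
        for (String name : tableColumns) {
            List<Object> values = fileColumns.get(name);
            readers.put(name, values != null ? new FileColumnReader(values) : new NullColumnReader());
        }
        return readers;
    }

    /** Assembles one row by asking each column reader for its value. */
    static List<Object> readRow(Map<String, ColumnReader> readers, int rowIndex) {
        List<Object> row = new ArrayList<>();
        for (ColumnReader reader : readers.values()) {
            row.add(reader.read(rowIndex));
        }
        return row;
    }

    public static void main(String[] args) {
        // The file was written with three columns and a single row.
        Map<String, List<Object>> fileColumns = new LinkedHashMap<>();
        fileColumns.put("a", List.of(1));
        fileColumns.put("b", List.of("x"));
        fileColumns.put("c", List.of(2.5));

        // The table schema later gained a fourth, optional column "d".
        List<String> tableColumns = List.of("a", "b", "c", "d");

        Map<String, ColumnReader> readers = buildReaders(tableColumns, fileColumns);
        System.out.println(readRow(readers, 0)); // prints [1, x, 2.5, null]
    }
}
```

In this model the null reader for "d" is the expected outcome, which matches the reasoning in the latest reply. The open question in the thread is different: why reading through such a reader in iceberg-arrow ends in a NullPointerException instead of yielding a null value.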