[ https://issues.apache.org/jira/browse/ARROW-14047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479652#comment-17479652 ]
Will Jones commented on ARROW-14047:
------------------------------------

A couple of further notes:
 * The issue doesn't seem to be just with this particular Parquet file. I can save a new file with the latest Arrow that has the same read issue.
 * The issue seems to be related to the overall structure rather than any particular row. If I remove any single row (it doesn't matter which one), the issue disappears.
 * The first read is always good, and the sequence of valid and invalid reads seems to be deterministic, though the pattern isn't obvious.
 * An invalid read can be detected by calling {{ValidateFull()}} on the {{recordList}} array. Here is the error message it yields:

{code}
Invalid: List child array invalid: Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number of nulls in array (861)
{code}

> [C++] [Parquet] FileReader returns inconsistent results on repeat reads
> -----------------------------------------------------------------------
>
>                 Key: ARROW-14047
>                 URL: https://issues.apache.org/jira/browse/ARROW-14047
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 5.0.0
>        Environment: Centos 7
>                     gcc 9.2.0
>            Reporter: Radu Teodorescu
>            Assignee: Will Jones
>            Priority: Major
>              Labels: pull-request-available
>        Attachments: Capture.PNG, writeReadRowGroup.parquet
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We are seeing that, for certain data sets containing lists of structs, repeated reads yield different results. I have a file that exhibits this behavior; below is the code for reproducing it:
> {code:java}
> filesystem::path filePath = dirPath / "writeReadRowGroup.parquet";
> arrow::MemoryPool *pool = arrow::default_memory_pool();
> std::shared_ptr<arrow::io::ReadableFile> infile;
> PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(filePath, pool));
> std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
> auto status = parquet::arrow::OpenFile(infile, pool, &arrow_reader);
> CHECK_OK(status);
> std::shared_ptr<arrow::Schema> readSchema;
> CHECK_OK(arrow_reader->GetSchema(&readSchema));
> std::shared_ptr<arrow::Table> table;
> std::vector<int> indicesToGet;
> CHECK_OK(arrow_reader->ReadTable(&table));
> auto recordListCol1 = arrow::Table::Make(
>     arrow::schema({table->schema()->GetFieldByName("recordList")}),
>     {table->GetColumnByName("recordList")});
> for (int i = 0; i < 20; ++i) {
>   cout << "data reread operation number = " + std::to_string(i) << endl;
>   std::shared_ptr<arrow::Table> table2;
>   CHECK_OK(arrow_reader->ReadTable(&table2));
>   auto recordListCol2 = arrow::Table::Make(
>       arrow::schema({table2->schema()->GetFieldByName("recordList")}),
>       {table2->GetColumnByName("recordList")});
>   bool equals = recordListCol1->Equals(*recordListCol2);
>   if (!equals) {
>     cout << recordListCol1->ToString() << endl;
>     cout << endl << "new table" << endl;
>     cout << recordListCol2->ToString() << endl;
>     throw std::runtime_error("Subsequent re-read failure");
>   }
> }
> {code}
> Apparently, as shown in the attached capture, the state machine used to track nulls is broken on subsequent usage.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)