[ https://issues.apache.org/jira/browse/ARROW-14047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479652#comment-17479652 ]

Will Jones commented on ARROW-14047:
------------------------------------

A couple further notes:

* The issue doesn't seem to be specific to the attached Parquet file: a new 
file saved with the latest Arrow exhibits the same read issue.
* The issue seems to be related to the overall structure rather than any 
particular row. If I remove any single row (it doesn't matter which one), the 
issue disappears.
* The first read is always good, and the sequence of valid and invalid reads 
seems to be deterministic, though the pattern isn't obvious.
* An invalid read can be detected with {{ValidateFull()}} on the {{recordList}} 
array. Here is the error message it yields:

{code}
Invalid: List child array invalid: Invalid: Struct child array #0 invalid: 
Invalid: null_count value (854) doesn't match actual number of nulls in array 
(861)
{code}
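For reference, the null-count consistency check that {{ValidateFull()}} trips on can be sketched without any Arrow dependency: recount the nulls actually recorded in the validity buffer and compare against the cached {{null_count}}. The names below ({{MiniArray}}, {{ValidateNullCount}}) are illustrative only, not Arrow's real internals:

{code:cpp}
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative stand-in for an array's null bookkeeping: a validity
// buffer (1 = valid, 0 = null) plus a cached null count that can go stale.
struct MiniArray {
  std::vector<uint8_t> validity;
  int64_t cached_null_count;
};

// Recompute the null count from the validity buffer and compare it
// against the cached value, as ValidateFull() does for each child array.
bool ValidateNullCount(const MiniArray &a) {
  int64_t actual = 0;
  for (uint8_t v : a.validity)
    if (v == 0) ++actual;
  return actual == a.cached_null_count;
}

int main() {
  MiniArray ok{{1, 0, 1, 0}, 2};     // cached count matches: 2 nulls
  MiniArray stale{{1, 0, 1, 0}, 3};  // stale count, like 854 vs. 861 above
  std::printf("ok: %d, stale: %d\n", ValidateNullCount(ok),
              ValidateNullCount(stale));
  return 0;
}
{code}

A mismatch like the second case corresponds to the "null_count value (854) doesn't match actual number of nulls in array (861)" error from {{ValidateFull()}}.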

> [C++] [Parquet] FileReader returns inconsistent results on repeat reads
> -----------------------------------------------------------------------
>
>                 Key: ARROW-14047
>                 URL: https://issues.apache.org/jira/browse/ARROW-14047
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 5.0.0
>         Environment: Centos 7 gcc 9.2.0
>            Reporter: Radu Teodorescu
>            Assignee: Will Jones
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Capture.PNG, writeReadRowGroup.parquet
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We are seeing that for certain data sets containing lists of structs, 
> repeated reads yield different results. I have a file that exhibits this 
> behavior; below is the code to reproduce it:
> {code:cpp}
> filesystem::path filePath = dirPath / "writeReadRowGroup.parquet";
> arrow::MemoryPool *pool = arrow::default_memory_pool();
> std::shared_ptr<arrow::io::ReadableFile> infile;
> PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(filePath, pool));
> 
> std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
> auto status = parquet::arrow::OpenFile(infile, pool, &arrow_reader);
> CHECK_OK(status);
> 
> std::shared_ptr<arrow::Schema> readSchema;
> CHECK_OK(arrow_reader->GetSchema(&readSchema));
> 
> std::shared_ptr<arrow::Table> table;
> std::vector<int> indicesToGet;
> CHECK_OK(arrow_reader->ReadTable(&table));
> auto recordListCol1 = arrow::Table::Make(
>     arrow::schema({table->schema()->GetFieldByName("recordList")}),
>     {table->GetColumnByName("recordList")});
> 
> for (int i = 0; i < 20; ++i) {
>   cout << "data reread operation number = " + std::to_string(i) << endl;
>   std::shared_ptr<arrow::Table> table2;
>   CHECK_OK(arrow_reader->ReadTable(&table2));
>   auto recordListCol2 = arrow::Table::Make(
>       arrow::schema({table2->schema()->GetFieldByName("recordList")}),
>       {table2->GetColumnByName("recordList")});
>   bool equals = recordListCol1->Equals(*recordListCol2);
>   if (!equals) {
>     cout << recordListCol1->ToString() << endl;
>     cout << endl << "new table" << endl;
>     cout << recordListCol2->ToString() << endl;
>     throw std::runtime_error("Subsequent re-read failure");
>   }
> }
> {code}
> Apparently, as shown in the attached capture, the state machine used to 
> track nulls breaks on subsequent reads.
> 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
