[
https://issues.apache.org/jira/browse/PARQUET-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney moved ARROW-7376 to PARQUET-1766:
----------------------------------------------
Component/s: (was: C++)
parquet-cpp
Fix Version/s: (was: 0.16.0)
cpp-1.6.0
Key: PARQUET-1766 (was: ARROW-7376)
Affects Version/s: (was: 0.15.1)
Workflow: patch-available, re-open possible (was: jira)
Project: Parquet (was: Apache Arrow)
> [C++] parquet NaN/null double statistics can result in endless loop
> -------------------------------------------------------------------
>
> Key: PARQUET-1766
> URL: https://issues.apache.org/jira/browse/PARQUET-1766
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Pierre Belzile
> Assignee: Francois Saint-Jacques
> Priority: Critical
> Labels: parquet, pull-request-available
> Fix For: cpp-1.6.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> There is a bug in the double column statistics computation when writing an
> array that contains only NaNs and nulls to Parquet. It loops endlessly if
> the last cell of a write group is null. The line in error is
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L633]
> which checks for NaN but not for null. The code then falls through and
> loops endlessly, which makes the program appear frozen.
> This code snippet reproduces the problem:
> {noformat}
> TEST(parquet, nans) {
> /* Create a small parquet structure */
> std::vector<std::shared_ptr<::arrow::Field>> fields;
> fields.push_back(::arrow::field("doubles", ::arrow::float64()));
> std::shared_ptr<::arrow::Schema> schema =
> ::arrow::schema(std::move(fields));
> std::unique_ptr<::arrow::RecordBatchBuilder> builder;
> ::arrow::RecordBatchBuilder::Make(schema, ::arrow::default_memory_pool(),
> &builder);
>
> builder->GetFieldAs<::arrow::DoubleBuilder>(0)->Append(std::numeric_limits<double>::quiet_NaN());
> builder->GetFieldAs<::arrow::DoubleBuilder>(0)->AppendNull();
> std::shared_ptr<::arrow::RecordBatch> batch;
> builder->Flush(&batch);
> arrow::PrettyPrint(*batch, 0, &std::cout);
> std::shared_ptr<arrow::Table> table;
> arrow::Table::FromRecordBatches({batch}, &table);
> /* Attempt to write */
> std::shared_ptr<::arrow::io::FileOutputStream> os;
> arrow::io::FileOutputStream::Open("/tmp/test.parquet", &os);
> parquet::WriterProperties::Builder writer_props_bld;
> // writer_props_bld.disable_statistics("doubles");
> std::shared_ptr<parquet::WriterProperties> writer_props =
> writer_props_bld.build();
> std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
> parquet::ArrowWriterProperties::Builder().store_schema()->build();
> std::unique_ptr<parquet::arrow::FileWriter> writer;
> parquet::arrow::FileWriter::Open(
> *table->schema(), arrow::default_memory_pool(), os,
> writer_props, arrow_props, &writer);
> writer->WriteTable(*table, 1024);
> }{noformat}
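
The description above says the NaN check does not also account for nulls. As a minimal, standalone sketch of that idea (this is not the parquet-cpp code and not the actual fix), the hypothetical helper MinMaxSkippingNaNAndNull below computes min/max over a column while skipping both kinds of slots. The function name, the use of std::vector<bool> as a validity mask, and the std::optional return type are all assumptions made for illustration. It assumes C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical helper, not the parquet-cpp implementation: returns the
// (min, max) over slots that are both non-null and non-NaN, or std::nullopt
// when no such slot exists (e.g. a column holding only NaNs and nulls).
std::optional<std::pair<double, double>> MinMaxSkippingNaNAndNull(
    const std::vector<double>& values, const std::vector<bool>& valid) {
  assert(values.size() == valid.size());
  std::optional<std::pair<double, double>> result;
  // A bounded for-loop visits each slot exactly once, so it terminates
  // regardless of how NaNs and nulls are arranged in the input.
  for (std::size_t i = 0; i < values.size(); ++i) {
    // Skip null slots and NaN values alike.
    if (!valid[i] || std::isnan(values[i])) continue;
    if (!result) {
      result = std::make_pair(values[i], values[i]);
    } else {
      result->first = std::min(result->first, values[i]);
      result->second = std::max(result->second, values[i]);
    }
  }
  return result;
}
```

With the input from the reproducer (one NaN followed by one null), this sketch returns an empty optional instead of searching for a usable value, which a caller could treat as "no min/max statistics for this column chunk".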
--
This message was sent by Atlassian Jira
(v8.3.4#803005)