pitrou commented on code in PR #15125: URL: https://github.com/apache/arrow/pull/15125#discussion_r1060707883
########## cpp/src/arrow/dataset/file_parquet.cc: ########## @@ -98,6 +98,17 @@ Result<std::shared_ptr<SchemaManifest>> GetSchemaManifest( return manifest; } +bool isNan(const Scalar& value) { Review Comment: Please, let's follow the coding conventions ```suggestion bool IsNan(const Scalar& value) { ``` ########## cpp/src/arrow/dataset/file_parquet.cc: ########## @@ -98,6 +98,17 @@ Result<std::shared_ptr<SchemaManifest>> GetSchemaManifest( return manifest; } +bool isNan(const Scalar& value) { + if (value.type->Equals(*float32())) { + const FloatScalar& float_scalar = checked_cast<const FloatScalar&>(value); + return std::isnan(float_scalar.value); Review Comment: Do we know for sure that `float_scalar.valid` is true? ########## cpp/src/arrow/dataset/file_parquet.cc: ########## @@ -144,17 +159,23 @@ std::optional<compute::Expression> ColumnChunkStatisticsAsExpression( return compute::or_(std::move(single_value), is_null(std::move(field_expr))); } - auto lower_bound = - compute::greater_equal(field_expr, compute::literal(std::move(min))); - auto upper_bound = compute::less_equal(field_expr, compute::literal(std::move(max))); + auto lower_bound = compute::greater_equal(field_expr, compute::literal(min)); Review Comment: Perhaps the code could be better organized, to that NaN handling happens in a single place? (note that `min->Equals(max)` can only be false if there's a NaN...) ########## cpp/src/arrow/dataset/file_parquet.cc: ########## @@ -98,6 +98,17 @@ Result<std::shared_ptr<SchemaManifest>> GetSchemaManifest( return manifest; } +bool isNan(const Scalar& value) { + if (value.type->Equals(*float32())) { Review Comment: Doing a full type equality check is unnecessarily slow here ```suggestion if (value.type->id() == Type::FLOAT) { ``` ########## cpp/src/arrow/dataset/file_parquet.cc: ########## @@ -135,6 +146,10 @@ std::optional<compute::Expression> ColumnChunkStatisticsAsExpression( min = maybe_min.MoveValueUnsafe(); max = maybe_max.MoveValueUnsafe(); + if (isNan(*min) && isNan(*max)) { Review Comment: Please add a comment explaining the reasons why NaN needs to be handled explicitly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org