wgtmac commented on code in PR #45459:
URL: https://github.com/apache/arrow/pull/45459#discussion_r2014481983


##########
cpp/src/parquet/geospatial_statistics.h:
##########
@@ -0,0 +1,192 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cmath>
+#include <cstdint>
+#include <memory>
+
+#include "parquet/platform.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+/// \brief Structure represented encoded statistics to be written to and read from Parquet
+/// serialized metadata.
+///
+/// See the Parquet Thrift definition and GeoStatistics for the specific definition
+/// of field values.
+struct PARQUET_EXPORT EncodedGeoStatistics {
+  static constexpr double kInf = std::numeric_limits<double>::infinity();
+
+  double xmin{kInf};
+  double xmax{-kInf};
+  double ymin{kInf};
+  double ymax{-kInf};
+  double zmin{kInf};
+  double zmax{-kInf};
+  double mmin{kInf};
+  double mmax{-kInf};
+  std::vector<int32_t> geospatial_types;
+
+  bool has_x() const { return !std::isinf(xmin - xmax); }
+  bool has_y() const { return !std::isinf(ymin - ymax); }
+  bool has_z() const { return !std::isinf(zmin - zmax); }
+  bool has_m() const { return !std::isinf(mmin - mmax); }
+
+  bool is_set() const {
+    return !geospatial_types.empty() || has_x() || has_y() || has_z() || has_m();
+  }
+};
+
+class GeoStatisticsImpl;
+
+/// \brief Base type for computing geospatial column statistics while writing a file
+/// or representing them when reading a file
+///
+/// Note that NaN values that were encountered within coordinates are omitted; however,
+/// NaN values that were obtained via decoding encoded statistics are propagated. This
+/// behaviour ensures C++ clients that are inspecting statistics via the column metadata
+/// can detect the case where a writer generated NaNs (even though this implementation

Review Comment:
   > where a writer generated NaNs
   
   If that happens, is the bbox still useful for predicate pushdown (PPD)? Or
can we only drop the stats?
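
One possible answer to the question above, as a minimal sketch: a reader can treat a NaN bound as "unknown" and simply decline to prune, so the statistics stay harmless even if they are not useful. `AxisBounds` and `CanSkipRowGroup` are hypothetical names used only for illustration; they are not part of the PR, and wraparound bounds are not handled here.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical min/max pair for a single axis (not a type from the PR).
struct AxisBounds {
  double min;
  double max;
};

// Returns true only when the bounds prove that no value in the row group can
// fall inside [query_min, query_max]. A NaN bound proves nothing, so the row
// group is kept in that case.
bool CanSkipRowGroup(const AxisBounds& stats, double query_min, double query_max) {
  assert(query_min <= query_max);
  if (std::isnan(stats.min) || std::isnan(stats.max)) {
    return false;  // Unknown bounds: never prune.
  }
  return stats.max < query_min || stats.min > query_max;
}
```

With this policy a NaN bound costs only the pruning opportunity for that row group; it never causes a row group to be dropped incorrectly.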



##########
cpp/src/parquet/CMakeLists.txt:
##########
@@ -259,6 +262,12 @@ endif()
 if(NOT PARQUET_MINIMAL_DEPENDENCY)
   list(APPEND PARQUET_SHARED_LINK_LIBS arrow_shared)
 
+  # TODO(paleolimbot): Make sure this is OK or remove if not!
+  if(ARROW_JSON)

Review Comment:
   I agree that making the change as minimal as possible is also important. I'm 
fine with this. @pitrou @emkornfield WDYT?



##########
cpp/src/parquet/geospatial_statistics.cc:
##########
@@ -0,0 +1,363 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/geospatial_statistics.h"

Review Comment:
   ```suggestion
   #include "parquet/geospatial_statistics.h"
   
   ```



##########
cpp/src/parquet/geospatial_util_internal.cc:
##########
@@ -0,0 +1,237 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/geospatial_util_internal.h"
+
+#include "arrow/result.h"
+#include "arrow/util/endian.h"
+#include "arrow/util/macros.h"
+#include "arrow/util/ubsan.h"
+
+namespace parquet::geometry {
+
+/// \brief Object to keep track of the low-level consumption of a well-known binary
+/// geometry
+///
+/// Briefly, ISO well-known binary supported by the Parquet spec is an endian byte
+/// (0x01 or 0x00), followed by geometry type + dimensions encoded as a (uint32_t),
+/// followed by geometry-specific data. Coordinate sequences are represented by a
+/// uint32_t (the number of coordinates) plus a sequence of doubles (number of coordinates
+/// multiplied by the number of dimensions).
+class WKBBuffer {
+ public:
+  WKBBuffer() : data_(NULLPTR), size_(0) {}

Review Comment:
   ```suggestion
     WKBBuffer() : data_(nullptr), size_(0) {}
   ```
   
   We only need to use NULLPTR in the header files.



##########
cpp/src/parquet/geospatial_util_internal.cc:
##########
@@ -0,0 +1,237 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/geospatial_util_internal.h"
+
+#include "arrow/result.h"
+#include "arrow/util/endian.h"
+#include "arrow/util/macros.h"
+#include "arrow/util/ubsan.h"
+
+namespace parquet::geometry {
+
+/// \brief Object to keep track of the low-level consumption of a well-known binary
+/// geometry
+///
+/// Briefly, ISO well-known binary supported by the Parquet spec is an endian byte
+/// (0x01 or 0x00), followed by geometry type + dimensions encoded as a (uint32_t),
+/// followed by geometry-specific data. Coordinate sequences are represented by a
+/// uint32_t (the number of coordinates) plus a sequence of doubles (number of coordinates
+/// multiplied by the number of dimensions).
+class WKBBuffer {
+ public:
+  WKBBuffer() : data_(NULLPTR), size_(0) {}
+  WKBBuffer(const uint8_t* data, int64_t size) : data_(data), size_(size) {}
+
+  ::arrow::Result<uint8_t> ReadUInt8() { return ReadChecked<uint8_t>(); }
+
+  ::arrow::Result<uint32_t> ReadUInt32(bool swap) {
+    ARROW_ASSIGN_OR_RAISE(auto value, ReadChecked<uint32_t>());
+    if (ARROW_PREDICT_FALSE(swap)) {
+      return ByteSwap(value);
+    } else {
+      return value;
+    }
+  }
+
+  template <typename Coord, typename Visit>
+  ::arrow::Status ReadDoubles(uint32_t n_coords, bool swap, Visit&& visit) {
+    size_t total_bytes = n_coords * sizeof(Coord);
+    if (size_ < total_bytes) {
+      return ::arrow::Status::SerializationError(
+          "Can't coordinate sequence of ", total_bytes, " bytes from WKBBuffer with ",

Review Comment:
   ```suggestion
             "Can't read coordinate sequence of ", total_bytes, " bytes from WKBBuffer with ",
   ```
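
For readers unfamiliar with the pattern behind this error message, here is a minimal sketch of a bounds-checked read over a byte range. `ByteCursor` and `TryRead` are hypothetical names for illustration only; the PR's `WKBBuffer` returns `::arrow::Result` instead of `std::optional`, and its exact implementation is not shown in this hunk.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical cursor over a read-only byte range (not the PR's WKBBuffer).
struct ByteCursor {
  const uint8_t* data;
  int64_t size;  // Number of bytes remaining.
};

// Reads one T in native byte order, or returns std::nullopt when fewer than
// sizeof(T) bytes remain. std::memcpy avoids unaligned-access problems.
template <typename T>
std::optional<T> TryRead(ByteCursor& cursor) {
  assert(cursor.size >= 0);
  if (cursor.size < static_cast<int64_t>(sizeof(T))) {
    return std::nullopt;  // Caller reports "Can't read ... bytes".
  }
  T value;
  std::memcpy(&value, cursor.data, sizeof(T));
  cursor.data += sizeof(T);
  cursor.size -= static_cast<int64_t>(sizeof(T));
  return value;
}
```

The size check happens before any byte is touched, which is why the failure path can report both the requested and the remaining byte counts.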



##########
cpp/src/parquet/column_writer_test.cc:
##########
@@ -1849,8 +1867,130 @@ TEST_F(TestValuesWriterInt32Type, AvoidCompressedInDataPageV2) {
     verify_only_one_uncompressed_page(/*total_num_values=*/1);
   }
 }
-
 #endif
 
+// Test writing and reading geometry columns
+class TestGeometryValuesWriter : public TestPrimitiveWriter<ByteArrayType> {
+ public:
+  void SetUpSchema(Repetition::type repetition, int num_columns) override {
+    std::vector<schema::NodePtr> fields;
+
+    for (int i = 0; i < num_columns; ++i) {
+      std::string name = TestColumnName(i);
+      std::shared_ptr<const LogicalType> logical_type =
+          GeometryLogicalType::Make("srid:1234");
+      fields.push_back(schema::PrimitiveNode::Make(name, repetition, logical_type,
+                                                   ByteArrayType::type_num));
+    }
+    node_ = schema::GroupNode::Make("schema", Repetition::REQUIRED, fields);
+    schema_.Init(node_);
+  }
+
+  void GenerateData(int64_t num_values, uint32_t seed = 0) {
+    values_.resize(num_values);
+
+    buffer_.resize(num_values * kWkbPointXYSize);
+    uint8_t* ptr = buffer_.data();
+    for (int k = 0; k < num_values; k++) {
+      std::string item = test::MakeWKBPoint(
+          {static_cast<double>(k), static_cast<double>(k + 1)}, false, false);
+      std::memcpy(ptr, item.data(), item.size());
+      values_[k].len = kWkbPointXYSize;
+      values_[k].ptr = ptr;
+      ptr += kWkbPointXYSize;
+    }
+
+    values_ptr_ = values_.data();
+  }
+};
+
+TEST_F(TestGeometryValuesWriter, TestWriteAndRead) {
+  this->SetUpSchema(Repetition::REQUIRED, 1);
+  this->GenerateData(SMALL_SIZE);
+  size_t num_values = this->values_.size();
+  auto writer = this->BuildWriter(num_values, ColumnProperties());
+  writer->WriteBatch(this->values_.size(), nullptr, nullptr, this->values_.data());
+
+  writer->Close();
+  this->ReadColumn();
+  for (size_t i = 0; i < num_values; i++) {
+    const ByteArray& value = this->values_out_[i];
+    auto xy = GetWKBPointCoordinateXY(value);
+    EXPECT_TRUE(xy.has_value());
+    auto expected_x = static_cast<double>(i);
+    auto expected_y = static_cast<double>(i + 1);
+    EXPECT_EQ(*xy, (std::pair<double, double>(expected_x, expected_y)));
+  }
+
+  ASSERT_TRUE(metadata_accessor()->is_geo_stats_set());
+  std::shared_ptr<GeoStatistics> geospatial_statistics = metadata_geo_stats();

Review Comment:
   nit: check that regular statistics are disabled due to the unknown sort order.



##########
cpp/src/parquet/test_util.cc:
##########
@@ -194,5 +194,84 @@ void prefixed_random_byte_array(int n, uint32_t seed, uint8_t* buf, int len, FLB
   }
 }
 
+namespace {
+
+uint32_t GeometryTypeToWKB(geometry::GeometryType geometry_type, bool has_z, bool has_m) {
+  auto wkb_geom_type = static_cast<uint32_t>(geometry_type);
+
+  if (has_z) {
+    wkb_geom_type += 1000;
+  }
+
+  if (has_m) {
+    wkb_geom_type += 2000;
+  }
+
+  return wkb_geom_type;
+}
+
+}  // namespace
+
+std::string MakeWKBPoint(const std::vector<double>& xyzm, bool has_z, bool has_m) {
+  // 1:endianness + 4:type + 8:x + 8:y
+  int num_bytes =
+      kWkbPointXYSize + (has_z ? sizeof(double) : 0) + (has_m ? sizeof(double) : 0);
+  std::string wkb(num_bytes, 0);
+  char* ptr = wkb.data();
+
+  ptr[0] = kWkbNativeEndianness;
+  uint32_t geom_type = GeometryTypeToWKB(geometry::GeometryType::kPoint, has_z, has_m);
+  std::memcpy(&ptr[1], &geom_type, 4);
+  std::memcpy(&ptr[5], &xyzm[0], 8);
+  std::memcpy(&ptr[13], &xyzm[1], 8);
+  ptr += 21;
+
+  if (has_z) {
+    std::memcpy(ptr, &xyzm[2], 8);
+    ptr += 8;
+  }
+
+  if (has_m) {
+    std::memcpy(ptr, &xyzm[3], 8);
+  }
+
+  return wkb;
+}
+
+std::optional<std::pair<double, double>> GetWKBPointCoordinateXY(const ByteArray& value) {
+  if (value.len != kWkbPointXYSize) {
+    return std::nullopt;
+  }
+
+  if (value.ptr[0] != kWkbNativeEndianness) {
+    return std::nullopt;
+  }
+
+  uint32_t expected_geom_type =
+      GeometryTypeToWKB(geometry::GeometryType::kPoint, false, false);

Review Comment:
   ```suggestion
         GeometryTypeToWKB(geometry::GeometryType::kPoint, /*has_z=*/false, /*has_m=*/false);
   ```
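
As background for the two boolean arguments, the ISO WKB type code adds 1000 for Z and 2000 for M to the base geometry type, so the encoding can be inverted with integer division and remainder. The sketch below illustrates that arithmetic; `EncodeWkbType`, `DecodeWkbType`, and `DecodedWkbType` are hypothetical names, not helpers from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for illustration.
struct DecodedWkbType {
  uint32_t base_type;  // 1 = Point, 2 = LineString, 3 = Polygon, ...
  bool has_z;
  bool has_m;
};

// ISO WKB: base type, plus 1000 when Z is present, plus 2000 when M is present.
uint32_t EncodeWkbType(uint32_t base_type, bool has_z, bool has_m) {
  assert(base_type < 1000);
  return base_type + (has_z ? 1000u : 0u) + (has_m ? 2000u : 0u);
}

// Inverse of EncodeWkbType for codes in the ISO range.
DecodedWkbType DecodeWkbType(uint32_t code) {
  const uint32_t dims = code / 1000;  // 0: XY, 1: XYZ, 2: XYM, 3: XYZM
  return {code % 1000, dims == 1 || dims == 3, dims == 2 || dims == 3};
}
```

For example, a Point with both Z and M encodes to 3001, and decoding 2001 yields a Point with M but without Z.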



##########
cpp/src/parquet/geospatial_statistics.h:
##########
@@ -0,0 +1,192 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cmath>
+#include <cstdint>
+#include <memory>
+
+#include "parquet/platform.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+/// \brief Structure represented encoded statistics to be written to and read from Parquet
+/// serialized metadata.
+///
+/// See the Parquet Thrift definition and GeoStatistics for the specific definition
+/// of field values.
+struct PARQUET_EXPORT EncodedGeoStatistics {
+  static constexpr double kInf = std::numeric_limits<double>::infinity();
+
+  double xmin{kInf};
+  double xmax{-kInf};
+  double ymin{kInf};
+  double ymax{-kInf};
+  double zmin{kInf};
+  double zmax{-kInf};
+  double mmin{kInf};
+  double mmax{-kInf};
+  std::vector<int32_t> geospatial_types;
+
+  bool has_x() const { return !std::isinf(xmin - xmax); }
+  bool has_y() const { return !std::isinf(ymin - ymax); }
+  bool has_z() const { return !std::isinf(zmin - zmax); }
+  bool has_m() const { return !std::isinf(mmin - mmax); }
+
+  bool is_set() const {
+    return !geospatial_types.empty() || has_x() || has_y() || has_z() || has_m();
+  }
+};
+
+class GeoStatisticsImpl;
+
+/// \brief Base type for computing geospatial column statistics while writing a file
+/// or representing them when reading a file
+///
+/// Note that NaN values that were encountered within coordinates are omitted; however,
+/// NaN values that were obtained via decoding encoded statistics are propagated. This
+/// behaviour ensures C++ clients that are inspecting statistics via the column metadata
+/// can detect the case where a writer generated NaNs (even though this implementation
+/// does not generate them).
+///
+/// The handling of NaN values in coordinates is not well-defined among bounding
+/// implementations except for the WKB convention for POINT EMPTY, which is consistently
+/// represented as a point whose ordinates are all NaN. Any other geometry that contains
+/// NaNs cannot expect defined behaviour here or elsewhere; however, a row group that
+/// contains both NaN-containing and normal (completely finite) geometries should not be
+/// excluded from predicate pushdown.
+///
+/// EXPERIMENTAL
+class PARQUET_EXPORT GeoStatistics {
+ public:
+  GeoStatistics();
+  explicit GeoStatistics(const EncodedGeoStatistics& encoded);
+
+  ~GeoStatistics();
+
+  /// \brief Return true if bounds, geometry types, and validity are identical
+  bool Equals(const GeoStatistics& other) const;
+
+  /// \brief Update these statistics based on previously calculated or decoded statistics
+  void Merge(const GeoStatistics& other);
+
+  /// \brief Update these statistics based on values
+  void Update(const ByteArray* values, int64_t num_values);
+
+  /// \brief Update these statistics based on the non-null elements of values
+  void UpdateSpaced(const ByteArray* values, const uint8_t* valid_bits,
+                    int64_t valid_bits_offset, int64_t num_spaced_values,
+                    int64_t num_values);
+
+  /// \brief Update these statistics based on the non-null elements of values
+  ///
+  /// Currently, BinaryArray and LargeBinaryArray input is supported.
+  void Update(const ::arrow::Array& values);
+
+  /// \brief Return these statistics to an empty state
+  void Reset();
+
+  /// \brief Encode the statistics for serializing to Thrift
+  ///
+  /// If invalid WKB was encountered, empty encoded statistics are returned
+  /// (such that is_set() returns false and they should not be written).
+  EncodedGeoStatistics Encode() const;
+
+  /// \brief Returns true if all WKB encountered was valid or false otherwise
+  bool is_valid() const;
+
+  /// \brief Reset existing statistics and populate them from previously-encoded ones
+  void Decode(const EncodedGeoStatistics& encoded);
+
+  /// \brief The minimum encountered value in the X dimension, or Inf if no non-NaN X
+  /// values were encountered.
+  ///
+  /// The Parquet definition allows for "wrap around" bounds where xmin > xmax. In this
+  /// case, these bounds represent the union of the intervals [xmax, Inf] and [-Inf,
+  /// xmin]. This implementation does not yet generate these types of bounds but they may
+  /// be encountered in files written by other writers.
+  double xmin() const;

Review Comment:
   Same as my question above: if we don't support wraparound, is the bbox still 
useful?
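
To make the wraparound question concrete, here is a minimal sketch of an X-axis overlap test that a reader could apply. It assumes that xmin > xmax means the box covers every x with x >= xmin or x <= xmax, which is my reading of the Parquet geospatial definition and should be checked against the spec. `XBoundsMayIntersect` is a hypothetical name, not a function from the PR.

```cpp
#include <cassert>
#include <cmath>

// Returns true when the stored X bounds might overlap [query_min, query_max].
// Assumption: xmin > xmax denotes a wraparound box covering x >= xmin or
// x <= xmax. NaN bounds are treated as unknown, so they never exclude.
bool XBoundsMayIntersect(double xmin, double xmax, double query_min,
                         double query_max) {
  assert(query_min <= query_max);
  if (std::isnan(xmin) || std::isnan(xmax)) {
    return true;
  }
  if (xmin <= xmax) {
    // Ordinary interval [xmin, xmax].
    return query_max >= xmin && query_min <= xmax;
  }
  // Wraparound: the covered set is (-Inf, xmax] together with [xmin, +Inf).
  return query_min <= xmax || query_max >= xmin;
}
```

Under this reading a wraparound box still allows pruning for queries that fall strictly between xmax and xmin, so the bounds remain useful as long as the reader recognizes the xmin > xmax case. The default empty state (xmin = Inf, xmax = -Inf) also falls into the second branch and intersects no finite query.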



##########
cpp/src/parquet/CMakeLists.txt:
##########
@@ -171,6 +171,9 @@ set(PARQUET_SRCS
     exception.cc
     file_reader.cc
     file_writer.cc
+    geospatial_statistics.cc
+    geospatial_util_internal.cc
+    geospatial_util_internal_json.cc

Review Comment:
   ```suggestion
       geospatial_util_json_internal.cc
   ```
   
   Please rename the header file as well, so that it is not installed publicly
by accident.


