pitrou commented on code in PR #41580:
URL: https://github.com/apache/arrow/pull/41580#discussion_r1598646369


##########
cpp/src/parquet/printer.cc:
##########
@@ -64,6 +64,25 @@ void PrintPageEncodingStats(std::ostream& stream,
 // the fixed initial size is just for an example
 #define COL_WIDTH 30
 
+void Put(std::ostream& stream, char c, int n) {
+  for (int i = 0; i < n; ++i) {
+    stream.put(c);
+  }
+}
+
+void PrintKeyValueMetadata(std::ostream& stream,
+                           const KeyValueMetadata& key_value_metadata,
+                           int indent_level = 0, int indent_width = 1) {

Review Comment:
   Seems like it would be easier to just pass the indentation string?
   ```suggestion
                              std::string_view indent = "") {
   ```



##########
cpp/src/parquet/column_writer.h:
##########
@@ -29,6 +29,7 @@
 namespace arrow {
 
 class Array;
+class KeyValueMetadata;

Review Comment:
   Can you please remove this and instead include `arrow/type_fwd.h`?



##########
cpp/src/parquet/column_writer.h:
##########
@@ -181,6 +184,10 @@ class PARQUET_EXPORT ColumnWriter {
   /// \brief The file-level writer properties
   virtual const WriterProperties* properties() = 0;
 
+  /// \brief Return KeyValueMetadata that can be used to store key-value
+  /// metadata in ColumnChunkMetaData.
+  virtual KeyValueMetadata& key_value_metadata() = 0;

Review Comment:
   +1.



##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -772,3 +772,10 @@ def test_write_metadata_fs_file_combinations(tempdir, s3_example_s3fs):
     assert meta1.read_bytes() == meta2.read_bytes() \
         == meta3.read_bytes() == meta4.read_bytes() \
         == s3_fs.open(meta5).read()
+
+
+def test_column_chunk_key_value_metadata(datadir):
+    metadata = pq.read_metadata(datadir / 
'column-chunk-key-value-metadata.parquet')
+    key_value_metadata = metadata.row_group(0).column(0).metadata
+    print(key_value_metadata)

Review Comment:
   Can you please remove this `print()` call (which I assume was meant for debugging)?



##########
cpp/src/parquet/metadata.h:
##########
@@ -452,8 +453,12 @@ class PARQUET_EXPORT ColumnChunkMetaDataBuilder {
   // column chunk
   // Used when a dataset is spread across multiple files
   void set_file_path(const std::string& path);
+
   // column metadata
   void SetStatistics(const EncodedStatistics& stats);
+
+  KeyValueMetadata& key_value_metadata();

Review Comment:
   Please let's avoid non-const references, as per our C++ conventions. Instead, you could follow the current builder pattern:
   ```suggestion
     void SetKeyValueMetadata(const std::shared_ptr<const KeyValueMetadata>&);
   ```



##########
cpp/src/parquet/column_writer.h:
##########
@@ -53,6 +54,8 @@ class Encryptor;
 class OffsetIndexBuilder;
 class WriterProperties;
 
+using KeyValueMetadata = ::arrow::KeyValueMetadata;

Review Comment:
   It's generally better to avoid `using` aliases like this in non-internal `.h` files.



##########
python/pyarrow/_parquet.pyx:
##########
@@ -507,6 +507,19 @@ cdef class ColumnChunkMetaData(_Weakrefable):
         """Whether the column chunk has a column index"""
         return self.metadata.GetColumnIndexLocation().has_value()
 
+    @property
+    def metadata(self):
+        """Additional metadata as key value pairs (dict[bytes, bytes])."""
+        cdef:
+            unordered_map[c_string, c_string] metadata
+            const CKeyValueMetadata* underlying_metadata
+        underlying_metadata = self.metadata.key_value_metadata().get()
+        if underlying_metadata != NULL:
+            underlying_metadata.ToUnorderedMap(&metadata)
+            return metadata
+        else:
+            return None

Review Comment:
   Should use `pyarrow_wrap_metadata` instead. Something like (untested):
   ```suggestion
           wrapped = pyarrow_wrap_metadata(self.metadata.key_value_metadata())
           if wrapped is not None:
               return wrapped.to_dict()
           else:
               return wrapped
   ```
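
   For context, the wrapped metadata follows the same convention that pyarrow already uses for schema-level metadata: a dict with `bytes` keys and values when metadata is present, and `None` when it is absent. A small illustration of that convention (the `foo`/`bar` pair here is just an example):
   ```python
   import pyarrow as pa

   # Schema-level key-value metadata round-trips as a dict of bytes.
   schema = pa.schema([pa.field('x', pa.int32())], metadata={'foo': 'bar'})
   assert schema.metadata == {b'foo': b'bar'}

   # A schema created without metadata reports None -- the same convention
   # the new column-chunk property should follow.
   assert pa.schema([pa.field('x', pa.int32())]).metadata is None
   ```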



##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -772,3 +772,10 @@ def test_write_metadata_fs_file_combinations(tempdir, s3_example_s3fs):
     assert meta1.read_bytes() == meta2.read_bytes() \
         == meta3.read_bytes() == meta4.read_bytes() \
         == s3_fs.open(meta5).read()
+
+
+def test_column_chunk_key_value_metadata(datadir):
+    metadata = pq.read_metadata(datadir / 
'column-chunk-key-value-metadata.parquet')
+    key_value_metadata = metadata.row_group(0).column(0).metadata
+    print(key_value_metadata)
+    assert key_value_metadata[b'foo'] == b'bar'

Review Comment:
   1. Can you test the entire dict contents?
   2. Can you add a test where the metadata is `None`?
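   
   A sketch of how the second point might be covered, assuming a pyarrow build that already includes this PR's `ColumnChunkMetaData.metadata` property. The exact dict contents of the checked-in test file aren't shown here, so this only demonstrates the `None` case by writing a fresh file that carries no column-chunk key-value metadata:
   ```python
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   # A file written without column-chunk key-value metadata should
   # report None from the new property.
   table = pa.table({'col': [1, 2, 3]})
   with tempfile.TemporaryDirectory() as tmpdir:
       path = os.path.join(tmpdir, 'no-kv-metadata.parquet')
       pq.write_table(table, path)
       col_md = pq.read_metadata(path).row_group(0).column(0).metadata
       assert col_md is None
   ```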
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
