pitrou commented on code in PR #39781:
URL: https://github.com/apache/arrow/pull/39781#discussion_r1480255098
##########
python/pyarrow/_parquet.pyx:
##########
@@ -849,6 +849,18 @@ cdef class FileMetaData(_Weakrefable):
         cdef Buffer buffer = sink.getvalue()
         return _reconstruct_filemetadata, (buffer,)
 
+    def __hash__(self):
+        def flatten(obj):
Review Comment:
But even then, this approach would not be very performant, whereas a hash is
supposed to be quick to compute:
```python
>>> md = pq.read_metadata("/home/antoine/arrow/dev/python/lineitem-snappy.pq")
>>> md
<pyarrow._parquet.FileMetaData object at 0x7f87befc19e0>
  created_by: parquet-cpp-arrow version 15.0.0-SNAPSHOT
  num_columns: 16
  num_rows: 2568534
  num_row_groups: 21
  format_version: 2.6
  serialized_size: 32440
>>> md.__reduce__()
(<cyfunction _reconstruct_filemetadata at 0x7f87c02fea80>, (<pyarrow.Buffer address=0x7f88082086c0 size=32440 is_cpu=True is_mutable=True>,))
>>> hash(md.__reduce__()[1][0].to_pybytes())
8779236492631648618
>>> %timeit hash(md.__reduce__()[1][0].to_pybytes())
93.6 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
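For reference, hashing the serialized form could be wrapped up as in this
minimal sketch (just the calls shown above put into a function; a possible
direction, not a finished implementation):
```python
import pyarrow.parquet as pq

def serialized_hash(md: pq.FileMetaData) -> int:
    # __reduce__ returns (_reconstruct_filemetadata, (buffer,));
    # hashing the serialized bytes avoids materializing any Python
    # representation of the metadata.
    buf = md.__reduce__()[1][0]
    return hash(buf.to_pybytes())
```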
It will be far worse with this PR's approach, since generating the dict is
extremely costly:
```python
>>> %timeit md.to_dict()
2.95 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
(note: milliseconds, as opposed to microseconds above)
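(For illustration, a dict-based `__hash__` along the lines of this diff would
presumably look something like the sketch below. Only two lines of the hunk
are visible here, so the `flatten` body is a hypothetical reconstruction;
whatever it actually does, the `to_dict()` call alone already dominates the
cost.)
```python
def dict_based_hash(md):
    def flatten(obj):
        # Hypothetical reconstruction: recursively turn the unhashable
        # dicts and lists produced by to_dict() into hashable tuples.
        if isinstance(obj, dict):
            return tuple((k, flatten(v)) for k, v in obj.items())
        if isinstance(obj, (list, tuple)):
            return tuple(flatten(v) for v in obj)
        return obj

    # The ~3 ms to_dict() call runs on every hash() invocation.
    return hash(flatten(md.to_dict()))
```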