pitrou commented on code in PR #39781:
URL: https://github.com/apache/arrow/pull/39781#discussion_r1480255098
##########
python/pyarrow/_parquet.pyx:
##########
@@ -849,6 +849,18 @@ cdef class FileMetaData(_Weakrefable):
         cdef Buffer buffer = sink.getvalue()
         return _reconstruct_filemetadata, (buffer,)
 
+    def __hash__(self):
+        def flatten(obj):
Review Comment:
But even then, this approach would not be very performant, whereas a hash is
supposed to be quick to compute:
```python
>>> md = pq.read_metadata("/home/antoine/arrow/dev/python/lineitem-snappy.pq")
>>> md
<pyarrow._parquet.FileMetaData object at 0x7f87befc19e0>
  created_by: parquet-cpp-arrow version 15.0.0-SNAPSHOT
  num_columns: 16
  num_rows: 2568534
  num_row_groups: 21
  format_version: 2.6
  serialized_size: 32440
>>> md.__reduce__()
(<cyfunction _reconstruct_filemetadata at 0x7f87c02fea80>, (<pyarrow.Buffer address=0x7f88082086c0 size=32440 is_cpu=True is_mutable=True>,))
>>> hash(md.__reduce__()[1][0].to_pybytes())
8779236492631648618
>>> %timeit hash(md.__reduce__()[1][0].to_pybytes())
93.6 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
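For reference, hashing the serialized form could be wrapped up as in this
minimal sketch (just the calls shown above put into a function; a possible
direction, not a finished implementation):
```python
import pyarrow.parquet as pq

def serialized_hash(md: pq.FileMetaData) -> int:
    # __reduce__ returns (_reconstruct_filemetadata, (buffer,));
    # hashing the serialized bytes avoids materializing any Python
    # representation of the metadata.
    buf = md.__reduce__()[1][0]
    return hash(buf.to_pybytes())
```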
It will be far worse with this PR's approach, since generating the dict is
extremely costly:
```python
>>> %timeit md.to_dict()
2.95 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
(note: milliseconds, as opposed to microseconds above)
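(For illustration, a dict-based `__hash__` along the lines of this diff would
presumably look something like the sketch below. Only two lines of the hunk
are visible here, so the `flatten` body is a hypothetical reconstruction;
whatever it actually does, the `to_dict()` call alone already dominates the
cost.)
```python
def dict_based_hash(md):
    def flatten(obj):
        # Hypothetical reconstruction: recursively turn the unhashable
        # dicts and lists produced by to_dict() into hashable tuples.
        if isinstance(obj, dict):
            return tuple((k, flatten(v)) for k, v in obj.items())
        if isinstance(obj, (list, tuple)):
            return tuple(flatten(v) for v in obj)
        return obj

    # The ~3 ms to_dict() call runs on every hash() invocation.
    return hash(flatten(md.to_dict()))
```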