pitrou commented on code in PR #39781:
URL: https://github.com/apache/arrow/pull/39781#discussion_r1480266828
##########
python/pyarrow/_parquet.pyx:
##########
@@ -849,6 +849,18 @@ cdef class FileMetaData(_Weakrefable):
         cdef Buffer buffer = sink.getvalue()
         return _reconstruct_filemetadata, (buffer,)
+    def __hash__(self):
+        def flatten(obj):
Review Comment:
As a general rule, a `__hash__` does not need to take all contents
exhaustively into account if doing so is too expensive. It's ok to limit
ourselves to a smaller, faster subset. For example, only the schema:
```python
>>> %timeit hash(repr(md.schema))
6.59 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```
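To make that concrete, here is a minimal sketch of what a schema-only hash on `FileMetaData` could look like (just a sketch, not necessarily what this PR should do). Hashing only a subset means distinct metadata objects may collide, which is fine as long as `__eq__` keeps comparing the full contents:
```python
    def __hash__(self):
        # Hash only the schema repr instead of flattening the whole metadata
        # (row groups, column statistics, ...). Metadata objects that share a
        # schema will collide, which is acceptable: __eq__ still does the
        # exact comparison.
        return hash((self.__class__, repr(self.schema)))
```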
You can also convert to an Arrow schema:
```python
>>> %timeit md.schema.to_arrow_schema()
4.94 µs ± 8.25 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```
Unfortunately, `pa.Schema.__hash__` is too slow currently:
```python
>>> s = md.schema.to_arrow_schema()
>>> %timeit hash(s)
12.7 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```
This is because `pa.Schema.__hash__` takes the [slow Python
way](https://github.com/apache/arrow/blob/a6e577d031d20a1a7d3dd01536b9a77db5d1bff8/python/pyarrow/types.pxi#L2524-L2525),
while the C++ Schema has a useful and fast `fingerprint` method (same goes
for `pa.Field` and `pa.DataType`).
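For reference, assuming `fingerprint()` were added to the `CSchema` declarations in `libarrow.pxd` (as far as I can tell it is not exposed there today), `Schema.__hash__` in `types.pxi` could become something like the rough sketch below:
```python
    def __hash__(self):
        # Schema::fingerprint() is computed lazily and cached on the C++
        # side, so repeated hashing amortizes to hashing a short bytes object.
        # An empty fingerprint means some child type doesn't support
        # fingerprinting; a real implementation would fall back to the
        # current Python-level hash in that case.
        return hash(self.schema.fingerprint())
```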
Of course you don't have to do all this here, but a `__hash__` function that
takes milliseconds to execute is definitely not a good idea.