pitrou commented on code in PR #39781:
URL: https://github.com/apache/arrow/pull/39781#discussion_r1480266828


##########
python/pyarrow/_parquet.pyx:
##########
@@ -849,6 +849,18 @@ cdef class FileMetaData(_Weakrefable):
         cdef Buffer buffer = sink.getvalue()
         return _reconstruct_filemetadata, (buffer,)
 
+    def __hash__(self):
+        def flatten(obj):

Review Comment:
   As a general rule, a `__hash__` does not need to take all contents exhaustively into account if doing so is too expensive. It's OK to limit ourselves to a smaller, faster subset, for example only the schema:
   ```python
   >>> %timeit hash(repr(md.schema))
   6.59 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
   ```
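   
   For instance, a possible sketch of such a cheaper `__hash__` (illustrative only, not the PR's code) would combine a few cheap scalar properties with the schema repr:
   ```python
   def __hash__(self):
       # Hash a small, cheap subset of the metadata instead of flattening
       # every row group and column chunk.
       return hash((self.num_columns,
                    self.num_rows,
                    self.num_row_groups,
                    repr(self.schema)))
   ```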
   
   You can also convert to an Arrow schema:
   ```python
   >>> %timeit md.schema.to_arrow_schema()
   4.94 µs ± 8.25 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
   ```
   
   Unfortunately, `pa.Schema.__hash__` is too slow currently:
   ```python
   >>> s = md.schema.to_arrow_schema()
   >>> %timeit hash(s)
   12.7 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
   ```
   
   This is because `pa.Schema.__hash__` takes the [slow Python way](https://github.com/apache/arrow/blob/a6e577d031d20a1a7d3dd01536b9a77db5d1bff8/python/pyarrow/types.pxi#L2524-L2525), while the C++ `Schema` has a useful and fast `fingerprint` method (the same comment applies to `pa.Field` and `pa.DataType`).
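   
   A hypothetical sketch of what a fingerprint-based `Schema.__hash__` could look like, assuming `fingerprint()` were exposed in the Cython declarations for `CSchema` (this is not existing pyarrow code):
   ```python
   def __hash__(self):
       # Hypothetical: assumes `c_string fingerprint() const` is declared on
       # CSchema in the Cython definitions.
       cdef c_string fp = self.schema.fingerprint()
       if fp.size() > 0:
           # Fast path: hash the C++ fingerprint string directly.
           return hash(fp)
       # Fingerprinting is not supported for every type; fall back to the
       # current Python-level path.
       return hash((tuple(self), self.metadata))
   ```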
   
   Of course you don't have to do all this here, but a `__hash__` function that 
takes milliseconds to execute is definitely not a good idea. 


