pitrou commented on code in PR #39781:
URL: https://github.com/apache/arrow/pull/39781#discussion_r1480273615


##########
python/pyarrow/_parquet.pyx:
##########
@@ -1071,6 +1083,11 @@ cdef class ParquetSchema(_Weakrefable):
     def __getitem__(self, i):
         return self.column(i)
 
+    def __hash__(self):
+        return hash(
+            (object.__hash__(self), frombytes(self.schema.ToString(), safe=True))

Review Comment:
   Two things:
   1. `object.__hash__` hashes by identity, which is precisely what you want to avoid here:
   ```python
   >>> s = md.schema.to_arrow_schema()
   >>> s2 = md.schema.to_arrow_schema()
   >>> object.__hash__(s)
   8763812462659
   >>> object.__hash__(s2)
   8763812583391
   ```
   2. `frombytes` is not necessary; you can just hash the bytes object directly, which should be slightly faster:
   ```python
   >>> s.to_string()
   'l_orderkey: int32\nl_partkey: int32\nl_suppkey: int32\nl_linenumber: int32\nl_quantity: double\nl_extendedprice: double\nl_discount: double\nl_tax: double\nl_returnflag: string\nl_linestatus: string\nl_shipdate: timestamp[us]\nl_commitdate: timestamp[us]\nl_receiptdate: timestamp[us]\nl_shipinstruct: string\nl_shipmode: string\nl_comment: string'
   >>> b = s.to_string().encode()
   >>> %timeit hash(b)
   56.4 ns ± 3.83 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
   >>> %timeit hash(b.decode())
   242 ns ± 2.15 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
   ```
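
   Combined, the suggestion might end up looking roughly like the sketch below (just an illustration, not the actual patch; it assumes `self.schema.ToString()` already comes back as a bytes object on the Cython side, as in the diff above):
   ```python
   # hypothetical sketch, inside ParquetSchema in python/pyarrow/_parquet.pyx
   def __hash__(self):
       # Hash the schema's serialized form directly as bytes:
       # no identity component, and no bytes -> str decode on the way.
       return hash(self.schema.ToString())
   ```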


