itholic commented on code in PR #44236:
URL: https://github.com/apache/spark/pull/44236#discussion_r1419930149
##########
python/pyspark/pandas/indexing.py:
##########
@@ -563,6 +563,16 @@ def __getitem__(self, key: Any) -> Union["Series", "DataFrame"]:
else:
psdf_or_psser = psdf
+ if isinstance(key, list):
+ result_index = psdf_or_psser.index
+ if len(key) != len(result_index):
+ # Since the result Index size is expected to be small,
Review Comment:
Oh... I thought the length of `result_index` could not exceed the length of
the given `key`, but on second thought it could be very long when the same
index value appears multiple times.
Alternatively, I think we can compare only the number of unique values in
`key` and `result_index`, and simply raise a `KeyError` if they differ.
e.g.
```python
if isinstance(key, list):
    result_index = psdf_or_psser.index
    # If the numbers of unique values differ, one or more keys are missing from the Index.
    if len(set(key)) != len(result_index.drop_duplicates()):
        raise KeyError(
            "There is a key that does not exist in the Index "
            "among the list of given keys."
        )
```
However, this could also be expensive, since it calls `drop_duplicates()` and
`len()` (but at least it will not cause an OOM).
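To illustrate the unique-count comparison, here is a plain-Python sketch with lists standing in for the Index. `select_rows` and `has_missing_key` are hypothetical helpers, not Pandas API on Spark functions, and the sketch assumes the result index only contains values that were requested in `key`.

```python
from typing import Hashable, List, Sequence


def select_rows(index_values: Sequence[Hashable], keys: Sequence[Hashable]) -> List[Hashable]:
    """Hypothetical stand-in for the selection: keep index values that match a key."""
    wanted = set(keys)
    return [value for value in index_values if value in wanted]


def has_missing_key(keys: Sequence[Hashable], result_index: Sequence[Hashable]) -> bool:
    """Compare unique counts only, so duplicated index values do not matter.

    Assumes result_index contains nothing outside keys, so a smaller unique
    count means at least one key matched no row.
    """
    return len(set(keys)) != len(set(result_index))


# An index where the value "a" is repeated several times.
index_values = ["a", "a", "a", "b"]

# All keys exist: the raw lengths differ (2 vs 4), but the unique counts match.
found = select_rows(index_values, ["a", "b"])
print(len(found) != len(["a", "b"]), has_missing_key(["a", "b"], found))

# "c" is not in the index: the unique counts differ (2 vs 1).
partial = select_rows(index_values, ["a", "c"])
print(has_missing_key(["a", "c"], partial))
```

The first case is the one where a plain `len(key) != len(result_index)` check would wrongly report a missing key.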
If you think these methods are still too expensive, maybe we can simply add a
note saying that, unlike Pandas, we return the result excluding any given key
that does not exist in the Index instead of raising a `KeyError`.
e.g.
```
.. note:: When a key is given as a list, Pandas raises a `KeyError` when a specific key
    does not exist, but Pandas API on Spark simply returns a result excluding the key
    that does not exist to avoid performance degradation.
```
WDYT?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]