Re: [PR] [SPARK-46306][PS] Fix `LocIndexer` to work properly when the key is missing [spark]

via GitHub Thu, 07 Dec 2023 21:06:27 -0800


itholic commented on code in PR #44236:
URL: https://github.com/apache/spark/pull/44236#discussion_r1419930149



##########
python/pyspark/pandas/indexing.py:
##########
@@ -563,6 +563,16 @@ def __getitem__(self, key: Any) -> Union["Series", "DataFrame"]:
         else:
             psdf_or_psser = psdf
 
+        if isinstance(key, list):
+            result_index = psdf_or_psser.index
+            if len(key) != len(result_index):
+                # Since the result Index size is expected to be small,

Review Comment:
   Oh... I thought the length of `result_index` could not exceed the length of the given `key`, but on second thought it can be much longer when the same index value appears multiple times.
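
To illustrate the point, here is a rough plain-Python model of label-based selection. It is only a sketch: `select_labels` is a hypothetical helper, not the actual pyspark.pandas code path, and it assumes the selection returns every row whose label matches a key.

```python
# Plain-Python model of label-based selection; not pyspark.pandas code.
def select_labels(index_labels, keys):
    """Return the labels of all rows matched by `keys` (hypothetical helper)."""
    return [label for key in keys for label in index_labels if label == key]


index_labels = ["a", "a", "a", "b"]  # the label "a" appears three times
keys = ["a", "b"]

result_index = select_labels(index_labels, keys)
print(len(keys), len(result_index))  # 2 4
```

So two keys can match four rows, and with heavily duplicated labels the result index is not bounded by `len(key)`.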
   
   Alternatively, I think we can compare only the number of unique values in `key` and in `result_index`, and simply raise a `KeyError` if they differ.
   
   e.g.
   ```python
   if isinstance(key, list):
       result_index = psdf_or_psser.index
       # If the number of unique values differs, some key(s) are missing from the Index.
       if len(set(key)) != len(result_index.drop_duplicates()):
           raise KeyError(
               "There is a key that does not exist in the Index among the list of given keys."
           )
   ```
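
The same idea can be sketched in plain Python, outside Spark. This is an illustration only: `check_missing_keys` is a hypothetical helper, and it assumes the result labels are always a subset of the requested keys (which holds when the result was produced by filtering on those keys). Reporting which keys are missing is an extra that is not part of the suggestion above.

```python
# Plain-Python model of the unique-value comparison; not pyspark.pandas code.
def check_missing_keys(keys, result_labels):
    """Raise KeyError if some requested key has no matching label (hypothetical helper)."""
    requested = set(keys)
    found = set(result_labels)  # assumed to be a subset of `requested`
    if len(requested) != len(found):
        missing = sorted(requested - found, key=repr)
        raise KeyError(f"Keys not found in the Index: {missing}")


# Duplicated labels do not trigger the error: both unique sets have size 2.
check_missing_keys(["a", "b"], ["a", "a", "b"])

# A key without any matching label does.
try:
    check_missing_keys(["a", "z"], ["a", "a"])
    raised = False
except KeyError as exc:
    raised = True
    message = str(exc)
```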
   
   However, this could also be expensive because it still calls `drop_duplicates()` and `len()` (but at least an OOM will not occur).
   
   If you think this approach is still too expensive, maybe we can simply add a note saying that a key missing from the Index is excluded from the result, instead of raising a `KeyError` as Pandas does.
   
   e.g.
   ```
   .. note:: When a key is given as a list, Pandas raises a `KeyError` when a specific key
       does not exist, but Pandas API on Spark simply returns a result excluding the key
       that does not exist to avoid performance degradation.
   ```
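
For reference, the behavior that note describes would look roughly like this in a plain-Python model. `lenient_select` is a hypothetical helper that assumes the selection is implemented as a filter on the labels; it is not the actual pyspark.pandas implementation.

```python
# Plain-Python model of filter-style selection; not pyspark.pandas code.
def lenient_select(rows, keys):
    """Keep rows whose label is in `keys`; missing keys are silently ignored."""
    wanted = set(keys)
    return [(label, value) for label, value in rows if label in wanted]


rows = [("a", 1), ("b", 2), ("c", 3)]
selected = lenient_select(rows, ["a", "x"])  # "x" is not a label in `rows`
print(selected)  # [('a', 1)]
```

No extra pass over the result is needed here, which is the performance argument for documenting the difference rather than raising.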
   
   WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


