Yikun commented on a change in pull request #35191:
URL: https://github.com/apache/spark/pull/35191#discussion_r818277484
##########
File path: python/pyspark/pandas/series.py
##########
@@ -5228,10 +5229,22 @@ def asof(self, where: Union[Any, List]) -> Union[Scalar, "Series"]:
where = [where]
index_scol = self._internal.index_spark_columns[0]
index_type = self._internal.spark_type_for(index_scol)
+
+ if np.nan in where:
+ # When `where` is np.nan, pandas returns the last index value.
+ max_index = self._internal.spark_frame.select(F.last(index_scol)).take(1)[0][0]
+ modified_where = [max_index if x is np.nan else x for x in where]
+ else:
+ modified_where = where
+
cond = [
- F.max(F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column))
- for index in where
+ F.last(
+ F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column),
+ ignorenulls=True,
+ )
+ for idx, index in enumerate(modified_where)
]
+
Review comment:
nits: unrelated new line
##########
File path: python/pyspark/pandas/series.py
##########
@@ -5228,10 +5229,22 @@ def asof(self, where: Union[Any, List]) -> Union[Scalar, "Series"]:
where = [where]
index_scol = self._internal.index_spark_columns[0]
index_type = self._internal.spark_type_for(index_scol)
+
+ if np.nan in where:
+ # When `where` is np.nan, pandas returns the last index value.
+ max_index = self._internal.spark_frame.select(F.last(index_scol)).take(1)[0][0]
Review comment:
nits: maybe `last_index`?
`max` is a little bit confusing here.
##########
File path: python/pyspark/pandas/series.py
##########
@@ -5228,10 +5229,22 @@ def asof(self, where: Union[Any, List]) ->
Union[Scalar, "Series"]:
where = [where]
index_scol = self._internal.index_spark_columns[0]
index_type = self._internal.spark_type_for(index_scol)
+
+ if np.nan in where:
+ # When `where` is np.nan, pandas returns the last index value.
+ max_index = self._internal.spark_frame.select(F.last(index_scol)).take(1)[0][0]
+ modified_where = [max_index if x is np.nan else x for x in where]
Review comment:
I'm ok with this.
- The job is only triggered in the `np.nan in where` corner case, so there is no extra
cost for the normal case.
- Considering that `where` is not huge, the cost of the `np.nan in where` check is
acceptable here.
- This way we match the pandas behavior.
If we wanted to optimize further, we could use
`self.index.values[-1]` or `self.index.to_numpy()[-1]` depending on the size of the
index (`self.index.size`), which would be faster for small data. But I'm not
sure whether that's necessary.
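For context, a minimal plain-pandas sketch of the behavior the patch is matching: when `where` is `np.nan`, pandas `Series.asof` effectively resolves it to the last index value (NaN sorts past the end in the index lookup, so the last position is used). The series values and index below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data; indices are sorted as asof requires.
s = pd.Series([1.0, 2.0, 3.0], index=[10, 20, 30])

# A NaN `where` falls past every index entry, so asof picks the
# value at the last index (30), i.e. the same result as asof(30).
result = s.asof(np.nan)
```

This is why the patch replaces `np.nan` entries in `where` with the last index value before building the Spark conditions.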
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]