[
https://issues.apache.org/jira/browse/SPARK-38844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-38844:
------------------------------------
Assignee: Apache Spark
> impl Series.interpolate and DataFrame.interpolate
> -------------------------------------------------
>
> Key: SPARK-38844
> URL: https://issues.apache.org/jira/browse/SPARK-38844
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: zhengruifeng
> Assignee: Apache Spark
> Priority: Major
>
> h2. Goal:
> [pandas's
> interpolate|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html]
> supports many methods; _linear_ is the default, and some other methods ( _pad_,
> _ffill_, _backfill_, _bfill_ ) can also be implemented in pandas API on Spark.
> The remaining ones ( including _quadratic_, _cubic_, _spline_ ) cannot be
> implemented easily, since scipy is used internally and the window frame used
> is complex.
> Since the methods ( _pad_, _ffill_, _backfill_, _bfill_ ) are already implemented
> in pandas API on Spark via {_}fillna{_}, this work currently focuses on
> implementing the missing *linear interpolation*.
> h2. Impl:
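> To implement linear interpolation, two extra window functions are added:
> one ( _null_index_ ) computes the position of each missing value within its
> consecutive run of missing values, and the other ({_}last_not_null{_}) keeps
> the last non-missing value. Each is evaluated in both the forward and the
> backward direction, which gives the four helper columns in the table.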
> ||index||value||_null_index_forward_||_last_not_null_forward_||_null_index_backward_||_last_not_null_backward_||filled||filled (limit=1)||
> |1|nan|1|nan|1|1|-|-|
> |2|1|0|1|0|1| | |
> |3|nan|1|1|3|5|2.0|2.0|
> |4|nan|2|1|2|5|3.0|-|
> |5|nan|3|1|1|5|4.0|-|
> |6|5|0|5|0|5| | |
> |7|6|0|6|0|6| | |
> |8|nan|1|6|2|nan|6.0|6.0|
> |9|nan|2|6|1|nan|6.0|-|
> * for the NaNs at indices (3,4,5), we always compute the filled value via
> ({_}last_not_null_backward{_} - {_}last_not_null_forward{_}) /
> ({_}null_index_forward{_} + {_}null_index_backward{_}) * _null_index_forward_
> + _last_not_null_forward_
> * for the NaN at index (1), skip it because the default *limit_direction* is
> _forward_
> * for the NaNs at indices (8,9), fill them like _ffill_ with the value
> _last_not_null_forward_
> * if _limit_ is set, NaNs with _null_index_forward_ greater than
> _limit_ will not be interpolated.
> h2. Plan
> 1, impl the basic _linear interpolate_ with param _limit_
> 2, add param _limit_direction_
> 3, add param _limit_area_
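
The rules in the Impl section (step 1 of the plan: linear interpolation with _limit_ and the default forward direction) can be sketched in plain Python. This is only an illustration of the arithmetic, not the PySpark implementation: it scans an in-memory list instead of using window functions, and the names helper_columns and linear_interpolate are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def _is_missing(x: float) -> bool:
    """Treat NaN as the missing-value marker."""
    return math.isnan(x)


def helper_columns(values: List[float]) -> Tuple[List[int], List[float]]:
    """One directional scan (hypothetical helper).

    Returns, per row:
      null_index    - position inside the current run of NaNs (0 if not NaN)
      last_not_null - most recent non-missing value seen so far (NaN if none)
    """
    null_index, last_not_null = [], []
    run, last = 0, math.nan
    for v in values:
        if _is_missing(v):
            run += 1
        else:
            run, last = 0, v
        null_index.append(run)
        last_not_null.append(last)
    return null_index, last_not_null


def linear_interpolate(values: List[float], limit: Optional[int] = None) -> List[float]:
    """Linear interpolation with limit_direction='forward' (assumed default)."""
    fwd_idx, fwd_last = helper_columns(values)
    # The backward columns are the same scan run over the reversed sequence.
    bwd_idx_rev, bwd_last_rev = helper_columns(values[::-1])
    bwd_idx, bwd_last = bwd_idx_rev[::-1], bwd_last_rev[::-1]

    filled = []
    for v, fi, fl, bi, bl in zip(values, fwd_idx, fwd_last, bwd_idx, bwd_last):
        if not _is_missing(v):
            filled.append(v)          # existing value, nothing to fill
        elif _is_missing(fl):
            filled.append(math.nan)   # leading NaNs: skipped in forward direction
        elif limit is not None and fi > limit:
            filled.append(math.nan)   # beyond the limit: left as NaN
        elif _is_missing(bl):
            filled.append(fl)         # trailing NaNs: filled like ffill
        else:
            # (last_not_null_backward - last_not_null_forward)
            #   / (null_index_forward + null_index_backward)
            #   * null_index_forward + last_not_null_forward
            filled.append((bl - fl) / (fi + bi) * fi + fl)
    return filled


nan = math.nan
example = [nan, 1.0, nan, nan, nan, 5.0, 6.0, nan, nan]
print(linear_interpolate(example))
print(linear_interpolate(example, limit=1))
```

Run on the nine-row example from the table, the sketch gives 2.0, 3.0, 4.0 for indices 3 to 5 and 6.0 for indices 8 and 9. With limit=1, only indices 3 and 8 are filled.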