[ 
https://issues.apache.org/jira/browse/BEAM-12550?focusedWorklogId=673695&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-673695
 ]

ASF GitHub Bot logged work on BEAM-12550:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 02/Nov/21 18:26
            Start Date: 02/Nov/21 18:26
    Worklog Time Spent: 10m 
      Work Description: svetakvsundhar commented on a change in pull request #15809:
URL: https://github.com/apache/beam/pull/15809#discussion_r741348892



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -1430,6 +1430,72 @@ def corr(self, other, method, min_periods):
               [self._expr, other._expr],
               requires_partition_by=partitionings.Singleton(reason=reason)))
 
+  @frame_base.with_docs_from(pd.Series)
+  @frame_base.args_to_kwargs(pd.Series)
+  @frame_base.populate_defaults(pd.Series)
+  def skew(self, axis, skipna, level, numeric_only, **kwargs):
+    if level is not None:
+      raise NotImplementedError("per-level aggregation")
+    if skipna is None or skipna:
+      self = self.dropna()  # pylint: disable=self-cls-assignment
+    # See the online, numerically stable formulae at
+    # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
+    def compute_moments(x):
+      n = len(x)
+      if n == 0:
+        m, s, third_moment = 0, 0, 0
+      elif n < 3:
+        m = x.std(ddof=0)**2 * n
+        s = x.sum()
+        third_moment = (((x - x.mean())**3).sum())
+      else:
+        m = x.std(ddof=0)**2 * n
+        s = x.sum()
+        third_moment = (((x - x.mean())**3).sum())
+      return pd.DataFrame(
+          dict(m=[m], s=[s], n=[n], third_moment=[third_moment]))
+
+    def combine_moments(data):
+      m = s = n = third_moment = 0.0
+      for datum in data.itertuples():
+        if datum.n == 0:
+          continue
+        elif n == 0:
+          m, s, n, third_moment = datum.m, datum.s, datum.n, datum.third_moment
+        else:
+          mean_b = s / n
+          mean_a = datum.s / datum.n
+          delta = mean_b - mean_a
+          n_a = datum.n
+          n_b = n
+          combined_n = n + datum.n
+          third_moment += datum.third_moment + (
+              (delta**3 * ((n_a * n_b) * (n_a - n_b)) / ((combined_n)**2)) +
+              ((3 * delta) * ((n_a * m) - (n_b * datum.m)) / (combined_n)))
+          m += datum.m + delta**2 * n * datum.n / (n + datum.n)
+          s += datum.s
+          n += datum.n
+
+      if n < 3:
+        return float('nan')
+      elif m == 0:
+        return float(0)

Review comment:
       Whoops, yes, disregard that statement; it doesn't hold for unbiased skew.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 673695)
    Time Spent: 3h 10m  (was: 3h)

> Implement parallelizable skew and kurtosis 
> -------------------------------------------
>
>                 Key: BEAM-12550
>                 URL: https://issues.apache.org/jira/browse/BEAM-12550
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe
>            Reporter: Brian Hulette
>            Assignee: Svetak Vihaan Sundhar
>            Priority: P3
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> skew and kurtosis should be parallelizable/liftable by using a similar 
> [approach as std and 
> var|https://github.com/apache/beam/blob/a0f5e932d8a9aa491b16361abdc629b5e9a483f6/sdks/python/apache_beam/dataframe/frames.py#L1307-L1310].
>  See 
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
> which has information on extending that approach to calculating the third and 
> fourth central moments, needed for skew and kurtosis.
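
The approach described above can be sketched in plain Python, independent of Beam and pandas. This is a minimal illustration, not the code from the pull request: the function names (moments, combine, unbiased_skew) and the tuple layout (n, mean, M2, M3) are assumptions made for the example. Each partition is reduced to its count, mean, and second and third central sums; partitions are then merged with the pairwise formulas from the Wikipedia article linked above. The final step assumes the adjusted Fisher-Pearson (bias-corrected) definition of sample skewness, and mirrors the n < 3 and zero-variance special cases visible in the diff.

```python
import math
from functools import reduce


def moments(xs):
    """Return (n, mean, M2, M3) for one partition, where Mk = sum((x - mean)**k)."""
    n = len(xs)
    if n == 0:
        return (0, 0.0, 0.0, 0.0)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    m3 = sum((x - mean) ** 3 for x in xs)
    return (n, mean, m2, m3)


def combine(a, b):
    """Merge the moments of two partitions with the pairwise update formulas."""
    n_a, mean_a, m2_a, m3_a = a
    n_b, mean_b, m2_b, m3_b = b
    if n_a == 0:
        return b
    if n_b == 0:
        return a
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    m3 = (m3_a + m3_b
          + delta ** 3 * n_a * n_b * (n_a - n_b) / n ** 2
          + 3 * delta * (n_a * m2_b - n_b * m2_a) / n)
    return (n, mean, m2, m3)


def unbiased_skew(state):
    """Bias-corrected sample skewness (assumed definition) from merged moments."""
    n, _, m2, m3 = state
    if n < 3:
        return float("nan")  # undefined for fewer than three observations
    if m2 == 0:
        return 0.0  # constant data: treated as zero skew, as in the diff
    return math.sqrt(n * (n - 1)) / (n - 2) * (m3 / n) / (m2 / n) ** 1.5


# Usage: skew computed from three partitions matches the single-pass result.
partitions = [[1.0, 2.0, 2.0], [3.0, 10.0], [4.0, 7.0]]
merged = reduce(combine, map(moments, partitions))
whole = moments([x for part in partitions for x in part])
print(unbiased_skew(merged), unbiased_skew(whole))
```

Because combine is associative up to floating-point rounding, the partitions can be merged in any grouping, which is what makes the computation liftable.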



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
