[ 
https://issues.apache.org/jira/browse/BEAM-12550?focusedWorklogId=677973&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-677973
 ]

ASF GitHub Bot logged work on BEAM-12550:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Nov/21 00:24
            Start Date: 06/Nov/21 00:24
    Worklog Time Spent: 10m 
      Work Description: TheNeuralBit commented on pull request #15909:
URL: https://github.com/apache/beam/pull/15909#issuecomment-962291253


   The pandas/_libs/testing.pyx errors look like real errors in 
[`test_dataframe_agg_method`](https://github.com/apache/beam/blob/a3bb58dbd4fc6a59f93f69f9ab1980f8232b6e82/sdks/python/apache_beam/dataframe/frames_test.py#L1490):
   ```
   >   ???
   E   AssertionError: Series are different
   E   
   E   Series values are different (50.0 %)
   E   [index]: [A, B]
   E   [left]:  [-1.1999999999999993, -1.699511634587763]
   E   [right]: [-1.200000000000001, -0.40130739795918835]
   ```
   They just happen to come from `pd.testing.assert_frame_equal`, which we use 
to verify if DataFrame results are equivalent:  
https://github.com/apache/beam/blob/a3bb58dbd4fc6a59f93f69f9ab1980f8232b6e82/sdks/python/apache_beam/dataframe/frames_test.py#L175-L176
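
For illustration only, here is a minimal sketch that reproduces the shape of that message with plain pandas. It assumes the comparison goes through `pd.testing.assert_series_equal` (the log says "Series are different") with default tolerances; the two Series are copied from the `[left]` and `[right]` lines above, and the variable names are hypothetical.

```python
import pandas as pd

# Values copied from the failure log above.
left = pd.Series([-1.1999999999999993, -1.699511634587763], index=["A", "B"])
right = pd.Series([-1.200000000000001, -0.40130739795918835], index=["A", "B"])

# Comparing both columns fails, because B differs far beyond any tolerance.
try:
    pd.testing.assert_series_equal(left, right)
    message = ""
except AssertionError as err:
    message = str(err)

print(message)

# Column A alone passes: the two values differ only by floating-point noise,
# which the default relative tolerance absorbs.
pd.testing.assert_series_equal(left[["A"]], right[["A"]])
```

So one of two values (50%) is reported as different, and it is column B rather than the rounding noise in column A.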
   
   I also see a couple of failures for `test_series_cov_corr` indicating it may 
be a little flaky, like [this 
one](https://ci-beam.apache.org/job/beam_PreCommit_Python_Commit/20405/testReport/junit/apache_beam.dataframe.frames_test/DeferredFrameTest/test_series_cov_corr_8/):
   
   ```
   apache_beam/dataframe/frames_test.py:191: in _run_test
       self.assertTrue(
   E   AssertionError: False is not true : Expected:
   E   
   E   -1.2
   E   
   E   Actual:
   E   
   E   -1.1999545602598247
   ```
   
   That's off by just 5e-5, but I guess it's enough for np.isclose to consider 
it different. If we can rule out an actual cause for this difference, we may 
want to plumb through an option for increasing the tolerance, like we discussed 
for skew. But it seems like something else may be going on here.
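
As a rough check of the tolerance arithmetic, the sketch below compares the two values from the log using `np.isclose`, first with its defaults (`rtol=1e-05`, `atol=1e-08`) and then with a looser `rtol`. The value `rtol=1e-4` is only an example, not a proposal for the test suite.

```python
import numpy as np

expected = -1.2
actual = -1.1999545602598247

# np.isclose(a, b) checks |a - b| <= atol + rtol * |b|.
abs_diff = abs(actual - expected)           # roughly 4.5e-5
default_bound = 1e-8 + 1e-5 * abs(expected)  # roughly 1.2e-5

default_ok = bool(np.isclose(actual, expected))             # defaults
relaxed_ok = bool(np.isclose(actual, expected, rtol=1e-4))  # example tolerance

print(abs_diff, default_bound, default_ok, relaxed_ok)
```

The difference exceeds the default bound by a factor of about four, so the default comparison rejects it while the relaxed one accepts it.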
   
   The error in `test_dataframe_agg_method` does look like a hard failure. If 
you look 
[here](https://ci-beam.apache.org/job/beam_PreCommit_Python_Commit/20405/testReport/junit/apache_beam.dataframe.frames_test/AggregationTest/)
 you can see that it failed in every run: 
   
![image](https://user-images.githubusercontent.com/675055/140591167-ef72f201-b323-4c4b-86e8-9750867a4fc8.png)
   
   and it's consistently producing -0.4 rather than -1.7 for column B. I'd 
suggest looking closer at the column B case from that test: 
https://github.com/apache/beam/blob/a3bb58dbd4fc6a59f93f69f9ab1980f8232b6e82/sdks/python/apache_beam/dataframe/frames_test.py#L1491
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 677973)
    Time Spent: 8h  (was: 7h 50m)

> Implement parallelizable skew and kurtosis 
> -------------------------------------------
>
>                 Key: BEAM-12550
>                 URL: https://issues.apache.org/jira/browse/BEAM-12550
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe
>            Reporter: Brian Hulette
>            Assignee: Svetak Vihaan Sundhar
>            Priority: P3
>          Time Spent: 8h
>  Remaining Estimate: 0h
>
> skew and kurtosis should be parallelizable/liftable by using a similar 
> [approach as std and 
> var|https://github.com/apache/beam/blob/a0f5e932d8a9aa491b16361abdc629b5e9a483f6/sdks/python/apache_beam/dataframe/frames.py#L1307-L1310].
>  See 
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
> which has information on extending that approach to calculating the third and 
> fourth central moments, needed for skew and kurtosis.
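
A minimal sketch of the idea in the description, not the Beam implementation: each partition is summarized by its count, mean, and second to fourth central moment sums, and two summaries are merged with the pairwise formulas from the Wikipedia page linked above. The names `Moments`, `moments_of`, and `merge` are hypothetical, and the final `skew` and `excess_kurtosis` are the plain population estimators, which differ from the bias-corrected values pandas reports.

```python
import math
from typing import NamedTuple, Sequence


class Moments(NamedTuple):
    n: int        # number of values
    mean: float   # arithmetic mean
    m2: float     # sum of (x - mean)**2
    m3: float     # sum of (x - mean)**3
    m4: float     # sum of (x - mean)**4


def moments_of(values: Sequence[float]) -> Moments:
    """Summarize one partition directly."""
    n = len(values)
    if n == 0:
        return Moments(0, 0.0, 0.0, 0.0, 0.0)
    mean = sum(values) / n
    devs = [v - mean for v in values]
    return Moments(n, mean, sum(d**2 for d in devs),
                   sum(d**3 for d in devs), sum(d**4 for d in devs))


def merge(a: Moments, b: Moments) -> Moments:
    """Combine two partition summaries without revisiting the data."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta**2 * a.n * b.n / n
    m3 = (a.m3 + b.m3
          + delta**3 * a.n * b.n * (a.n - b.n) / n**2
          + 3 * delta * (a.n * b.m2 - b.n * a.m2) / n)
    m4 = (a.m4 + b.m4
          + delta**4 * a.n * b.n * (a.n**2 - a.n * b.n + b.n**2) / n**3
          + 6 * delta**2 * (a.n**2 * b.m2 + b.n**2 * a.m2) / n**2
          + 4 * delta * (a.n * b.m3 - b.n * a.m3) / n)
    return Moments(n, mean, m2, m3, m4)


def skew(m: Moments) -> float:
    """Population skewness (no bias correction)."""
    return math.sqrt(m.n) * m.m3 / m.m2**1.5


def excess_kurtosis(m: Moments) -> float:
    """Population excess kurtosis (no bias correction)."""
    return m.n * m.m4 / m.m2**2 - 3.0


# Usage: summarize partitions independently, then merge the summaries.
part_a = [1.0, 2.0, 4.0, 8.0]
part_b = [3.0, 5.0, 9.0]
merged = merge(moments_of(part_a), moments_of(part_b))
direct = moments_of(part_a + part_b)
```

Because the merge only needs the five summary values per partition, the per-partition step can run in parallel and the combine step stays cheap.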



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
