[ https://issues.apache.org/jira/browse/BEAM-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644983#comment-16644983 ]

Robert Bradshaw commented on BEAM-4858:
---------------------------------------

Submitted pr/6375.
 * Simplifies the aging out of old data. Not only was the old formula hard to
understand, but it meant that bad data could stick around and poison the
estimates forever.
 * Adds a variance parameter that lets the batch size vary over a fixed
range, giving a broader base for the linear regression.
 * Uses numpy, when available, to do the regression. This is both much more
efficient and less error-prone when expressing more complicated analyses
(see the sketch after this list).
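
To make the regression concrete, here is a minimal numpy sketch of fitting a
fixed cost plus a per-element cost to the recorded (batch size, time)
samples. This is only an illustration under assumed names (estimate_costs,
samples), not the code in pr/6375:

{code:python}
import numpy as np

def estimate_costs(samples):
  """Least-squares fit of per-batch time = fixed_cost + per_element_cost * size.

  `samples` is a list of (batch_size, total_time) pairs such as those
  accumulated by the batching transform.  Illustrative only; names and
  shapes here are not the PR's actual code.
  """
  sizes = np.asarray([size for size, _ in samples], dtype=float)
  times = np.asarray([time for _, time in samples], dtype=float)
  # Design matrix with an intercept column: time ~ a + b * size.
  A = np.column_stack([np.ones_like(sizes), sizes])
  (fixed_cost, per_element_cost), _, _, _ = np.linalg.lstsq(A, times, rcond=None)
  return fixed_cost, per_element_cost

# With batch sizes spread over the variance range, the slope and intercept
# are better conditioned than with a single repeated batch size.
print(estimate_costs([(10, 0.012), (20, 0.021), (40, 0.039), (80, 0.081)]))
{code}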

The algorithm was also changed to:
 * Eliminate outliers, both by using Cook's distance and by throwing out the
top (often high-variance and high-influence) 20% when there is sufficient
data.
 * Weight by the inverse of the batch size, to provide a more stable fixed
cost estimate (which the default "overhead" target is sensitive to). A
sketch of both ideas follows this list.
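
A minimal sketch of the trimming and inverse-size weighting, again with
numpy. The Cook's-distance filtering is omitted for brevity, and the helper
name and threshold here are hypothetical:

{code:python}
import numpy as np

def robust_fit(sizes, times):
  """Weighted least-squares fit of time = fixed_cost + per_element_cost * size.

  Illustrates the two ideas above: drop the largest (high-influence) batches
  when there is enough data, and weight each sample by 1 / batch_size.
  """
  sizes = np.asarray(sizes, dtype=float)
  times = np.asarray(times, dtype=float)
  if len(sizes) >= 10:
    # Keep the smallest 80% of batch sizes; the top 20% tend to be
    # high-variance, high-influence points.
    keep = sizes <= np.percentile(sizes, 80)
    sizes, times = sizes[keep], times[keep]
  # Weighted least squares: scale each row of the design matrix and the
  # targets by sqrt(weight), with weight = 1 / batch_size.
  sqrt_w = np.sqrt(1.0 / sizes)
  A = np.column_stack([np.ones_like(sizes), sizes]) * sqrt_w[:, None]
  b = times * sqrt_w
  (fixed_cost, per_element_cost), _, _, _ = np.linalg.lstsq(A, b, rcond=None)
  return fixed_cost, per_element_cost
{code}

Down-weighting large batches keeps the intercept (the fixed, per-batch cost)
from being dominated by a few big, noisy points, which is what makes the
"overhead" target more stable.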

These changes were tested against a large TFT pipeline and found to produce
more uniform batch sizes and similar, possibly slightly improved, total
runtimes and total costs.

> Clean up _BatchSizeEstimator in element-batching transform.
> -----------------------------------------------------------
>
>                 Key: BEAM-4858
>                 URL: https://issues.apache.org/jira/browse/BEAM-4858
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Valentyn Tymofieiev
>            Assignee: Robert Bradshaw
>            Priority: Minor
>             Fix For: 2.8.0
>
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> The Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial, performance-sensitive logic in the element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of the key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260];
> however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
> the keys become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
> negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on how division is evaluated, not from the 
> performance of the division operation itself.
> In terms of Python 3 conversion, the best course of action that avoids a 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}; in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate the 
> performance of this code, and maybe cythonize the code, since it seems to be 
> performance-sensitive.
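
For readers following along, a small illustration of the Python 2 vs.
Python 3 division behavior discussed in the description, and of the
{{old_div}} shim that preserves it (assuming the standard
{{from past.utils import old_div}} import from the futurize compatibility
package):

{code:python}
from past.utils import old_div

x1, x2 = 2, 7
print(x2 / x1)           # Python 2: 3 (floor); Python 3: 3.5 (true division)
print(x2 / float(x1))    # 3.5 on both
print(x2 // x1)          # 3 on both (explicit floor division)

# old_div reproduces the Python 2 behavior under Python 3: it floors only
# when both operands are integers.
print(old_div(x2, x1))         # 3
print(old_div(x2, float(x1)))  # 3.5
{code}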



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
