[ https://issues.apache.org/jira/browse/BEAM-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644983#comment-16644983 ]
Robert Bradshaw commented on BEAM-4858: --------------------------------------- Submitted pr/6375. * Simplifies the aging out of old data. Not only was the old formula hard to understand, but it meant that bad data could stick around and poison the estimates forever. * Adds a variance parameter allowing the batch size to vary over a fixed range giving a broader base for the linear regression. * Uses numpy when available to do the regression. This is both much more efficient and allows for less error-prone expression of more complicated analysis. The algorithm was also changed to: * Eliminates outliers, both using Cook's distance and just throwing out the top (often high-variance and high-influence) 20% when there is sufficient data. * Weight by the inverse of batch size, to provide a more stable fixed size estimate (which the default "overhead" target is sensitive to). These changes were tested against a large TFT pipeline and found to produce more uniform batch sizes and similar, possibly slightly improved, total runtimes and total costs. > Clean up _BatchSizeEstimator in element-batching transform. > ----------------------------------------------------------- > > Key: BEAM-4858 > URL: https://issues.apache.org/jira/browse/BEAM-4858 > Project: Beam > Issue Type: Bug > Components: sdk-py-core > Reporter: Valentyn Tymofieiev > Assignee: Robert Bradshaw > Priority: Minor > Fix For: 2.8.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] > non-trivial performance-sensitive logic in element-batching transform. Let's > take a look at > [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]. > > Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on > the type of the keys - whether they are integers or floats. > The keys of key-value pairs contained in {{self._data}} are added as integers > [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260], > however, when we 'thin' the collected entries > [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279], > the keys will become floats. Surprisingly, using either integer or float > division consistently [in the > comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271] > negatively affects the performance of a custom pipeline I was using to > benchmark these changes. The performance impact likely comes from changes in > the logic that depends on how division is evaluated, not from the > performance of division operation itself. > In terms of Python 3 conversion the best course of action that avoids > regression seems to be to preserve the existing Python 2 behavior using > {{old_div}} from {{past.utils.division}}, in the medium term we should clean > up the logic. We may want to add a targeted microbenchmark to evaluate > performance of this code, and maybe cythonize the code, since it seems to be > performance-sensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005)