[
https://issues.apache.org/jira/browse/BEAM-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Valentyn Tymofieiev updated BEAM-4858:
--------------------------------------
Description:
Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729]
non-trivial performance-sensitive logic in element-batching transform. Let's
take a look at
[util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on
the type of the keys - whether they are integers or floats.
The keys of key-value pairs contained in {{self._data}} are added as integers
[here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
however, when we 'thin' the collected entries
[here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
the keys will become floats. Surprisingly, using either integer or float
division consistently [in the
comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
negatively affects the performance of a custom pipeline I was using to
benchmark these changes.
In terms of Python 3 conversion the best course of action that avoids
regression seems to be to preserve the existing Python 2 behavior using
{{old_div}} from {{past.utils.division}}, in the medium term we should clean up
the logic. We may want to add a targeted microbenchmark to evaluate performance
of this code, and maybe cythonize the code, since it seems to be
performance-sensitive.
was:
Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729]
non-trivial performance-sensitive logic in element-batching transform. Let's
take a look at
[util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on
the type of the keys - whether they are integers or floats.
The keys of key-value pairs contained in {{self._data}} are added as integers
[here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
however, when we 'thin' the collected entries
[here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
the keys will become floats. Surprisingly, using either integer or float
division [in the
comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
consistently negatively affects the performance of a custom pipeline I was
using to benchmark these changes.
In terms of Python 3 conversion the best course of action that avoids
regression seems to be to preserve the existing Python 2 behavior using
{{old_div}} from {{past.utils.division}}, in the medium term we should clean up
the logic. We may want to add a targeted microbenchmark to evaluate performance
of this code, and maybe cythonize the code, since it seems to be
performance-sensitive.
> Clean up _BatchSizeEstimator in element-batching transform.
> -----------------------------------------------------------
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Valentyn Tymofieiev
> Assignee: Robert Bradshaw
> Priority: Minor
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729]
> non-trivial performance-sensitive logic in element-batching transform. Let's
> take a look at
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on
> the type of the keys - whether they are integers or floats.
> The keys of key-value pairs contained in {{self._data}} are added as integers
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
> however, when we 'thin' the collected entries
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
> the keys will become floats. Surprisingly, using either integer or float
> division consistently [in the
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
> negatively affects the performance of a custom pipeline I was using to
> benchmark these changes.
> In terms of Python 3 conversion the best course of action that avoids
> regression seems to be to preserve the existing Python 2 behavior using
> {{old_div}} from {{past.utils.division}}, in the medium term we should clean
> up the logic. We may want to add a targeted microbenchmark to evaluate
> performance of this code, and maybe cythonize the code, since it seems to be
> performance-sensitive.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)