[jira] [Updated] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

Valentyn Tymofieiev (JIRA) Tue, 24 Jul 2018 16:02:07 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Valentyn Tymofieiev updated BEAM-4858:
--------------------------------------
    Description: 
Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
non-trivial performance-sensitive logic in element-batching transform. Let's 
take a look at 
[util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
 

Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
the type of the keys - whether they are integers or floats. 

The keys of key-value pairs contained in {{self._data}} are added as integers 
[here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
 however, when we 'thin' the collected entries 
[here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
 the keys will become floats. Surprisingly, using either integer or float 
division consistently [in the 
comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
  negatively affects the performance of a custom pipeline I was using to 
benchmark these changes.

In terms of Python 3 conversion the best course of action that avoids 
regression seems to be to preserve the existing Python 2 behavior using 
{{old_div}} from {{past.utils.division}}, in the medium term we should clean up 
the logic. We may want to add a targeted microbenchmark to evaluate performance 
of this code, and maybe cythonize the code, since it seems to be 
performance-sensitive.


  was:
Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
non-trivial performance-sensitive logic in element-batching transform. Let's 
take a look at 
[util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
 

Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
the type of the keys - whether they are integers or floats. 

The keys of key-value pairs contained in {{self._data}} are added as integers 
[here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
 however, when we 'thin' the collected entries 
[here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
 the keys will become floats. Surprisingly, using either integer or float 
division [in the 
comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
 consistently negatively affects the performance of a custom pipeline I was 
using to benchmark these changes.

In terms of Python 3 conversion the best course of action that avoids 
regression seems to be to preserve the existing Python 2 behavior using 
{{old_div}} from {{past.utils.division}}, in the medium term we should clean up 
the logic. We may want to add a targeted microbenchmark to evaluate performance 
of this code, and maybe cythonize the code, since it seems to be 
performance-sensitive.



> Clean up _BatchSizeEstimator in element-batching transform.
> -----------------------------------------------------------
>
>                 Key: BEAM-4858
>                 URL: https://issues.apache.org/jira/browse/BEAM-4858
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Valentyn Tymofieiev
>            Assignee: Robert Bradshaw
>            Priority: Minor
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>   negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes.
> In terms of Python 3 conversion the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}, in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate 
> performance of this code, and maybe cythonize the code, since it seems to be 
> performance-sensitive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

Reply via email to