Hi,

I'm on it, still investigating/digging.

Regards
JB

On 17/07/2018 09:44, Ismaël Mejía wrote:
> Given that the 2.6.0 cut is supposed to be today (or next days), what
> is the status on this, has it been identified / reverted ? or is there
> any other plan ?
> On Tue, Jul 10, 2018 at 2:50 PM Reuven Lax <re...@google.com> wrote:
>>
>> If we added something slow to the core library in order to better test 
>> DirectRunner, that does sound like an unfortunate bug.
>>
>> On Mon, Jul 9, 2018 at 11:21 PM Vojtech Janota <vojta.jan...@gmail.com> 
>> wrote:
>>>
>>> Hi guys,
>>>
>>> Thank you for all of your feedback. I have created relevant issue in JIRA: 
>>> https://issues.apache.org/jira/browse/BEAM-4750
>>>
>>> @Lukasz: me mentioning the DirectRunner was somewhat unfortunate - the 
>>> bottleneck was introduced into the core library and so Flink and Spark 
>>> runners would be impacted too
>>>
>>> Thanks,
>>> Vojta
>>>
>>> On Mon, Jul 9, 2018 at 5:48 PM, Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>> Instead of reverting/working around specific checks/tests that the 
>>>> DirectRunner is doing, have you considered using one of the other runners 
>>>> like Flink or Spark with a local execution cluster. You won't hit the 
>>>> validation/verification bottlenecks that DirectRunner specifically imposes.
>>>>
>>>> On Mon, Jul 9, 2018 at 8:46 AM Jean-Baptiste Onofré <j...@nanthrax.net> 
>>>> wrote:
>>>>>
>>>>> Thanks for the update Eugene.
>>>>>
>>>>> @Vojta: do you mind to create a Jira ? I will tackle a fix for that.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 09/07/2018 17:33, Eugene Kirpichov wrote:
>>>>>> Hi -
>>>>>>
>>>>>> If I remember correctly, the reason for this change was to ensure that
>>>>>> the state is encodable at all. Prior to the change, there had been
>>>>>> situations where the coder specified on a state cell is buggy, absent or
>>>>>> set incorrectly (due to some issue in coder inference), but direct
>>>>>> runner did not detect this because it never tried to encode the state
>>>>>> cells - this would have blown up in any distributed runner.
>>>>>>
>>>>>> I think it should be possible to relax this and clone only values being
>>>>>> added to the state, rather than cloning the whole state on copy(). I
>>>>>> don't have time to work on this change myself, but I can review a PR if
>>>>>> someone else does.
>>>>>>
>>>>>> On Mon, Jul 9, 2018 at 8:28 AM Jean-Baptiste Onofré <j...@nanthrax.net
>>>>>> <mailto:j...@nanthrax.net>> wrote:
>>>>>>
>>>>>>     Hi Vojta,
>>>>>>
>>>>>>     I fully agree, that's why it makes sense to wait Eugene's feedback.
>>>>>>
>>>>>>     I remember we had some performance regression on the direct runner
>>>>>>     identified thanks to Nexmark, but it has been addressed by reverting 
>>>>>> a
>>>>>>     change.
>>>>>>
>>>>>>     Good catch anyway !
>>>>>>
>>>>>>     Regards
>>>>>>     JB
>>>>>>
>>>>>>     On 09/07/2018 17:20, Vojtech Janota wrote:
>>>>>>     > Hi Reuven,
>>>>>>     >
>>>>>>     > I'm not really complaining about DirectRunner. In fact it seems to
>>>>>>     me as
>>>>>>     > if what previously was considered as part of the "expensive extra
>>>>>>     > checks" done by the DirectRunner is now done within the
>>>>>>     > beam-runners-core-java library. Considering that all objects 
>>>>>> involved
>>>>>>     > are immutable (in our case at least) and simple assignment is
>>>>>>     > sufficient, the serialization-deserialization really seems as 
>>>>>> unwanted
>>>>>>     > and hugely expensive correctness check. If there was a problem with
>>>>>>     > identity copy, wasn't DirectRunner supposed to reveal it?
>>>>>>     >
>>>>>>     > Regards,
>>>>>>     > Vojta
>>>>>>     >
>>>>>>     > On Mon, Jul 9, 2018 at 4:46 PM, Reuven Lax <re...@google.com
>>>>>>     <mailto:re...@google.com>
>>>>>>     > <mailto:re...@google.com <mailto:re...@google.com>>> wrote:
>>>>>>     >
>>>>>>     >     Hi Vojita,
>>>>>>     >
>>>>>>     >     One problem is that the DirectRunner is designed for testing, 
>>>>>> not
>>>>>>     >     for performance. The DirectRunner currently does many
>>>>>>     >     purposely-inefficient things, the point of which is to better
>>>>>>     expose
>>>>>>     >     potential bugs in tests. For example, the DirectRunner will
>>>>>>     randomly
>>>>>>     >     shuffle the order of PCollections to ensure that your code
>>>>>>     does not
>>>>>>     >     rely on ordering.  All of this adds cost, because the current
>>>>>>     runner
>>>>>>     >     is designed for testing. There have been requests in the past
>>>>>>     for an
>>>>>>     >     "optimized" local runner, however we don't currently have such
>>>>>>     a thing.
>>>>>>     >
>>>>>>     >     In this case, using coders to clone values is more correct. In 
>>>>>> a
>>>>>>     >     distributed environment using encode/decode is the only way to
>>>>>>     copy
>>>>>>     >     values, and the DirectRunner is trying to ensure that your 
>>>>>> code is
>>>>>>     >     correct in a distributed environment.
>>>>>>     >
>>>>>>     >     Reuven
>>>>>>     >
>>>>>>     >     On Mon, Jul 9, 2018 at 7:22 AM Vojtech Janota
>>>>>>     >     <vojta.jan...@gmail.com <mailto:vojta.jan...@gmail.com>
>>>>>>     <mailto:vojta.jan...@gmail.com <mailto:vojta.jan...@gmail.com>>> 
>>>>>> wrote:
>>>>>>     >
>>>>>>     >         Hi,
>>>>>>     >
>>>>>>     >         We are using Apache Beam in our project for some time now.
>>>>>>     Since
>>>>>>     >         our datasets are of modest size, we have so far used
>>>>>>     >         DirectRunner as the computation easily fits onto a single
>>>>>>     >         machine. Recently we upgraded Beam from 2.2 to 2.4 and
>>>>>>     found out
>>>>>>     >         that performance of our pipelines drastically deteriorated.
>>>>>>     >         Pipelines that took ~3 minutes with 2.2 do not finish 
>>>>>> within
>>>>>>     >         hours now. We tried to isolate the change that causes the
>>>>>>     >         slowdown and came to the commits into the
>>>>>>     >         "InMemoryStateInternals" class:
>>>>>>     >
>>>>>>     >         * https://github.com/apache/beam/commit/32a427c
>>>>>>     >         <https://github.com/apache/beam/commit/32a427c>
>>>>>>     >         * https://github.com/apache/beam/commit/8151d82
>>>>>>     >         <https://github.com/apache/beam/commit/8151d82>
>>>>>>     >
>>>>>>     >         In a nutshell where previously the copy() method simply
>>>>>>     assigned:
>>>>>>     >
>>>>>>     >           that.value = this.value
>>>>>>     >
>>>>>>     >         There is now coder encode/decode combo hidden behind:
>>>>>>     >
>>>>>>     >           that.value = uncheckedClone(coder, this.value)
>>>>>>     >
>>>>>>     >         Can somebody explain the purpose of this change? Is it
>>>>>>     meant as
>>>>>>     >         an additional "enforcement" point, similar to 
>>>>>> DirectRunner's
>>>>>>     >         enforceImmutability and enforceEncodability? Or is it
>>>>>>     something
>>>>>>     >         that is genuinely needed to provide correct behaviour of 
>>>>>> the
>>>>>>     >         pipeline?
>>>>>>     >
>>>>>>     >         Any hints or thoughts are appreciated.
>>>>>>     >
>>>>>>     >         Regards,
>>>>>>     >         Vojta
>>>>>>     >
>>>>>>     >
>>>>>>     >
>>>>>>     >
>>>>>>     >
>>>>>>     >
>>>>>>
>>>>>>     --
>>>>>>     Jean-Baptiste Onofré
>>>>>>     jbono...@apache.org <mailto:jbono...@apache.org>
>>>>>>     http://blog.nanthrax.net
>>>>>>     Talend - http://www.talend.com
>>>>>>
>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> jbono...@apache.org
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>
>>>

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Reply via email to