Re: Portable wordcount on Flink runner broken

2018-11-19 Thread Thomas Weise
Try removing the following directories under sdks:

./go/vendor
./python/container/vendor
./go/.gogradle
./python/container/.gogradle
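
A minimal sketch of that cleanup as a shell function, assuming it is run from the root of a Beam checkout. The function name clean_sdk_residuals and its optional directory argument are illustrative only and are not part of the Beam build:

```shell
# Remove the Go/gogradle build residuals listed above.
# $1 is the sdks directory (hypothetical parameter); it defaults to ./sdks.
clean_sdk_residuals() {
    sdks_dir="${1:-./sdks}"
    for d in go/vendor python/container/vendor go/.gogradle python/container/.gogradle; do
        # rm -rf does not fail when a directory is already absent
        rm -rf "$sdks_dir/$d"
    done
}
```

Calling clean_sdk_residuals with no argument operates on ./sdks. Other directories under sdks are left untouched.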




Re: Portable wordcount on Flink runner broken

2018-11-19 Thread Ruoyun Huang
Unfortunately, the Flink job server still doesn't work consistently on my
machine. Funny thing is, it did work ONCE
(:beam-sdks-python:portableWordCount BUILD successful, finished in 18s).
When I tried again, things were back to hanging, with the server printing
messages like:

"""
[flink-akka.actor.default-dispatcher-25] DEBUG
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Received
slot report from instance 1ad9060bcc87cf5fd19c9a233c15a18f.
[flink-akka.actor.default-dispatcher-25] DEBUG
org.apache.flink.runtime.jobmaster.JobMaster - Trigger heartbeat request.
[flink-akka.actor.default-dispatcher-23] DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat
request from 006b3653dc7a24471c115d70c4c55fa6.
[flink-akka.actor.default-dispatcher-25] DEBUG
org.apache.flink.runtime.jobmaster.JobMaster - Received heartbeat from
e188c32c-cfa5-4b85-bda9-16ce4742c490.
...
(after 5 minutes, the messages above repeat indefinitely)
"""

I am trying to figure out what I did right for that one successful run.


For step 3 that Thomas mentioned, all I did for cleanup was "gradle clean".
If there is more to do, please let me know.






-- 

Ruoyun  Huang


Re: Portable wordcount on Flink runner broken

2018-11-19 Thread Maximilian Michels

Thanks for investigating, Thomas!

Ruoyun, does that solve the WordCount problem you were experiencing?

-Max




Re: Portable wordcount on Flink runner broken

2018-11-18 Thread Thomas Weise
With latest master the problem seems fixed. Unfortunately that was first
masked by build and docker issues. But I changed multiple things at once
after getting nowhere (the container build "succeeded" when in fact it did
not):

* Update to latest docker
* Increase docker disk space after seeing a spurious, non-reproducible
message in one of the build attempts
* Full clean and manually remove Go build residuals from the workspace

After that I could see the Go and container builds execute differently
(longer build time), and the result certainly looks better.
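
Since the container build can report success without producing a usable image, one way to double-check is to ask Docker whether the image exists locally. This is only a sketch: the function names are made up for illustration, and the image name in the usage note is the one that appears in the job server log in this thread.

```shell
# Succeed only if the named image is present in the local Docker image store.
image_exists() {
    docker image inspect "$1" >/dev/null 2>&1
}

# Print whether the image was found; return non-zero if it is missing.
check_image() {
    if image_exists "$1"; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}
```

Usage would look like: check_image tweise-docker-apache.bintray.io/beam/python:latest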

HTH,
Thomas





On Sun, Nov 18, 2018 at 2:11 PM Ruoyun Huang wrote:

> I was chasing the same issue (I was using the reference runner job server,
> but got the same error message) and have some clues but no conclusion yet.
>
> When I retain the container instance, the error message says "bad MD5" (see
> the other thread [1] I started on dev last week). My hypothesis, based on
> the symptoms, is that the underlying container expects an MD5 to validate
> staged files, but the job request from the Python SDK does not send a file
> hash. I hope someone can confirm whether that is the case (I am still trying
> to understand why Dataflow does not have this issue) and, if so, the best
> way to fix it.
>
>
> [1]
> https://lists.apache.org/thread.html/b26560087ff88f142e26d66c8a5a9283558c8e55b5edd705b5e53c9c@%3Cdev.beam.apache.org%3E
>
> --
> 
> Ruoyun  Huang
>
>


Portable wordcount on Flink runner broken

2018-11-16 Thread Thomas Weise
For the last few days, the steps under
https://beam.apache.org/roadmap/portability/#python-on-flink have been broken.

The gradle task hangs because the job server isn't able to launch the
docker container.

./gradlew :beam-sdks-python:portableWordCount -PjobEndpoint=localhost:8099

[CHAIN MapPartition (MapPartition at
36write/Write/WriteImpl/DoOnce/Impulse.None/beam:env:docker:v1:0) ->
FlatMap (FlatMap at
36write/Write/WriteImpl/DoOnce/Impulse.None/beam:env:docker:v1:0/out.0)
(8/8)] INFO
org.apache.beam.runners.fnexecution.environment.DockerEnvironmentFactory -
Still waiting for startup of environment
tweise-docker-apache.bintray.io/beam/python:latest for worker id 1
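
When the job server keeps waiting like this, the logs of the SDK harness container can show why it did not start. Here is a rough sketch, assuming the container has not already been removed; the function name show_container_logs is hypothetical:

```shell
# Print the last log lines of every container (running or exited)
# that was created from the given image.
show_container_logs() {
    image="$1"
    # -a includes exited containers, -q prints only IDs,
    # and the ancestor filter selects containers by image.
    for id in $(docker ps -a -q --filter "ancestor=$image"); do
        echo "== container $id =="
        docker logs --tail 50 "$id" 2>&1
    done
}
```

For example: show_container_logs tweise-docker-apache.bintray.io/beam/python:latest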

Unfortunately this isn't covered by tests yet. Is anyone aware of what
change may have caused this, or looking into resolving it?

Thanks,
Thomas