Re: Python SDK worker / portable Flink runner performance improvements

2018-10-21 Thread Thomas Weise
Regarding the functionality:

https://s.apache.org/apache-beam-portability-support-table

While we still have a good chunk of work to do, the MVP feature set is in
place and allows to run pipelines.

Before we check P2 (feature complete), I would like to see (in addition to
what Max mentioned):

* Support for metrics (user defined and sdk worker - sdk worker is
currently a black box): It should be possible to get both of these as Flink
metrics to support existing observability infrastructure.
* Support for streaming connectors at least in the Python SDK

The support table should reflect that (some rows and JIRAs are currently
missing).

Thomas


On Fri, Oct 19, 2018 at 7:11 AM Kenneth Knowles  wrote:

> This is really cool news. Pretty awesome to move from the "get it to run"
> phase to the "get it to run faster" phase of this project.
>
> Streaming testing: In Java there's a synthetic source (GenerateSequence /
> CountingSource) for testing. Maybe in this case I'd say porting to py is
> worth it?
>
> Kenn
>
> On Wed, Oct 17, 2018 at 2:00 PM Lukasz Cwik  wrote:
>
>> Thanks, this was useful for me since I have been away these past couple
>> of weeks.
>>
>> On Wed, Oct 17, 2018 at 8:45 AM Thomas Weise  wrote:
>>
>>> Hi,
>>>
>>> As you may have noticed, some of the contributors are working on
>>> enabling the Python support on Flink. The upcoming 2.8 release is going to
>>> include much of the functionality and we are now shifting gears to
>>> stability and performance.
>>>
>>> There have been some basic fixes already (logging, memory leak) and at
>>> this point we see very low throughput in streaming mode. Improvements are
>>> in-flight:
>>>
>>> https://issues.apache.org/jira/browse/BEAM-5760
>>> https://issues.apache.org/jira/browse/BEAM-5521
>>>
>>> There has been discussion and preliminary work to improve support for
>>> testing as well (streaming mode). The Python SDK currently doesn't have any
>>> (open source) streaming connectors, but we have added a Flink native
>>> transform that can be used for testing:
>>>
>>> https://issues.apache.org/jira/browse/BEAM-5707
>>>
>>> I'm starting this thread here so that it is easier for more folks to get
>>> involved and stay in sync.
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>>
>>>


Re: [Proposal] Euphoria DSL - looking for reviewers

2018-10-21 Thread Kenneth Knowles
This discussion veered into territory reserved for priv...@beam.apache.org
[1]. But the PMC has agreed that an update is deserved here:

We are taking it very seriously that you have placed faith in Beam and that
you need to be able to effectively continue development of Euphoria. We are
actively discussing it. PMC processes take some time, in part due to
geographic distances and customary pauses, so I have to ask for a little
patience.

If you have any more questions or concerns, please do feel free to reach
out to me or other PMC members privately.

Kenn

[1] https://apache.org/dev/pmc.html#mailing-list-private

On Wed, Oct 17, 2018 at 8:32 AM David Morávek 
wrote:

> That's great news. Thank you!
>
> D.
>
> On Tue, Oct 16, 2018 at 6:06 PM Thomas Weise  wrote:
>
>> Congrats to the Euphoria team!
>>
>> On Tue, Oct 16, 2018 at 8:51 AM Kenneth Knowles  wrote:
>>
>>> Merged. Welcome to the repo :-)
>>>
>>> Kenn
>>>
>>> On Thu, Oct 11, 2018 at 10:06 AM Kenneth Knowles 
>>> wrote:
>>>
 I've filed the IP Clearance. I'll report back here.

 Kenn

 On Wed, Oct 10, 2018 at 3:33 PM David Morávek 
 wrote:

>
>
> Anton:
> All of the points are be correct, with one minor exception. We are
> currently moving our production workloads from Euphoria
>  to Beam (using the DSL), but we
> are hitting scalability issues of the current spark runner, so it is not
> technically used in production yet. Everything behaves correctly in the
> staging environment, where runner can handle the workload.
>
> Kenn:
> here is the the IP Clearance document
> https://gist.github.com/dmvk/80acb0579f196e18c02a4e280978d445
>
> Thanks,
> David
>
> On Wed, Oct 10, 2018 at 11:30 PM Kenneth Knowles 
> wrote:
>
>> I just glanced through it to make sure things are in the right place
>> and build set up right and that all LGTM.
>>
>> We need to file the IP Clearance to finish the process that Davor
>> started. Please fill the XML template at
>> http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/ip-clearance-template.xml
>> then I will review and file it in SVN.
>>
>> Kenn
>>
>> On Wed, Oct 10, 2018 at 2:15 PM Anton Kedin  wrote:
>>
>>> I think the code looks good and we should probably just merge it
>>> (unless there are other blockers, e.g. formal approvals), considering:
>>>  - it has been reviewed;
>>>  - it is tested and used in production;
>>>  - it was discussed on the list and there were no objections to
>>> having it as part of Beam;
>>>  - it is a standalone extension and doesn't interfere with Beam Java
>>> SDK, if I didn't miss anything;
>>>  - it has people working on it and supporting it;
>>>
>>> All other issues can probably be sorted out in normal Beam process.
>>>
>>> Regards,
>>> Anton
>>>
>>> On Wed, Oct 10, 2018 at 5:57 AM David Morávek <
>>> david.mora...@gmail.com> wrote:
>>>
 Hello Max,

 It would be great if you can do more of a "general" review, the
 code base is fairly large, well tested and it was already reviewed
 internally by several people.

 We would like to have the overall approach and design decisions
 validated by the community and get some inputs on what could be 
 improved
 and if we are headed the right direction.

 Thanks,
 David

 On Wed, Oct 10, 2018 at 2:21 PM Maximilian Michels 
 wrote:

> That is a huge PR! :) Euphoria looks great. Especially for people
> coming
> from Flink/Spark. I'll check out the documentation.
>
> Do you have any specific code parts which you want to have
> reviewed?
>
> Thanks,
> Max
>
> On 10.10.18 10:30, Jean-Baptiste Onofré wrote:
> > Hi,
> >
> > Thanks for all the work you are doing on this DSL !
> >
> > I tried to follow the features branch for a while. I'm still
> committed
> > to  move forward on that front,  but more reviewers would be
> great.
> >
> > Regards
> > JB
> >
> > On 10/10/2018 10:26, Plajt, Vaclav wrote:
> >> Hello Beam devs,
> >> we finished our main goals in development of Euphoria DSL. It
> is Easy to
> >> use Java 8 API build on top of the Beam's Java SDK. API
> provides a
> >> high-level abstraction of data transformations, with focus on
> the Java 8
> >> language features (e.g. lambdas and streams). It is fully
> inter-operable
> >> with existing Beam SDK and convertible back and forth. It
> allows fast
> >> prototyping through use of (optional) Kryo based 

Re: Docker missing on Beam15

2018-10-21 Thread Thomas Weise
There are two issues with
https://builds.apache.org/job/beam_PostCommit_Python_VR_Flink/ currently:

1) The mentioned issue with docker on beam15 - Jason, can you possibly
advise how to deal with it?

2) Frequent failure due to "Segmentation fault (core dumped)", as exhibited
by
https://builds.apache.org/job/beam_PostCommit_Python_VR_Flink/449/consoleText

The Gradle scan is here:

https://scans.gradle.com/s/ebhxs4l65cow4/failure?openFailures=WzBd=WzEse31d#top=0

There are multiple of those in sequence on beam13

Some more comments: https://issues.apache.org/jira/browse/BEAM-5467

Any help to further investigate or fix would be appreciated!

Thanks,
Thomas



On Fri, Oct 19, 2018 at 4:51 PM Yifan Zou  wrote:

> I got "Failed to restart docker.service: Interactive authentication
> required" while trying to restart the docker on beam15.
> Does anyone have the permission to do that? Or, we need to ask Apache
> Infra for help.
>
> Thanks.
> Yifan
>
> On Fri, Oct 19, 2018 at 2:51 PM Ankur Goenka  wrote:
>
>> Hi,
>>
>> Can we restart docker as it seems to have fixed the issue for others
>> https://github.com/moby/moby/issues/31849 ?
>>
>> Thanks,
>> Ankur
>>
>> On Fri, Oct 19, 2018 at 1:11 PM Yifan Zou  wrote:
>>
>>> Hi,
>>>
>>> The docker has been installed on all Jenkins VMs. The image build
>>> process was interrupted by a grpc connection issue.
>>>
>>> *11:02:12* Starting process 'command 'docker''. Working directory: 
>>> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_VR_Flink/src/sdks/python/container/build/docker
>>>  Command: docker build --no-cache -t 
>>> jenkins-docker-apache.bintray.io/beam/python:latest .*11:02:12* 
>>> Successfully started process 'command 'docker''*11:02:12* Sending build 
>>> context to Docker daemon  17.65MB
>>> *11:02:12* Step 1/9 : FROM python:2-stretch*11:02:12*  ---> 
>>> 3c43a5d4034a*11:02:12* Step 2/9 : MAINTAINER "Apache Beam 
>>> "*11:02:12*  ---> Running in f86bad9aef9c*11:02:12*  
>>> ---> 610a5dec907e*11:02:12* Removing intermediate container 
>>> f86bad9aef9c*11:02:12* Step 3/9 : RUN apt-get update && apt-get install 
>>> -ylibsnappy-devlibyaml-dev&& rm -rf 
>>> /var/lib/apt/lists/**11:02:12*  ---> Running in 5e9b67be03f9*11:02:12* 
>>> grpc: the connection is unavailable
>>>
>>>
>>> - Yifan
>>>
>>>
>>>
>>> On Fri, Oct 19, 2018 at 12:45 PM Ankur Goenka  wrote:
>>>
 Hi,

 Flink Validates Runner test cases are failing on Beam 15 because docker
 is not installed.
 Failing tasks
 https://builds.apache.org/job/beam_PostCommit_Python_VR_Flink/buildTimeTrend
 Can we install docker on all the machines as the Portable Validates
 Runner tests need it.

 Thanks,
 Ankur

>>>


Possible memory leak in Direct Runner unbounded

2018-10-21 Thread Martin Procházka
Hello,
I have got an application, which utilizes Beam pipeline - Direct Runner. It
contains an unbounded source. I have got a frontend, which manually adds
some data into the pipeline with the same timestamp in order to be
processed in the same window.

The pipeline runs well, however it eventually runs out of heap space. I
have profiled the application and have noticed that there is a hotspot in
outputWatermark - holds - keyedHolds. It gets swamped mainly by values
keyed by the anonymous StructuralKey 'empty' classes over time. With every
request it grows and never gets released.

When I changed the empty structural key to true singleton, it solved a part
of this issue, but I have noticed that there is a specific test that
ensures that two empty keys (StructuralKey) are not equal so my change
would not be valid. When are those empty keys used and when should they be
removed in the Direct runner? Is there some mechanism to prevent the
inevitable heap out of memory error after few requests?

Regards,
Martin Prochazka


2 tier input

2018-10-21 Thread Chaim Turkel
hi,
  I have the following flow i need to implement.
>From the bigquery i run a query and get a list of id's then i need to
load from mongo all the documents based on these id's and export them
as an xml file.
How do you suggest i go about doing this?

chaim

-- 


Loans are funded by
FinWise Bank, a Utah-chartered bank located in Sandy, 
Utah, member FDIC, Equal
Opportunity Lender. Merchant Cash Advances are 
made by Behalf. For more
information on ECOA, click here 
. For important information about 
opening a new
account, review Patriot Act procedures here 
.
Visit Legal 
 to
review our comprehensive program terms, 
conditions, and disclosures.