Re: Beam spark 2.x runner status

2017-03-23 Thread Kobi Salant
Hi,

We use SparkContext & StreamingContext extensively in the Spark runner to
create the DStreams & RDDs, so we will need to work on migrating from the
1.X terms to the 2.X terms (we may find other incompatibilities during the
work).
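
For concreteness, a rough sketch of the two entry-point styles involved (the
class names are the real Spark ones; the wiring below is purely illustrative
and compiles only against Spark 2.x):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ContextMigrationSketch {
  public static void main(String[] args) {
    // Spark 1.x-style entry points, as the runner uses them today:
    SparkConf conf = new SparkConf().setAppName("beam").setMaster("local[2]");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    JavaStreamingContext jssc =
        new JavaStreamingContext(jsc, Durations.seconds(1));

    // Spark 2.x makes SparkSession the entry point, but the old contexts
    // stay reachable from it, which is what would let an RDD-based
    // translation remain mostly unchanged:
    SparkSession session =
        SparkSession.builder().appName("beam").master("local[2]").getOrCreate();
    JavaSparkContext fromSession =
        JavaSparkContext.fromSparkContext(session.sparkContext());
  }
}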

Regards
Kobi


2017-03-23 6:55 GMT+02:00 Jean-Baptiste Onofré :

> Hi guys,
>
> Ismaël summarized well what I have in mind.
>
> I'm a bit late on the PoC around that (I started a branch already).
> I will move forward over the weekend.
>
> Regards
> JB
>
>
> On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
>
>> Amit, I suppose JB is talking about the RDD-based version, so no need
>> to worry about SparkSession or different incompatible APIs.
>>
>> Remember, the idea we are discussing is to have in master both the
>> Spark 1 and Spark 2 runners using the RDD-based translation. At the
>> same time we can have a feature branch to evolve the Dataset-based
>> translator (this one will replace the RDD-based translator for Spark 2
>> once it is mature).
>>
>> The advantages have already been discussed, as well as the possible
>> issues, so I think we now have to see if JB's idea is feasible and how
>> hard it would be to live with this while the Dataset version evolves.
>>
>> I think what we are trying to avoid is having a long-living branch
>> for a Spark 2 runner based on RDDs, because the maintenance burden
>> would be even worse. We would have to fight not only the double
>> merge of fixes (in case the profile idea does not work), but also the
>> continued evolution of Beam, and we would end up in the long-living
>> branch mess that other runners have dealt with (e.g. the Apex runner):
>>
>> https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
>>
>> What do you think about this, Amit? Would you be OK to go with it if
>> JB's profile idea proves to help with the maintenance issues?
>>
>> Ismaël
>>
>>
>>
>> On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu  wrote:
>>
>>> The hbase-spark module doesn't use SparkSession, so the situation there
>>> is simpler :-)
>>>
>>> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela  wrote:
>>>
>>>> I'm still wondering how we'll do this - it's not just different
>>>> implementations of the same class, but completely different concepts,
>>>> such as using SparkSession in Spark 2 instead of
>>>> SparkContext/StreamingContext in Spark 1.
>>>>
>>>> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu  wrote:
>>>>
>>>>> I have done some work over in HBASE-16179 where compatibility modules
>>>>> are created to isolate changes in the Spark 2.x API so that code in
>>>>> the hbase-spark module can be reused.
>>>>>
>>>>> FYI

> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Beam spark 2.x runner status

2017-03-23 Thread Jean-Baptiste Onofré

Hi Kobi,

It's part of the plan, yes. Let me push the branch to my GitHub and share it
with you (after rebasing).


Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Beam spark 2.x runner status

2017-03-23 Thread Amit Sela
If StreamingContext is valid and we don't have to use SparkSession, and
Accumulators are valid as well and we don't need AccumulatorV2, I don't
see a reason this shouldn't work (which means there are still tons of
reasons this could break, but I can't think of them off the top of my head
right now).

@JB, simply add a profile for the Spark dependencies and run the tests -
you'll have a very definitive answer ;-)
If this passes, try on a cluster running Spark 2 as well.

Let me know if I can assist.
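
Roughly what I'd try, sketched below (the version numbers, property names
and module path are my assumptions, not an actual patch):

<!-- In runners/spark/pom.xml, parameterize the Spark dependencies: -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>

<!-- ... and let a profile switch both version properties at once: -->
<profile>
  <id>spark2</id>
  <properties>
    <spark.version>2.0.2</spark.version>
    <scala.binary.version>2.11</scala.binary.version>
  </properties>
</profile>

Then something like: mvn clean verify -Pspark2 -pl runners/spark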



Re: Beam spark 2.x runner status

2017-03-23 Thread Kobi Salant
So, if everything is in place in Spark 2.X and we use provided dependencies
for Spark in Beam, then theoretically you can run the same code on 2.X
without any need for a branch?



Re: [jira] [Commented] (BEAM-1261) State API should allow state to be managed in different windows

2017-03-23 Thread Robert Bradshaw
I like the idea of being able to use WindowMappingFns to access state
across windows in a manner similar to how side inputs are accessed.
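
To make that concrete, here is the kind of mapping I am imagining (purely a
sketch - nothing wires a WindowMappingFn into the state API today): map each
hourly window to its enclosing day, so state attached to the day window is
visible from every hour within it.

import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.WindowMappingFn;
import org.joda.time.DateTimeZone;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class HourToDaySketch {
  // Maps an hourly IntervalWindow to the UTC day that contains it.
  static final WindowMappingFn<IntervalWindow> HOUR_TO_DAY =
      new WindowMappingFn<IntervalWindow>() {
        @Override
        public IntervalWindow getSideInputWindow(BoundedWindow mainWindow) {
          Instant start = ((IntervalWindow) mainWindow).start();
          Instant day =
              start.toDateTime(DateTimeZone.UTC).withTimeAtStartOfDay().toInstant();
          return new IntervalWindow(day, day.plus(Duration.standardDays(1)));
        }
      };
}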

On Wed, Mar 22, 2017 at 9:56 PM, Kenneth Knowles (JIRA) 
wrote:

>
> [ https://issues.apache.org/jira/browse/BEAM-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937725#comment-15937725 ]
>
> Kenneth Knowles commented on BEAM-1261:
> ---------------------------------------
>
> This is an interesting model idea somewhat related to WindowMappingFn. I
> think we should first gain familiarity with both of those before tackling
> this, so I am going to set it to unassigned.
>
> > State API should allow state to be managed in different windows
> > ---------------------------------------------------------------
> >
> >             Key: BEAM-1261
> >             URL: https://issues.apache.org/jira/browse/BEAM-1261
> >         Project: Beam
> >      Issue Type: New Feature
> >      Components: beam-model, sdk-java-core
> >        Reporter: Ben Chambers
> >        Assignee: Kenneth Knowles
> >
> > For example, even if the elements are being processed in fixed windows
> > of an hour, it may be desirable for the state to "roll over" between
> > windows (or be available to all windows).
> > It will also be necessary to figure out when this state should be
> > deleted (TTL? maximum retention?).
> > Another problem is how to deal with out-of-order data. If data comes in
> > from the 10:00 AM window, should its state changes be visible to the data
> > in the 9:00 AM window?
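
For context, under the current API state is implicitly scoped to the window
of the element being processed - a minimal sketch (class and package names
as in recent Beam; details may differ slightly by version):

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class CountPerKeyAndWindowFn extends DoFn<KV<String, Long>, Long> {
  // This counter lives per key AND per window; BEAM-1261 asks how it
  // could roll over between windows or be shared across them.
  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value();

  @ProcessElement
  public void process(
      ProcessContext c, @StateId("count") ValueState<Long> count) {
    Long current = count.read();
    long next = (current == null ? 0L : current) + 1;
    count.write(next);
    c.output(next);
  }
}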


Re: Beam connector development for Hive as a data source

2017-03-23 Thread Madhusudan Borkar
Hi Davor,
Thanks for your response. I am working with my team, and we have some
questions where we need a little bit of help.
We are creating a pipeline where the source is HDFS, but when the pipeline
is run it cannot find the Hadoop host.
Do we need any configuration before we run this pipeline? I could not find
any doc on HDFS other than the HDFS URI.
This is our code:
HDFSFileSource<KV<LongWritable, Text>, LongWritable, Text> source =
    HDFSFileSource.from(
        "hdfs://hadoop-clust-0118-m:8020/tmp/puru/outputAllCols2039/part-m-0",
        TextInputFormat.class, LongWritable.class, Text.class);

source.validate();
p.apply(Read.from(source));
p.run().waitUntilFinish();
The error is "host not found". We would appreciate your help.
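
For reference, a plain-Hadoop check that can be run from the machine
launching the pipeline (our guess that this is client-side hostname
resolution is only an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://hadoop-clust-0118-m:8020");
    // If this throws UnknownHostException, the NameNode host does not
    // resolve from this machine, independently of Beam.
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.exists(new Path("/tmp/puru")));
  }
}
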
I also sent a request to join the forum and am waiting for a response.

regards,
Madhu Borkar
(c) (408) 390-9518

On Mon, Feb 6, 2017 at 5:41 PM, Davor Bonaci  wrote:

> Hi Madhu,
> Welcome! I suggest subscribing to the dev@ mailing list and using the
> same email address when sending to the list, to avoid your email being
> caught in moderation.
>
> It would be great to have a connector for Apache Hive. Keep in mind that
> several folks have expressed interest in using and contributing this
> connector. As far as I know, nobody is *actively* working on it, so you
> should be good to go. Please use BEAM-1158 [1] to coordinate this work with
> any other interested contributor.
>
> Note that there are several different ways of connecting Beam and Hive.
> The simplest one is to write a HiveIO which would run a Hive query and
> process Hive's results in Beam. Another would be to use Beam within Hive to
> compute the results of a Hive query. Finally, one could possibly write a
> Hive-based DSL on top of a Beam SDK.
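>
> To make the first approach concrete, the kind of surface I mean (entirely
> hypothetical - HiveIO and every method on it are placeholder names for a
> design to be settled in BEAM-1158, not existing API):
>
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>
> public class HiveReadSketch {
>   public static void main(String[] args) {
>     Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
>     // HiveIO, withMetastoreUri and withQuery are all placeholders.
>     p.apply(HiveIO.read()
>         .withMetastoreUri("thrift://example-metastore:9083")
>         .withQuery("SELECT id, name FROM users"));
>     p.run().waitUntilFinish();
>   }
> }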
>
> All of these approaches are valid and somewhat orthogonal to one another.
> I'm assuming you are after the first one. If so, and if you plan to follow
> already established patterns in other connectors, you don't necessarily
> need a design document. Otherwise, please start with a design document. We
> have linked a template in the Contribution Guide [2, 3].
>
> Once again, welcome and let us know if we can help in any way!
>
> Davor
>
> [1] https://issues.apache.org/jira/browse/BEAM-1158
> [2] https://beam.apache.org/contribute/contribution-guide/
> [3] https://docs.google.com/document/d/1qYQPGtabN5-E4MjHsecqqC7PXvJtXvZukPfLXQ8rHJs
>
> On Mon, Feb 6, 2017 at 4:27 PM, Madhusudan Borkar 
> wrote:
>
>> Hello,
>>
>> I am a Big Data Architect working at eTouch Systems. We are GCP partners.
>> We are planning to contribute to Beam by developing a connector for Apache
>> Hive as a data source.
>> I understand that before any development work begins, we need to submit
>> our design to the Beam community. Could you please share a "design
>> template" document? We will submit our design document using the template.
>>
>>
>> Thank you.
>>
>> best regards
>> Madhu Borkar
>>
>
>


IO IT Patterns: Simplifying data loading

2017-03-23 Thread Stephen Sisk
hi!

I just opened a jira ticket that I wanted to make sure the mailing list got
a chance to see.

The problem is that the current design pattern for doing data loading in IO
ITs (either writing a small program or using an external tool) is complex
and inefficient, and requires extra steps like installing external tools and
probably using a VM. It also really doesn't scale well to the larger data
sizes we'd like to use for performance benchmarking.

My proposal is that instead of trying to test read and write separately,
the test should be a "write, then read back what you just wrote", all using
the IO being tested. To support scenarios like "I want to run my read test
repeatedly without re-writing the data", tests would add flags for
"skipCleanUp" and "useExistingData".

I think we've all likely seen this type of solution when testing storage
layers in the past, and I've previously shied away from it in this context,
but now that I've seen some real ITs and thought about scaling them, I
think in this case it's the right solution.

Please take a look at the jira if you have questions - there's a lot more
detail there.

S


Re: IO IT Patterns: Simplifying data loading

2017-03-23 Thread Ted Yu
Looks like you forgot to include the JIRA number: BEAM-1799

Cheers



Re: IO IT Patterns: Simplifying data loading

2017-03-23 Thread Stephen Sisk
thanks, appreciated :)
