Re: Want to contribute to Beam project

2017-04-01 Thread Jean-Baptiste Onofré

Hi Tarush,

Welcome aboard!

You can take a look at https://beam.apache.org/contribute/.

Any contribution is valuable (not only code): documentation, etc.

I suggest taking a look at the Jira, experimenting with Beam to find new
features/improvements, and getting involved on the mailing list.


Regards
JB

On 04/01/2017 09:59 PM, tarush grover wrote:

Hi Members,

Let me introduce myself: I am Tarush Grover, a senior software engineer with 3
years of experience in big data technologies. I find Apache Beam to be an
exciting project.

I would like to get involved in this exciting journey. Please guide me on where
and how to start so that I can quickly get up to speed with the active
development; it would be great if you could assign me something to start with.

Regards,
Tarush



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [PROPOSAL] ORC support

2017-04-01 Thread Jean-Baptiste Onofré

+1

By the way, around the same topic, I'm working on Apache CarbonData support 
(http://carbondata.apache.org/).


Regards
JB

On 04/01/2017 05:31 PM, Tibor Kiss wrote:

Hello,

Recently the Optimized Row Columnar (ORC) file format was spun off from Hive
and became a top-level Apache project: https://orc.apache.org/

It is similar to Parquet in the sense that it uses a column-major format, but
ORC has a more elaborate type system and stores basic statistics for each
column.

I'd be interested in extending Beam with ORC support if others find it
helpful too.

What do you think?

- Tibor



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
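
Since there is no OrcIO in Beam yet, here is a minimal, assumption-laden sketch
of one way to experiment in the meantime: file paths arrive as pipeline elements
and a plain DoFn decodes each file with the orc-core reader (the class name is
made up, and the first column is assumed to be a BIGINT). A real connector would
instead be a splittable source that can push down the per-column statistics
Tibor mentions.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.Reader;
    import org.apache.orc.RecordReader;

    /**
     * Illustrative only: reads each ORC file named by an input element and emits
     * the values of its first column, assumed here to be a BIGINT. Null handling
     * and isRepeating columns are omitted for brevity.
     */
    public class ReadOrcFirstColumnFn extends DoFn<String, Long> {
      @ProcessElement
      public void process(ProcessContext c) throws Exception {
        Configuration conf = new Configuration();
        Reader reader =
            OrcFile.createReader(new Path(c.element()), OrcFile.readerOptions(conf));
        RecordReader rows = reader.rows();
        // ORC is column-major: each batch exposes whole column vectors at once.
        VectorizedRowBatch batch = reader.getSchema().createRowBatch();
        while (rows.nextBatch(batch)) {
          LongColumnVector col = (LongColumnVector) batch.cols[0];
          for (int r = 0; r < batch.size; r++) {
            c.output(col.vector[r]);
          }
        }
        rows.close();
      }
    }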


Want to contribute to Beam project

2017-04-01 Thread tarush grover
Hi Members,

Let me introduce myself: I am Tarush Grover, a senior software engineer with 3
years of experience in big data technologies. I find Apache Beam to be an
exciting project.

I would like to get involved in this exciting journey. Please guide me on where
and how to start so that I can quickly get up to speed with the active
development; it would be great if you could assign me something to start with.

Regards,
Tarush


Re: [PROPOSAL] ORC support

2017-04-01 Thread Ismaël Mejía
+1

From my previous work experience, ORC in certain cases performs better
than Parquet and really deserves to be supported.


On Sat, Apr 1, 2017 at 5:58 PM, Ted Yu  wrote:
> +1
>
>> On Apr 1, 2017, at 8:31 AM, Tibor Kiss  wrote:
>>
>> Hello,
>>
>> Recently the Optimized Row Columnar (ORC) file format was spun off from Hive
>> and became a top-level Apache project: https://orc.apache.org/
>>
>> It is similar to Parquet in the sense that it uses a column-major format, but
>> ORC has a more elaborate type system and stores basic statistics for each
>> column.
>>
>> I'd be interested in extending Beam with ORC support if others find it
>> helpful too.
>>
>> What do you think?
>>
>> - Tibor


Re: Update of Pei in Alibaba

2017-04-01 Thread Ismaël Mejía
Excellent news,

Pei, it would be great to have a new runner. I am curious about how different
the Storm implementations are from one another, considering that there are
already three 'versions': Storm, JStorm and Heron. I wonder if one runner
could translate to an API that would cover all of them (of course, maybe I am
being naive; I really don't know much about JStorm or Heron and how much they
differ from the original Storm).

Jingsong, I am very curious about this Galaxy project. Is there any public
information about it? Is it related to the earlier Blink Alibaba project? I
already looked a bit, but searching for "Alibaba Galaxy" is a recipe for a
myriad of phone sellers :)

Nice to see that you are going to keep contributing to the project, Pei.

Regards,
Ismaël



On Sat, Apr 1, 2017 at 4:59 PM, Tibor Kiss  wrote:
> Exciting times, looking forward to trying it out!
>
> I should mention that Taylor Goetz has also started creating a Beam runner
> using Storm. His work is available in the Storm repo:
> https://github.com/apache/storm/commits/beam-runner
> Maybe it's worthwhile to take a peek and see if something is reusable from
> there.
>
> - Tibor
>
> On Sat, Apr 1, 2017 at 4:37 AM, JingsongLee  wrote:
>
>> Wow, very glad to see that JStorm has also started building a Beam runner.
>> I am working on Galaxy (another stream processing engine at Alibaba).
>> I hope that we can work together to promote the use of Apache Beam
>> in Alibaba and China.
>>
>> best,
>> JingsongLee
>> --
>> From: Pei HE
>> Time: 2017 Apr 1 (Sat) 09:24
>> To: dev <dev@beam.apache.org>
>> Subject: Update of Pei in Alibaba
>> Hi all,
>> In February, I moved from Seattle to Hangzhou, China, and joined Alibaba.
>> I want to give an update on things here.
>>
>> A colleague and I have been working on a JStorm runner. We have a prototype
>> that works with WordCount and PAssert. (I am going to start a separate email
>> thread about how to get it reviewed and merged in Apache Beam.)
>> We also have Spark clusters, and are very interested in
>> using the Spark runner.
>>
>> Last Saturday, I went to the China Hadoop Summit and gave a talk about the
>> Apache Beam model. While many companies gave talks about their in-house
>> solutions for unified batch and unified SQL, there is also a lot of interest
>> in and enthusiasm for Beam.
>>
>> Looking forward to chatting more.
>> --
>> Pei
>>
>>
>
>
> --
> Kiss Tibor
>
> +36 70 275 9863
> tibor.k...@gmail.com


Re: [PROPOSAL] ORC support

2017-04-01 Thread Ted Yu
+1

> On Apr 1, 2017, at 8:31 AM, Tibor Kiss  wrote:
> 
> Hello,
> 
> Recently the Optimized Row Columnar (ORC) file format was spun off from Hive
> and became a top-level Apache project: https://orc.apache.org/
>
> It is similar to Parquet in the sense that it uses a column-major format, but
> ORC has a more elaborate type system and stores basic statistics for each
> column.
>
> I'd be interested in extending Beam with ORC support if others find it
> helpful too.
> 
> What do you think?
> 
> - Tibor


Re: Update of Pei in Alibaba

2017-04-01 Thread Tibor Kiss
Exciting times, looking forward to trying it out!

I should mention that Taylor Goetz has also started creating a Beam runner
using Storm. His work is available in the Storm repo:
https://github.com/apache/storm/commits/beam-runner
Maybe it's worthwhile to take a peek and see if something is reusable from
there.

- Tibor

On Sat, Apr 1, 2017 at 4:37 AM, JingsongLee  wrote:

> Wow, very glad to see that JStorm has also started building a Beam runner.
> I am working on Galaxy (another stream processing engine at Alibaba).
> I hope that we can work together to promote the use of Apache Beam
> in Alibaba and China.
>
> best,
> JingsongLee
> --
> From: Pei HE
> Time: 2017 Apr 1 (Sat) 09:24
> To: dev <dev@beam.apache.org>
> Subject: Update of Pei in Alibaba
> Hi all,
> In February, I moved from Seattle to Hangzhou, China, and joined Alibaba.
> I want to give an update on things here.
>
> A colleague and I have been working on a JStorm runner. We have a prototype
> that works with WordCount and PAssert. (I am going to start a separate email
> thread about how to get it reviewed and merged in Apache Beam.)
> We also have Spark clusters, and are very interested in
> using the Spark runner.
>
> Last Saturday, I went to the China Hadoop Summit and gave a talk about the
> Apache Beam model. While many companies gave talks about their in-house
> solutions for unified batch and unified SQL, there is also a lot of interest
> in and enthusiasm for Beam.
>
> Looking forward to chatting more.
> --
> Pei
>
>


-- 
Kiss Tibor

+36 70 275 9863
tibor.k...@gmail.com
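
Pei's mention above of a prototype that "works with WordCount and PAssert"
refers to the usual smoke test for a new runner: a small pipeline whose expected
output is asserted from inside the pipeline itself, so the same code runs
unchanged on any runner chosen via the --runner option. A minimal sketch (a
generic illustration, not the JStorm prototype's actual test; the class name is
made up):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.testing.PAssert;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class RunnerSmokeTest {
      public static void main(String[] args) {
        // Pick the runner on the command line, e.g. --runner=DirectRunner.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        PCollection<KV<String, Long>> counts =
            p.apply(Create.of("hello", "beam", "hello"))
             .apply(Count.perElement());

        // PAssert compiles the expectation into the pipeline, so the check is
        // executed by whichever runner runs the pipeline.
        PAssert.that(counts).containsInAnyOrder(KV.of("hello", 2L), KV.of("beam", 1L));

        p.run().waitUntilFinish();
      }
    }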


Re: Call for help: let's add Splittable DoFn to Spark, Flink and Apex runners

2017-04-01 Thread Eugene Kirpichov
Hey all,

The Flink PR has been merged, and thus Flink becomes the first
distributed runner to support Splittable DoFn!!!
Thank you, Aljoscha!

Looking forward to Spark and Apex, and continuing work on Dataflow.
I'll also send proposals about a couple of new ideas related to SDF next
week.
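
For readers new to the thread, a Splittable DoFn describes the remaining work
for each element as a restriction (for example an offset range) that the runner
can split and checkpoint while the element is being processed. A toy sketch,
written against the current SDK's annotation style (the 2017-era signatures were
positional and differ slightly); this is an illustration, not code from the
Flink PR:

    import org.apache.beam.sdk.io.range.OffsetRange;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
    import org.apache.beam.sdk.values.KV;

    /** Toy SDF: for an element (name, n) it emits 0..n-1, splittably. */
    public class EmitRangeFn extends DoFn<KV<String, Long>, Long> {

      // The restriction describes the outstanding work for one element.
      @GetInitialRestriction
      public OffsetRange getInitialRestriction(@Element KV<String, Long> element) {
        return new OffsetRange(0, element.getValue());
      }

      @ProcessElement
      public void process(
          @Element KV<String, Long> element,
          RestrictionTracker<OffsetRange, Long> tracker,
          OutputReceiver<Long> out) {
        // Claim positions one at a time; the runner may split or checkpoint the
        // remaining range between claims, which is what a distributed runner
        // like Flink can exploit.
        for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); i++) {
          out.output(i);
        }
      }
    }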

On Thu, Mar 30, 2017 at 9:08 AM Amit Sela  wrote:

> I will not be able to make it this weekend, too busy. Let's chat at the
> beginning of next week and see what's on my plate.
>
> On Tue, Mar 28, 2017 at 5:44 PM Aljoscha Krettek 
> wrote:
>
> > Thanks for the offers, guys! The code is finished, though. I only need
> > to do the last touch-ups.
> >
> > On Tue, Mar 28, 2017, at 09:16, JingsongLee wrote:
> > > Hi Aljoscha,
> > > I would like to work on the Flink runner with you.
> > >
> >
> > > Best,
> > > JingsongLee
> > > --
> > > From: Jean-Baptiste Onofré
> > > Time: 2017 Mar 28 (Tue) 14:04
> > > To: dev
> > > Subject: Re: Call for help: let's add Splittable DoFn to Spark, Flink and
> > > Apex runners
> > > Hi Aljoscha,
> > >
> > > do you need some help on this ?
> > >
> > > Regards
> > > JB
> > >
> > > On 03/28/2017 08:00 AM, Aljoscha Krettek wrote:
> > > > Hi,
> > > > sorry for being so slow but I’m currently traveling.
> > > >
> > > > The Flink code works but I think it could benefit from some refactoring
> > > > to make the code nice and maintainable.
> > > >
> > > > Best,
> > > > Aljoscha
> > > >
> > > > On Tue, Mar 28, 2017, at 07:40, Jean-Baptiste Onofré wrote:
> > > >> I add myself on the Spark runner.
> > > >>
> > > >> Regards
> > > >> JB
> > > >>
> > > >> On 03/27/2017 08:18 PM, Eugene Kirpichov wrote:
> > > >>> Hi all,
> > > >>>
> > > >>> Let's continue the ~bi-weekly sync-ups about state of SDF support in
> > > >>> Spark/Flink/Apex runners.
> > > >>>
> > > >>> Spark:
> > > >>> Amit, Aviem, Ismaël - when would be a good time for you; does same time
> > > >>> work (8am PST this Friday)? Who else would like to join?
> > > >>>
> > > >>> Flink:
> > > >>> I pinged the PR, but - Aljoscha, do you think it's worth discussing
> > > >>> anything there over a videocall?
> > > >>>
> > > >>> Apex:
> > > >>> Thomas - how about same time next Monday? (9:30am PST) Who else would
> > > >>> like to join?
> > > >>>
> > > >>> On Mon, Mar 20, 2017 at 9:59 AM Eugene Kirpichov <kirpic...@google.com>
> > > >>> wrote:
> > > >>>
> > >  Meeting notes:
> > >  Thomas and I had a video call and we pretty much walked through the
> > >  implementation of SDF in the runner-agnostic part and in the direct runner.
> > >  Flink and Apex are pretty similar, so likely
> > >  https://github.com/apache/beam/pull/2235 (the Flink PR) will give a very
> > >  good guideline as to how to do this in Apex.
> > >  Will talk again in ~2 weeks; and will involve +David Yan, who is also on
> > >  Apex and currently conveniently works on the Google Dataflow team and,
> > >  from in-person conversation, was interested in being involved :)
> > > 
> > >  On Mon, Mar 20, 2017 at 7:34 AM Eugene Kirpichov <kirpic...@google.com>
> > >  wrote:
> > > 
> > >  Thomas - yes, 9:30 works, shall we do that?
> > > 
> > >  JB - excellent! You can start experimenting already, using the direct
> > >  runner!
> > > 
> > >  On Mon, Mar 20, 2017, 2:26 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> > >  wrote:
> > > 
> > >  Hi Eugene,
> > > 
> > >  Thanks for the meeting notes !
> > > 
> > >  I will be in the next call and Ismaël also provided me some updates.
> > > 
> > >  I will sync with Amit on the Spark runner and start to experiment and
> > >  test SDF on the JMS IO.
> > > 
> > >  Thanks !
> > >  Regards
> > >  JB
> > > 
> > >  On 03/17/2017 04:36 PM, Eugene Kirpichov wrote:
> > > > Meeting notes from today's call with Amit, Aviem and Ismaël:
> > > >
> > > > Spark has 2 types of stateful operators; a cheap one intended for updating
> > > > elements (works with state but not with timers) and an expensive one. I.e.
> > > > there's no efficient direct counterpart to Beam's keyed state model. In the
> > > > implementation of the Beam State & Timers API, the Spark runner will use the
> > > > cheaper one for state and the expensive one for timers. So, for SDF, which
> > > > in the runner-agnostic SplittableParDo expansion needs both state and
> > > > timers, we'll need the expensive one - but this should be fine since with
> > > > SDF the bottleneck should be in the ProcessElement call itself, not in
> > > > splitting/scheduling it.
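
For context, the two primitives being discussed are declared directly on a DoFn:
per-key state cells via @StateId and timers via @TimerId, and the SplittableParDo
expansion relies on both for its bookkeeping. A generic sketch of a DoFn using
them (current SDK package names; illustrative only, not the Spark runner's
internal code):

    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.state.BagState;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    /** Buffers values per key and flushes them when an event-time timer fires. */
    public class BufferAndFlushFn extends DoFn<KV<String, Long>, Long> {

      @StateId("buffer")
      private final StateSpec<BagState<Long>> bufferSpec = StateSpecs.bag(VarLongCoder.of());

      @TimerId("flush")
      private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

      @ProcessElement
      public void process(
          ProcessContext c,
          @StateId("buffer") BagState<Long> buffer,
          @TimerId("flush") Timer flush) {
        buffer.add(c.element().getValue());
        // (Re)set an event-time timer; setting the same timer again for this key
        // simply overwrites the previous firing time.
        flush.set(c.timestamp().plus(Duration.standardMinutes(1)));
      }

      @OnTimer("flush")
      public void onFlush(OnTimerContext c, @StateId("buffer") BagState<Long> buffer) {
        // Emit everything buffered for this key, then clear the state cell.
        for (Long v : buffer.read()) {
          c.output(v);
        }
        buffer.clear();
      }
    }
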
> > > >
> > >
> > > For Spark batch runner, implementing SDF might be still