Re: [component] tag in JIRA tickets

2016-12-16 Thread Amit Sela
On Thu, Dec 15, 2016 at 6:15 AM, Jean-Baptiste Onofré > > wrote: > > > >> Yes, agree. We had kind of similar discussion while ago: > >> "java-sdk-extension" vs "io" afair ;) > >> > >> Regards > >> JB > >> >

Re: [component] tag in JIRA tickets

2016-12-15 Thread Amit Sela
also generic (for instance, > java-sdk-extension is for both extensions and IOs). > > If possible, I would more work on the component (and customize the > Release Notes output to include it). > > Regards > JB > > On 12/15/2016 02:30 PM, Amit Sela wrote: > > I took a look a

[component] tag in JIRA tickets

2016-12-15 Thread Amit Sela
I took a look at the release notes for 0.4.0-incubating now and I felt like it could have been "tagged" in a way that helps people focus on what's interesting to them. Currently, all resolved issues simply appear as they are in JIRA, but we don't have any way to tag them. What if we were to prefix

Re: [VOTE] Release 0.4.0-incubating, release candidate #1

2016-12-15 Thread Amit Sela
I see three problems in the release notes (related to Spark runner): Improvement: [BEAM-757] - The SparkRunner should utilize the SDK's DoFnRunner instead of writing it's own. [BEAM-807] - [SparkRunner] Replace OldDoFn with DoFn [BEAM-855] - Remove the need for --streaming option in

Re: Jenkins build is still unstable: beam_PostCommit_Java_RunnableOnService_Spark #409

2016-12-14 Thread Amit Sela
This relates to my question in Dev list. On Wed, Dec 14, 2016 at 9:07 PM Kenneth Knowles wrote: > This is still https://issues.apache.org/jira/browse/BEAM-1149. We recently > added a test for it. The actual behavior has been broken for everyone for a > while. It is half-fixed by Eugene K. (some

New testSideInputsWithMultipleWindows and should DoFnRunner explode if DoFn contains a side input ?

2016-12-14 Thread Amit Sela
Hi all, Yesterday a new test was added to ParDoTest suite: "testSideInputsWithMultipleWindows". To the best of my understanding, it's meant to test sideInputs for elements in multiple windows (unexploded). The Spark runner uses the DoFnRunner (Simple) to process DoFns, and it will explode compres

Re: Beam Tuple

2016-12-13 Thread Amit Sela
data format extension). > > Regards > JB > > On 12/13/2016 11:06 AM, Amit Sela wrote: > > Hi all, > > > > I was wondering why Beam doesn't have tuples as part of the SDK ? > > To the best of my knowledge all currently supported (OSS) runners: Spark, >

Beam Tuple

2016-12-13 Thread Amit Sela
Hi all, I was wondering why Beam doesn't have tuples as part of the SDK ? To the best of my knowledge all currently supported (OSS) runners: Spark, Flink, Apex provide a Tuple abstraction and I was wondering if Beam should too ? Consider KV for example; it is a special ("*keyed*" by the first fie

Re: PCollection to PCollection Conversion

2016-12-13 Thread Amit Sela
>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles > >> >>>>> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> On this point from Amit and Ismaël, I agree: we could benefit > from a >

Re: New DoFn and WindowedValue/WinowingInternals

2016-12-11 Thread Amit Sela
er - this is exactly what > you're looking for, a new DoFn that with per-runner support is able to emit > multi-windowed values. > On Sun, Dec 11, 2016 at 4:28 AM Amit Sela wrote: > > > Hi all, > > > > I've been working on migrating the Spark runner to new D

New DoFn and WindowedValue/WinowingInternals

2016-12-11 Thread Amit Sela
Hi all, I've been working on migrating the Spark runner to new DoFn and I've stumbled upon a couple of cases where OldDoFn is used in a way that accessed windowInternals (outputWindowedValue) such as AssignWindowsDoFn. Since changing windows is no longer the responsibility of DoFn I was wondering

Re: Jenkins build became unstable: beam_PostCommit_Java_RunnableOnService_Dataflow #1776

2016-12-09 Thread Amit Sela
I see: 502. That’s an error. The server encountered a temporary error and could not complete your request.Please try again in 30 seconds. That’s all we know. So I assume this is temporary... anyone with better Dataflow insights could probably provide more input. On Fri, Dec 9, 2016 at 7:32 P

Re: Performance Benchmarking Beam

2016-12-09 Thread Amit Sela
This is great Jason! Let me know if / how I can assist with Spark, or generally. Thanks, Amit On Thu, Dec 8, 2016 at 9:01 PM Jason Kuster wrote: > Hey all, > > So as I mentioned on Stephen's IO Testing thread a few days ago I've been > doing a bunch of investigating into performance testing fr

Re: [DISCUSS] [BEAM-438] Rename one of PTransform.apply or PInput.apply

2016-12-08 Thread Amit Sela
+1 On Thu, Dec 8, 2016 at 1:27 PM Manu Zhang wrote: > +1 > > Manu > > On Thu, Dec 8, 2016 at 2:40 PM Tyler Akidau > wrote: > > > +1 > > > > On Thu, Dec 8, 2016 at 1:10 PM Jean-Baptiste Onofré > > wrote: > > > > > +1 > > > > > > Regards > > > JB > > > > > > On 12/07/2016 10:37 PM, Kenneth Knowl

Re: Increase stream parallelism after reading from UnboundedSource

2016-12-06 Thread Amit Sela
out "event-at-a-time" processors and everything in-between - such as bundles that may be of size 1, but might contain more elements. On Tue, Dec 6, 2016 at 10:18 PM Raghu Angadi wrote: > On Sun, Dec 4, 2016 at 11:48 PM, Amit Sela wrote: > > > For any downstream computat

Increase stream parallelism after reading from UnboundedSource

2016-12-04 Thread Amit Sela
Hi all, I have a general question about how stream-processing frameworks/engines usually behave in the following scenario: Say I have a Pipeline that consumes from 1 Kafka partition, so that my initial (optimal) parallelism is 1 as well. For any downstream computation, is it common for stream pr

Re: Jenkins build became unstable: beam_PostCommit_MavenVerify #1906

2016-11-26 Thread Amit Sela
Following build #1907 succeeded. Probably just a flake. I'll followup. On Sat, Nov 26, 2016, 13:46 Amit Sela wrote: > Seems to fail on DataflowRunner "WordCountIT.testE2EWordCount". > *Error*: > *Expected: Expected checksum is (508517575eba8d8d5a54f7f0080a00951cf

Fwd: Jenkins build became unstable: beam_PostCommit_MavenVerify #1906

2016-11-26 Thread Amit Sela
Seems to fail on DataflowRunner "WordCountIT.testE2EWordCount". *Error*: *Expected: Expected checksum is (508517575eba8d8d5a54f7f0080a00951cfe84ca)* * but: was (cfdcdcec05fc8424abc168bf5b0c0ed66e376547)* Anyone with access (and knowledge) to the Dataflow runner could take a look ? Thanks! ---

Re: [DISCUSS] Graduation to a top-level project

2016-11-22 Thread Amit Sela
+1, super exciting! Thanks to JB, Davor and the whole team for creating this community. I think we've achieved a lot in a short time. Amit. On Tue, Nov 22, 2016, 20:36 Tyler Akidau wrote: > +1, thanks to everyone who's invested time getting us to this point. :-) > > -Tyler > > On Tue, Nov 22,

Re: Hosting data stores for IO Transform testing

2016-11-22 Thread Amit Sela
Hi Stephen, I was wondering about how we plan to use the data stores across executions. Clearly, it's best to setup a new instance (container) for every test, running a "standalone" store (say HBase/Cassandra for example), and once the test is done, teardown the instance. It should also be agnost

Fwd: github mirroring is broken

2016-11-20 Thread Amit Sela
Courtesy of spark-dev ;-) -- Forwarded message - From: Reynold Xin Date: Sun, Nov 20, 2016 at 10:18 PM Subject: github mirroring is broken To: d...@spark.apache.org FYI Github mirroring from Apache's official git repo to GitHub is broken since Sat Nov 19, and as a result GitHub

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-11-17 Thread Amit Sela
FWIW I'm pretty sure this is Google's gs hdfs connector, and I think this should work for s3, and Azure's is here

Jenkins skipping PreCommit for a PR caused build failures on master.

2016-11-15 Thread Amit Sela
Jenkins had an outage which caused it to skip PreCommit execution for PR #1332 . I reviewed this PR, and merged it. I failed to notice that Jenkins skipped it, and so I merged it with a missing license and checkstyle errors. This spread-out acro

Re: Build failed in Jenkins: beam_PostCommit_MavenVerify #1827

2016-11-15 Thread Amit Sela
A new file is missing license. Weird... PreCommit passed. I'll open a ticket and fix this. On Tue, Nov 15, 2016 at 8:32 PM Jason Kuster wrote: > Investigating... > > On Tue, Nov 15, 2016 at 10:29 AM, Apache Jenkins Server < > jenk...@builds.apache.org> wrote: > > > See

Re: Configuring Jenkins

2016-11-15 Thread Amit Sela
Sweet! Versioning changes in a visible way can save a lot of pain.. Thanks Davor! On Tue, Nov 15, 2016, 18:47 Robert Bradshaw wrote: > This is great; thanks for doing this! > > On Tue, Nov 15, 2016 at 6:43 AM, Dan Halperin > wrote: > > Seems phenomenal! > > > > Reading between the lines of you

Re: [PROPOSAL] Change to KafkaIO splits

2016-11-14 Thread Amit Sela
unner to provide a hint as to how many splits it > > wants. > > This brings it inline with how bounded sources work where the splitting > is > > performed once the pipeline has started. > > > > On Fri, Nov 11, 2016 at 8:09 AM, Amit Sela wrote: > > > >

Re: Introduction + contributing to docs

2016-11-11 Thread Amit Sela
Welcome Melissa! On Fri, Nov 11, 2016, 22:31 Jean-Baptiste Onofré wrote: > Hi Melissa, > > welcome aboard !! > > Regards > JB > > On 11/11/2016 08:11 PM, Melissa Pashniak wrote: > > Hello! > > > > > > My name is Melissa. I’ve previously been involved with Dataflow > > documentation, and I’m exci

Re: [PROPOSAL] Change to KafkaIO splits

2016-11-11 Thread Amit Sela
+1 I think this makes more sense than the existing form of a split that is made of several Kafka partitions since, as mentioned, Kafka partitions are in fact its parallelism. As for supporting a change in the number of partitions (mainly, added partitions), I'll suggest something I brought up bef

Re: SBT/ivy dependency issues

2016-11-10 Thread Amit Sela
> m: > classifier="tests"/> > ext= > "*" conf="" matcher="exact"/> > > "0.3.0-incubating" force="true" conf="test->runtime(*),master(compile)"> > conf="" matcher="

Re: SBT/ivy dependency issues

2016-11-10 Thread Amit Sela
@Abbass/Manu does SBT state that it's looking for *beam-sdks-java-core *& *beam-runners-core-java* but fails to find them ? On Thu, Nov 10, 2016 at 3:43 AM Manu Zhang wrote: > Hi all, > > I tried and reproduced the issue. "sbt-dependency-graph" doesn't show > beam-sdks-java-core > and beam-runn

Re: [DISCUSS] Change "RunnableOnService" To A More Intuitive Name

2016-11-10 Thread Amit Sela
How about @ValidatesRunner ? Seems to complement @NeedsRunner as well. On Thu, Nov 10, 2016 at 9:47 AM Aljoscha Krettek wrote: > +1 > > What I would really like to see is automatic derivation of the capability > matrix from an extended Runner Test Suite. (As outlined in Thomas' doc). > > On Wed,

Re: PCollection to PCollection Conversion

2016-11-09 Thread Amit Sela
I think Jesse has a very good point on one hand, while Luke's and Kenneth's worries about committing users to specific implementations is in place. The Spark community has a 3rd party repository for useful libraries that for various reasons are not a part of the Apache Spark project: https://spark

Re: [PROPOSAL] Merge apex-runner to master branch

2016-11-08 Thread Amit Sela
+1 awesome. Congrats Thomas! On Tue, Nov 8, 2016 at 3:54 PM Thomas Weise wrote: > Hi, > > As per previous discussion [1], I would like to propose to merge the > apex-runner branch into master. The runner satisfies the criteria outlined > in [2] and merging it to master will give more visibility

Re: PAssert.GroupedGlobally defaults to a single empty Iterable.

2016-11-03 Thread Amit Sela
obal window and triggers "never" this actually means that its > internal GBK emits all of its output exactly once, when the window is > expiring. So producing more than one output - even if spread across > microbatches - is actually the trouble. > > On Wed, Nov 2, 2016 at 10

Re: PAssert.GroupedGlobally defaults to a single empty Iterable.

2016-11-02 Thread Amit Sela
ll fail silently. (This was a side effect of > switching > >> from side inputs to groupbykey for this flow, which enabled testing of > >> triggers/panes/etc.) > >> > >> On Wed, Nov 2, 2016 at 5:19 AM, Amit Sela wrote: > >> > >> I've proposed ht

Re: PAssert.GroupedGlobally defaults to a single empty Iterable.

2016-11-02 Thread Amit Sela
this element should be removed. > > Regards > JB > > On 11/02/2016 10:53 AM, Amit Sela wrote: > > Hi all, > > > > I've been looking at PAssert and I've notice that PAssert.GroupedGlobally > > points > > < > https://github.com/apache/i

PAssert.GroupedGlobally defaults to a single empty Iterable.

2016-11-02 Thread Amit Sela
Hi all, I've been looking at PAssert and I've noticed that PAssert.GroupedGlobally points out that it will result in a single empty iterable even if the input PCollection is

Re: [ANNOUNCE] Beam 0.3.0-incubating Released

2016-10-31 Thread Amit Sela
Great! thanks Aljoscha! On Mon, Oct 31, 2016 at 6:58 PM Jean-Baptiste Onofré wrote: > Awesome, great !! > > Thanks Aljoscha for this release ! > Great job team ! > > Regards > JB > > On 10/31/2016 05:36 PM, Aljoscha Krettek wrote: > > Congratulations, team! I just finalised everything for the mo

Re: GitHub mirroring issue

2016-10-26 Thread Amit Sela
Thanks! On Wed, Oct 26, 2016, 23:32 Suneel Marthi wrote: > We have been seeing Github mirroring issues today on other projects too, > filed an Infra jira - INFRA-12830 > > On Wed, Oct 26, 2016 at 4:21 PM, Amit Sela wrote: > > > Hi all, > > > > I've merged

GitHub mirroring issue

2016-10-26 Thread Amit Sela
Hi all, I merged a PR ~2 hours ago, and while the Apache remote seems up-to-date, GitHub didn't update, nor did the PR or JIRA. The last commit hash is: 6db9424 (9f30b21 merge commit). Hopefully this will update after the next commit but FYI I guess. Thanks, Amit

Re: [DISCUSS] Merging master -> feature branch

2016-10-26 Thread Amit Sela
I generally agree with Kenneth. While working on the SparkRunnerV2 branch, it was a pain - I avoided frequent merges to avoid trivial PRs, but it cost me with very large and non-trivial merges later. I think that frequent merges for feature-branches should most of the time be trivial (no conflicts

Re: [VOTE] Release 0.3.0-incubating, release candidate #1

2016-10-26 Thread Amit Sela
+1 (binding) On Wed, Oct 26, 2016 at 10:53 AM Maximilian Michels wrote: > +1 (binding) > > Thanks for managing the release, Aljoscha! > > -Max > > > On Wed, Oct 26, 2016 at 6:46 AM, Jean-Baptiste Onofré > wrote: > > Agree. We already discussed that on the mailing list. I mentioned > this

Re: build failed with dependency problems

2016-10-25 Thread Amit Sela
I just fetched and pulled latest master and build succeeded, maybe try again ? On Wed, Oct 26, 2016 at 9:19 AM Manu Zhang wrote: > Hi All, > > I tried to build latest master but failed with the following dependency > problems. > > [INFO] --- maven-dependency-plugin:2.10:analyze-only (default) @

Re: The Availability of PipelineOptions

2016-10-25 Thread Amit Sela
+1 On Tue, Oct 25, 2016 at 8:43 PM Robert Bradshaw wrote: > +1 > > On Tue, Oct 25, 2016 at 7:26 AM, Thomas Weise wrote: > > +1 > > > > > > On Tue, Oct 25, 2016 at 3:03 AM, Jean-Baptiste Onofré > > wrote: > > > >> +1 > >> > >> Agree > >> > >> Regards > >> JB > >> > >> On Oct 25, 201

Re: [DISCUSS] Current ongoing work on runners

2016-10-25 Thread Amit Sela
SparkRunner status: V1 (Spark 1.6.x - DStream/RDD API): *Batch*: Full model support for batch, continuous ROS testing setup is in process now so that CI will validate constantly. *Streaming*: Supporting UnboundedSource is in review , starting to

Re: [DISCUSS] Deferring (pre) combine for merging windows.

2016-10-22 Thread Amit Sela
ndow)) until the main input window is known. On Fri, Oct 21, 2016 at 3:50 PM, Amit Sela wrote: > Please excuse my typos and apply "s/differ/defer/g" ;-). > Amit. > > On Fri, Oct 21, 2016 at 2:59 PM Amit Sela wrote: > >> I'd like to raise an issue that was disc

Re: [ANNOUNCEMENT] New committers!

2016-10-21 Thread Amit Sela
Congrats and welcome everyone! On Sat, Oct 22, 2016 at 2:04 AM Jesse Anderson wrote: > Thanks for the welcomes everyone! > > On Fri, Oct 21, 2016 at 4:02 PM Mark Liu > wrote: > > > Congrats for all of you! > > > > Mark > > > > On Fri, Oct 21, 2016 at 3:34 PM, Kenneth Knowles > > > wrote: > > >

Re: [DISCUSS] Deferring (pre) combine for merging windows.

2016-10-21 Thread Amit Sela
Please excuse my typos and apply "s/differ/defer/g" ;-). Amit. On Fri, Oct 21, 2016 at 2:59 PM Amit Sela wrote: > I'd like to raise an issue that was discussed in BEAM-696 > <https://issues.apache.org/jira/browse/BEAM-696>. > I won't recap here because

[DISCUSS] Deferring (pre) combine for merging windows.

2016-10-21 Thread Amit Sela
I'd like to raise an issue that was discussed in BEAM-696. I won't recap here because it would be extensive (and probably exhaustive), and I'd also like to restart the discussion here rather than summarize it. *The problem* In the case of (main) inp

[DISCUSS] Executing (Jenkins) RunnableOnService tests more efficiently.

2016-10-20 Thread Amit Sela
Hi all, I'd like to discuss options to execute ROS tests (per runner) more efficiently, and explore the option of running them on PreCommit, as opposed to PostCommit as they run today. The SDK provides a set of tests called "RunnableOnService" (aka ROS) that can be applied to a runner and validat

Re: Start of release 0.3.0-incubating

2016-10-20 Thread Amit Sela
or > > [1] > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338051 > > On Thu, Oct 20, 2016 at 9:32 AM, Amit Sela wrote: > > > +1 > > > > I would like to have my standing PRs merged please - they should

Re: Start of release 0.3.0-incubating

2016-10-20 Thread Amit Sela
+1 I would like to have my standing PRs merged please - they should provide support for UnboundedSource for the SparkRunner. If it won't be ready for merge at the beginning of next week, don't hold for me. Thanks, Amit On Thu, Oct 20, 2016 at 7:27 PM Jean-Baptiste Onofré wrote: > +1 > > Thanks

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Amit Sela
mes as part of one of > kafka libs. > > On Wed, Oct 19, 2016 at 10:49 PM, Amit Sela wrote: > > > The SparkRunner actually has an embedded Kafka for its unit tests. > > > > On Wed, Oct 19, 2016, 20:16 Thomas Weise wrote: > > > > > Kafka can be em

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Amit Sela
out the best > ways > > to build these tests and necessary test infrastructure. (See the > > performance thread Jason started. IMO the most important issue to solve > > first is infrastructure). Please help! > > > > Dan > > > > On Wed, Oct 19, 2016 at 7:37 AM, Th

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Amit Sela
> Thanks for the update and summary. > > > > > > Regards > > > JB > > > > > > On Oct 18, 2016, 20:47, at 20:47, Amit Sela > > wrote: > > >>I wanted to summarize here a couple of important points ra

[DISCUSS] Sources and Runners

2016-10-18 Thread Amit Sela
I wanted to summarize here a couple of important points raised in some PRs I was involved with, and while those PRs were about KafkaIO and related to the Spark/Direct runners, some important notes were made and I think they are worth this thread. *Background*: The KafkaIO waits (5 seconds) before

Re: Exploring Performance Testing

2016-10-18 Thread Amit Sela
@Jesse how about runners "tracing" the constructed DAG (by Beam) so that it's clear what the runner actually executed ? Example: For the SparkRunner, a ParDo translates to a mapPartitions transformation. That could provide transparency when debugging/benchmarking pipelines per-runner. On Tue, Oc

Re: Introduction

2016-10-17 Thread Amit Sela
Done. Feel free to take a peek at the Spark runner since you have Spark experience and that's great! Most open issues are usually automatically assigned to me, but ping me (dev list/Slack) if you want to work on something and are not sure what the status is there. Thanks, Amit On Mon, Oct 17, 2016

Re: [KUDOS] Contributed runner: Apache Apex!

2016-10-17 Thread Amit Sela
Congrats and thanks to everyone who was involved in this effort! On Mon, Oct 17, 2016 at 8:07 PM Neelesh Salian wrote: > Awesome. Great work. > > On Mon, Oct 17, 2016 at 10:03 AM, Aljoscha Krettek > wrote: > > > Congrats! :-) > > > > On Mon, 17 Oct 2016 at 18:55 Kenneth Knowles > > wrote: > >

Re: Simplifying User-Defined Metrics in Beam

2016-10-13 Thread Amit Sela
On Thu, Oct 13, 2016 at 12:27 PM Aljoscha Krettek wrote: > I finally found the time to have a look. :-) > > The API looks very good! (It's very similar to an API we recently added to > Flink, which is inspired by the same Codahale/Dropwizard metrics). > > About the semantics, the "A", "B" and "C"

Re: Simplifying User-Defined Metrics in Beam

2016-10-13 Thread Amit Sela
+1 for the new metrics design. I especially like the Dropwizard-like API, which will probably be more familiar to users. Aggregators can be misleading, for example, to Spark users: Spark Aggregators are Beam Accumulators and Spark Accumulators are Beam Aggregators... Just take into account that runne

Re: [Proposal] Add waitToFinish(), cancel(), waitToRunning() to PipelineResult.

2016-10-13 Thread Amit Sela
ondering could any people help > > on them? > > > > https://issues.apache.org/jira/browse/BEAM-596 > > https://issues.apache.org/jira/browse/BEAM-595 > > https://issues.apache.org/jira/browse/BEAM-593 > > > > Thanks > > -- > > Pei > > >

Re: Introducing a Redistribute transform

2016-10-11 Thread Amit Sela
namic work rebalancing, backups and other magic. So I decided not to > > bother with this in that PR. > > > > Agreed with Robert that limiting the parallelism, or throttling, are very > > useful features, but Redistribute is not the right place to introduce them. > Yo

Re: Introducing a Redistribute transform

2016-10-10 Thread Amit Sela
On Mon, Oct 10, 2016 at 9:21 PM Robert Bradshaw wrote: > On Sat, Oct 8, 2016 at 7:31 AM, Amit Sela wrote: > > > Hi Eugene, > > > > > > This is very interesting. > > > Let me see if I get this right, the "Redistribute" transformation > a

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Amit Sela
on https://s.apache.org/splittable-do-fn? > > On Mon, Oct 10, 2016 at 6:35 AM, Amit Sela wrote: > > Thanks Max! > > > > I'll try to explain Spark's stateful operators and how/why I used them > with > > UnboundedSource. > > > > Spark has tw

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Amit Sela
> thanks for the explanation. > > > > > > For 4, you are right, it's slightly different from DataXchange (related > to > > > the elements in the PCollection). I think storing the "starting point" > for a > > > reader makes sense. > >

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Amit Sela
no records were read (yet), so that the next time the reader attempts to read it will pick up there. This has more to do with how the CheckpointMark handles this. I have to say that I'm not familiar with your DataXchange proposal, I will take a look though. > > > > Regards >

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-09 Thread Amit Sela
e because I have to bound the read somehow or the micro-batch will grow indefinitely. > > I will take a look at the Spark runner.. do you have a branch that you are > working with? > > Raghu. > > > On Sat, Oct 8, 2016 at 11:33 AM, Amit Sela wrote: > > > Some answers

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-08 Thread Amit Sela
Some answers inline. @Raghu I'll review the PR tomorrow. Thanks, Amit On Sat, Oct 8, 2016 at 3:47 AM Raghu Angadi wrote: > On Fri, Oct 7, 2016 at 4:55 PM, Amit Sela wrote: > > >3. Support reading of Kafka partitions that were added to topic/s > while > >

Re: Introducing a Redistribute transform

2016-10-08 Thread Amit Sela
Hi Eugene, This is very interesting. Let me see if I get this right, the "Redistribute" transformation assigns a "running id" key (per-bundle) , calls "Redistribute.byKey", and extracts back the values, correct ? As for "Redistribute.byKey" - it's made of a GroupByKey transformation that follows

[DISCUSS] UnboundedSource and the KafkaIO.

2016-10-07 Thread Amit Sela
I started a thread about (suggesting) UnboundedSource splitId's and it turned into an UnboundedSource/KafkaIO discussion, and I think it's best to start over in a clear [DISCUSS] thread. When working on UnboundedSource support for the Spark runner, I've raised some questions, some were general-Unb

Re: Should UnboundedSource provide a split identifier ?

2016-10-07 Thread Amit Sela
a YARN container as well. On Tue, Oct 4, 2016 at 10:39 PM Raghu Angadi wrote: > On Wed, Sep 14, 2016 at 1:43 PM, Amit Sela wrote: > > > > > > > For generateInitialSplits, the UnboundedSource API doesn't require > > > deterministic splitting (although it&#x

Re: Preferred locations (or data locality) for batch pipelines.

2016-10-03 Thread Amit Sela
class. See here: https://github.com/apache/ > incubator-beam/blob/master/sdks/java/io/hdfs/src/main/ > <https://github.com/apache/incubator-beam/blob/master/sdks/java/io/hdfs/src/main/> > java/org/apache/beam/sdk/io/hdfs/HDFSFileSource.java#L180 > > If that's right, makes sen

Re: We've hit 1000 PRs!

2016-09-27 Thread Amit Sela
That's truly amazing for seven months of work! On Tue, Sep 27, 2016, 22:39 Henry Saputra wrote: > Congrats! > > On Tuesday, September 27, 2016, Jean-Baptiste Onofré > wrote: > > > Amazing. > > > > Congrats to the whole team. It's a great sign about the project activity. > > > > Thanks all ! > >

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-26 Thread Amit Sela
from this Kafka cluster, let's try to be near > it". For example. > > Does that answer the question / did I miss something? > > Thanks, > Dan > > On Thu, Sep 22, 2016 at 8:29 AM, Amit Sela wrote: > > > Generally this makes sense, though I thought that thi

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
:11 PM Jesse Anderson wrote: > I think the runners should. Each framework has put far more effort into > data locality than Beam should. Beam should just take advantage of it. > > On Thu, Sep 22, 2016, 7:57 AM Amit Sela wrote: > > > Not where in the file, where in the clus

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
ea... On Thu, Sep 22, 2016 at 5:45 PM Jesse Anderson wrote: > I've only ever seen that being used to figure out which file the > runner/mapper/operation is working on. Otherwise, I haven't seen those > operations care where in the file they're working. > > On Thu,

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
emove IOChannelFactory, then I would suggest it's > a runner concern. The Read.Bounded should "locate" the bundles on a > executor close to the read data (even if it's not always possible > depending of the source). > > My $0.01 > > Regards > JB > > On 0

Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
It's not new that batch pipelines can optimize on data locality; my question is regarding this responsibility in Beam. If runners should implement a generic Read.Bounded support, should they also implement locating the input blocks ? or should it be a part of IOChannelFactory implementations ? or an

Re: IntervalWindow toString()

2016-09-19 Thread Amit Sela
Looks like included, excluded bounds. On Mon, Sep 19, 2016, 18:57 Jesse Anderson wrote: > The toString() to IntervalWindow starts with a square bracket and ends with > a parenthesis. Is this a type of notation or a bug? Code: > > @Override > public String toString() { > return "[" + star
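The "included, excluded bounds" reading above matches the usual half-open interval notation, where "[" marks a bound that belongs to the interval and ")" marks one that does not. A minimal sketch of that idea follows; the class HalfOpenInterval and its ".." separator are assumptions made for illustration and are not Beam's IntervalWindow.

```java
// Hypothetical illustration; this is not Beam's IntervalWindow class.
final class HalfOpenInterval {
    private final long start; // inclusive lower bound
    private final long end;   // exclusive upper bound

    HalfOpenInterval(long start, long end) {
        if (end < start) {
            throw new IllegalArgumentException("end must not precede start");
        }
        this.start = start;
        this.end = end;
    }

    /** True when start <= t < end. */
    boolean contains(long t) {
        return t >= start && t < end;
    }

    /** Square bracket marks the included bound, parenthesis the excluded one. */
    @Override
    public String toString() {
        return "[" + start + ".." + end + ")";
    }

    public static void main(String[] args) {
        HalfOpenInterval first = new HalfOpenInterval(0, 10);
        HalfOpenInterval second = new HalfOpenInterval(10, 20);
        // The shared boundary value 10 belongs to the second interval only.
        System.out.println(first + " contains 10: " + first.contains(10));
        System.out.println(second + " contains 10: " + second.contains(10));
    }
}
```

With half-open bounds, adjacent intervals never overlap and never leave a gap, which is presumably why the notation shows up for windows.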

Re: FYI: All Runners Tested In Precommit

2016-09-15 Thread Amit Sela
That's great news. Thanks Jason! On Thu, Sep 15, 2016 at 10:16 PM Jason Kuster wrote: > Hi all, > > Just a quick update -- as of yesterday all new PRs now run the WordCount > end-to-end test against every runner in master (Flink, Spark, Dataflow, and > Direct). This is a great milestone for test

Re: Hi

2016-09-15 Thread Amit Sela
Welcome Etienne! On Thu, Sep 15, 2016, 14:34 Jean-Baptiste Onofré wrote: > Hi Etienne, > > welcome aboard (again ;)) ! > > Regards > JB > > On 09/15/2016 10:19 AM, Etienne Chauchot wrote: > > Hi guys, > > > > I'll be working with JB on Beam. I'm just starting, for now I'm writing > > samples on

Re: Should UnboundedSource provide a split identifier ?

2016-09-14 Thread Amit Sela
> should keep track of the initially generated splits. > If the splitting were to be consistent, in such way that newly added partitions would be assigned with a new "splitId" while existing ones would still be assigned with the same (consistent) splitId, it could support newly added pa

Re: Should UnboundedSource provide a split identifier ?

2016-09-13 Thread Amit Sela
If I understand correctly this will break https://github.com/apache/incubator-beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L857 in KafkaIO. So it's a KafkaIO limitation (for now ?) ? On Tue, Sep 13, 2016 at 11:31 AM Amit Sela wrote: >

Re: Should UnboundedSource provide a split identifier ?

2016-09-13 Thread Amit Sela
va#L164 > and > > https://github.com/apache/incubator-beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedSourceWrapper.java#L334 > ) > > On Mon, Sep 12, 2016 at 11:40 AM, Amit Sela wrote: > > > If

Re: Should UnboundedSource provide a split identifier ?

2016-09-12 Thread Amit Sela
If this issue doesn't make sense for "native" streaming systems, and it's only a Spark issue (and my implementation of Read.Unbounded) - I could keep doing what I do, use a running id. I was just wondering... ( hence the question mark in the title ;-) ) On Mon, Sep 12, 2016

Re: Should UnboundedSource provide a split identifier ?

2016-09-12 Thread Amit Sela
atest checkpoint. On Mon, Sep 12, 2016 at 9:15 PM Raghu Angadi wrote: > On Wed, Sep 7, 2016 at 7:13 AM, Amit Sela wrote: > > > @Raghu, hashing is exactly what I mean, but I'm asking > > if it should be abstracted in the Source.. Otherwise, runners will have > to > &

Re: Should UnboundedSource provide a split identifier ?

2016-09-07 Thread Amit Sela
in partitions. > > On Tue, Sep 6, 2016 at 4:04 PM, Eugene Kirpichov < > kirpic...@google.com.invalid> wrote: > > > Oh! Okay, looks like this is a part of the code I was unfamiliar with. > I'd > > like to know the answer too, in this case. > > +Daniel Mills

Re: Should UnboundedSource provide a split identifier ?

2016-09-06 Thread Amit Sela
checkpointing each of them independently", so their order > shouldn't matter. > > On Tue, Sep 6, 2016 at 3:17 PM Amit Sela wrote: > > > UnboundedSources generate initial splits, which are basically splits of > > themselves - for example, if an UnboundedKafkaSource is set to

Should UnboundedSource provide a split identifier ?

2016-09-06 Thread Amit Sela
UnboundedSources generate initial splits, which are basically splits of themselves - for example, if an UnboundedKafkaSource is set to read from topic T1 which is made of 2 partitions P1 and P2, it will (optimally) split into two UnboundedKafkaSource, one per partition. During the lifecycle of the
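The one-split-per-partition example above, together with the split identifier the subject line asks about, could be sketched roughly as follows. Everything here is an assumption for illustration: PartitionSplits, Split, splitId and this generateInitialSplits signature are hypothetical and do not correspond to Beam's UnboundedSource or KafkaIO APIs, and partitions are numbered from 0 rather than named P1 and P2.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names do not correspond to Beam's UnboundedSource or KafkaIO APIs.
final class PartitionSplits {
    /** One split of a topic: reads exactly one partition. */
    static final class Split {
        final String topic;
        final int partition;

        Split(String topic, int partition) {
            this.topic = topic;
            this.partition = partition;
        }

        /** Identifier derived only from what the split reads, so it does not depend on list order. */
        String splitId() {
            return topic + "-" + partition;
        }
    }

    /** Splits a topic into one source per partition. */
    static List<Split> generateInitialSplits(String topic, int numPartitions) {
        List<Split> splits = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            splits.add(new Split(topic, p));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Topic T1 with two partitions yields two splits, each with its own identifier.
        for (Split s : generateInitialSplits("T1", 2)) {
            System.out.println(s.splitId());
        }
    }
}
```

An identifier built this way stays the same if the splits are regenerated in a different order, which is one way a runner could match a split to previously checkpointed state.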

Re: Remove legacy import-order?

2016-08-24 Thread Amit Sela
+1 on import order as well. Kenneth has a good point about history if we reformat. On Wed, Aug 24, 2016, 18:59 Kenneth Knowles wrote: > +1 to import order > > I don't care about actually enforcing formatting, but would add it to IDE > tips and just make it an "OK topic for code review". Enforcin

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-12 Thread Amit Sela
+1 as in I'll join ;-) On Fri, Aug 12, 2016, 19:14 Eugene Kirpichov wrote: > Sounds good, thanks! > Then Friday Aug 19th it is, 8am-9am PST, > https://staging.talkgadget.google.com/hangouts/_/google.com/splittabledofn > > On Thu, Aug 11, 2016 at 11:12 PM Jean-Baptiste Onofré > wrote: > > > Hi >

Re: [Proposal] Pipelines and their executions naming changes.

2016-08-09 Thread Amit Sela
Currently, the Spark runner extends ApplicationNameOptions, PipelineOptions and StreamingOptions. Any unification of naming conventions is great IMO, and the runner will inherit them as it is. As for appName/pipelineName - appName is the same as Spark's app name, but I can live happily with pipelin

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-08 Thread Amit Sela
ficient override for this transform too; worst > case, we'd have two overloads - one for a fixed filepattern and one for a > PCollection of filepatterns, only one of these efficiently overridden by > the Spark runner. Does this make sense? > > Thanks. > > On Mon, Aug 8, 2016 at

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-08 Thread Amit Sela
Hi Eugene, I really like the proposal, especially the part of embedding a non-Beam job and export jobs prior to pipeline execution - up until now, such work would have been managed by some 3rd party orchestrator that monitors the end of the preceding job, and then executes the pipeline. Having th

Re: Proposal: Dynamic PIpelineOptions

2016-08-07 Thread Amit Sela
+1 sounds like a good idea. Spark's driver actually takes all dynamic parameters starting with "spark." and propagates them into SparkConf which is propagated onto the Executors and is available via the environment's SparkEnv. I'm wondering, does this mean that PipelineOption will be available to
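The prefix-based propagation described above (collecting every parameter whose name starts with "spark.") can be illustrated with plain Java properties. This is only a sketch of the filtering idea under stated assumptions: PrefixedOptions and withPrefix are hypothetical names, and the code does not use Spark's SparkConf or SparkEnv.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical sketch of prefix-based propagation; it does not use Spark's SparkConf class.
final class PrefixedOptions {
    /** Returns every property whose key starts with the given prefix, sorted by key. */
    static Map<String, String> withPrefix(Properties props, String prefix) {
        Map<String, String> selected = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                selected.put(key, props.getProperty(key));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("spark.executor.memory", "2g");
        props.setProperty("spark.app.name", "demo");
        props.setProperty("user.name", "amit");
        // Only the two "spark."-prefixed entries are selected for propagation.
        System.out.println(withPrefix(props, "spark."));
    }
}
```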

Re: [PROPOSAL] Having 2 Spark runners to support Spark 1 users while advancing towards better streaming implementation with Spark 2

2016-08-05 Thread Amit Sela
spark 1) ? > > > - We must probably make clear for users the advantages/disadvantages of > > > both versions of the runner, and make clear that the spark 1 runner > will > > be > > > almost on maintenance mode (with not many new features). > > > - We mu

[PROPOSAL] Having 2 Spark runners to support Spark 1 users while advancing towards better streaming implementation with Spark 2

2016-08-03 Thread Amit Sela
After discussions with JB, and understanding that a lot of companies running Spark will probably run 1.6.x for a while, we thought it would be a good idea to have (some) support for both branches. The SparkRunnerV1 will mostly support Batch, but could also support “KeyedState” workflows and Sessio

Re: [REFLECT] Beam’s Half Birthday!

2016-08-01 Thread Amit Sela
Sounds great! The only "numbers" I can think of that are missing are git stars/forks. Thanks, Amit On Mon, Aug 1, 2016 at 6:36 PM Aljoscha Krettek wrote: > +1 > > This sounds very good, I can't come up with anything that you missed. > > On Mon, 1 Aug 2016 at 08:00 Jean-Baptiste Onofré wrote:

Re: [DISCUSS] cluster infrastructure - resource manager - for on going tests

2016-07-28 Thread Amit Sela
ARN. The question is, should the tests be run on > 'Standalone' > > OR YARN' or maybe we can have tests for 'Standalone AND YARN' ? > > > > Ismael. > > > > > > > > > > On Thu, Jul 28, 2016 at 12:24 PM, Amit Sela > wrote
