Re: KafkaIO Windowing Fn

2016-08-25 Thread Chawla,Sumit
Hi Thomas

I am using FlinkRunner.  Yes the second part of trigger never fires for me,

Regards
Sumit Chawla


On Thu, Aug 25, 2016 at 4:18 PM, Thomas Groh 
wrote:

> Hey Sumit;
>
> What runner are you using? I can set up a test with the same trigger
> reading from an unbounded input using the DirectRunner and I get the
> expected output panes.
>
> Just to clarify, the second half of the trigger ('when the first element
> has been there for at least 30+ seconds') simply never fires?
>
> On Thu, Aug 25, 2016 at 2:38 PM, Chawla,Sumit 
> wrote:
>
> > Hi Thomas
> >
> > That did not work.
> >
> > I tried following instead:
> >
> > .triggering(
> > Repeatedly.forever(
> > AfterFirst.of(
> >   AfterProcessingTime.
> pastFirstElementInPane()
> > .plusDelayOf(Duration.standard
> > Seconds(30)),
> >   AfterPane.elementCountAtLeast(100)
> > )))
> > .discardingFiredPanes()
> >
> > What i am trying to do here.  This is to make sure that followup
> > operations receive batches of records.
> >
> > 1.  Fire when at Pane has 100+ elements
> >
> > 2.  Or Fire when the first element has been there for atleast 30 sec+.
> >
> > However,  2 point does not seem to work.  e.g. I have 540 records in
> > Kafka.  The first 500 records are available immediately,
> >
> > but the remaining 40 don't pass through. I was expecting 2nd to
> > trigger to help here.
> >
> >
> >
> >
> >
> >
> >
> > Regards
> > Sumit Chawla
> >
> >
> > On Thu, Aug 25, 2016 at 1:13 PM, Thomas Groh 
> > wrote:
> >
> > > You can adjust the trigger in the windowing transform if your sink can
> > > handle being written to multiple times for the same window. For
> example,
> > if
> > > the sink appends to the output when it receives new data in a window,
> you
> > > could add something like
> > >
> > > Window.into(...).withAllowedLateness(...).triggering(AfterWatermark.
> > > pastEndOfWindow().withEarlyFirings(AfterProcessingTime.
> > > pastFirstElementInPane().withDelayOf(Duration.standardSeconds(5))).
> > > withLateFirings(AfterPane.elementCountAtLeast(1))).discardin
> > gFiredPanes();
> > >
> > > This will cause elements to be output some amount of time after they
> are
> > > first received from Kafka, even if Kafka does not have any new
> elements.
> > > Elements will only be output by the GroupByKey once.
> > >
> > > We should still have a JIRA to improve the KafkaIO watermark tracking
> in
> > > the absence of new records .
> > >
> > > On Thu, Aug 25, 2016 at 10:29 AM, Chawla,Sumit  >
> > > wrote:
> > >
> > > > Thanks Raghu.
> > > >
> > > > I don't have much control over changing KafkaIO properties.  I added
> > > > KafkaIO code for completing the example.  Are there any changes that
> > can
> > > be
> > > > done to Windowing to achieve the same behavior?
> > > >
> > > > Regards
> > > > Sumit Chawla
> > > >
> > > >
> > > > On Wed, Aug 24, 2016 at 5:06 PM, Raghu Angadi
> >  > > >
> > > > wrote:
> > > >
> > > > > The default implementation returns processing timestamp of the last
> > > > record
> > > > > (in effect. more accurately it returns same as getTimestamp(),
> which
> > > > might
> > > > > overridden by user).
> > > > >
> > > > > As a work around, yes, you can provide your own watermarkFn that
> > > > > essentially returns Now() or Now()-1sec. (usage in javadoc
> > > > >  > > > > sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/
> > > > > kafka/KafkaIO.java#L138>
> > > > > )
> > > > >
> > > > > I think default watermark should be smarter. it should advance to
> > > current
> > > > > time if there aren't any records to read from Kafka. Could you
> file a
> > > > jira?
> > > > >
> > > > > thanks,
> > > > > Raghu.
> > > > >
> > > > > On Wed, Aug 24, 2016 at 2:10 PM, Chawla,Sumit <
> > sumitkcha...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi All
> > > > > >
> > > > > >
> > > > > > I am trying to do some simple batch processing on KafkaIO
> records.
> > > My
> > > > > beam
> > > > > > pipeline looks like following:
> > > > > >
> > > > > > pipeline.apply(KafkaIO.read()
> > > > > > .withTopics(ImmutableList.of(s"mytopic"))
> > > > > > .withBootstrapServers("localhost:9200")
> > > > > > .apply("ExtractMessage", ParDo.of(new ExtractKVMessage())) //
> > Emits a
> > > > > > KV
> > > > > >
> > > > > > .apply("WindowBy10Sec", Window. > > > > > JSONObject>>into(FixedWindows.of(Duration.standardSeconds(
> > > > > > 10))).withAllowedLateness(Duration.standardSeconds(1)))
> > > > > >
> > > > > > .apply("GroupByKey", GroupByKey.create())
> > > > > >
> > > > > > .apply("Sink", ParDo.of(new MySink())
> > > > > >
> > > > > >
> > > > > > My Kafka Source already has some messages 1000+, and new 

Re: KafkaIO Windowing Fn

2016-08-25 Thread Raghu Angadi
On Thu, Aug 25, 2016 at 1:13 PM, Thomas Groh 
wrote:

>
> We should still have a JIRA to improve the KafkaIO watermark tracking in
> the absence of new records .
>

filed https://issues.apache.org/jira/browse/BEAM-591

I don't want to hijack this thread Sumit's primary issue, but want to
mention related concerns here, which could be discussed on a new thread or
on the jira:
from the jira description :

A user can provide functions to calculate watermark and record timestamps.
There are a few concerns with current design:

   - What should happen when a kafka topic is idle:
  - in default case, I think watermark should advance to current time.
  - What should happen when user has provided a function to calculate
  record timestamp?
 - Should the watermark stay same as record timestamp?
 - same when user has provided own watermark function?
  - Are the current semantics of user provided watermark function
   correct?
  - it is run once for each record read.
  - Should it instead be run inside getWatermark() called by the runner
  (we could still provide the last user record, and its timestamp).


Re: KafkaIO Windowing Fn

2016-08-25 Thread Chawla,Sumit
Hi Thomas

That did not work.

I tried following instead:

.triggering(
Repeatedly.forever(
AfterFirst.of(
  AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(30)),
  AfterPane.elementCountAtLeast(100)
)))
.discardingFiredPanes()

What i am trying to do here.  This is to make sure that followup
operations receive batches of records.

1.  Fire when at Pane has 100+ elements

2.  Or Fire when the first element has been there for atleast 30 sec+.

However,  2 point does not seem to work.  e.g. I have 540 records in
Kafka.  The first 500 records are available immediately,

but the remaining 40 don't pass through. I was expecting 2nd to
trigger to help here.







Regards
Sumit Chawla


On Thu, Aug 25, 2016 at 1:13 PM, Thomas Groh 
wrote:

> You can adjust the trigger in the windowing transform if your sink can
> handle being written to multiple times for the same window. For example, if
> the sink appends to the output when it receives new data in a window, you
> could add something like
>
> Window.into(...).withAllowedLateness(...).triggering(AfterWatermark.
> pastEndOfWindow().withEarlyFirings(AfterProcessingTime.
> pastFirstElementInPane().withDelayOf(Duration.standardSeconds(5))).
> withLateFirings(AfterPane.elementCountAtLeast(1))).discardingFiredPanes();
>
> This will cause elements to be output some amount of time after they are
> first received from Kafka, even if Kafka does not have any new elements.
> Elements will only be output by the GroupByKey once.
>
> We should still have a JIRA to improve the KafkaIO watermark tracking in
> the absence of new records .
>
> On Thu, Aug 25, 2016 at 10:29 AM, Chawla,Sumit 
> wrote:
>
> > Thanks Raghu.
> >
> > I don't have much control over changing KafkaIO properties.  I added
> > KafkaIO code for completing the example.  Are there any changes that can
> be
> > done to Windowing to achieve the same behavior?
> >
> > Regards
> > Sumit Chawla
> >
> >
> > On Wed, Aug 24, 2016 at 5:06 PM, Raghu Angadi  >
> > wrote:
> >
> > > The default implementation returns processing timestamp of the last
> > record
> > > (in effect. more accurately it returns same as getTimestamp(), which
> > might
> > > overridden by user).
> > >
> > > As a work around, yes, you can provide your own watermarkFn that
> > > essentially returns Now() or Now()-1sec. (usage in javadoc
> > >  > > sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/
> > > kafka/KafkaIO.java#L138>
> > > )
> > >
> > > I think default watermark should be smarter. it should advance to
> current
> > > time if there aren't any records to read from Kafka. Could you file a
> > jira?
> > >
> > > thanks,
> > > Raghu.
> > >
> > > On Wed, Aug 24, 2016 at 2:10 PM, Chawla,Sumit 
> > > wrote:
> > >
> > > > Hi All
> > > >
> > > >
> > > > I am trying to do some simple batch processing on KafkaIO records.
> My
> > > beam
> > > > pipeline looks like following:
> > > >
> > > > pipeline.apply(KafkaIO.read()
> > > > .withTopics(ImmutableList.of(s"mytopic"))
> > > > .withBootstrapServers("localhost:9200")
> > > > .apply("ExtractMessage", ParDo.of(new ExtractKVMessage())) // Emits a
> > > > KV
> > > >
> > > > .apply("WindowBy10Sec", Window. > > > JSONObject>>into(FixedWindows.of(Duration.standardSeconds(
> > > > 10))).withAllowedLateness(Duration.standardSeconds(1)))
> > > >
> > > > .apply("GroupByKey", GroupByKey.create())
> > > >
> > > > .apply("Sink", ParDo.of(new MySink())
> > > >
> > > >
> > > > My Kafka Source already has some messages 1000+, and new messages
> > arrive
> > > > every few minutes.
> > > >
> > > > When i start my pipeline,  i can see that it reads all the 1000+
> > messages
> > > > from Kafka.  However, Window does not fire untill a new message
> arrives
> > > in
> > > > Kafka.  And Sink does not receive any message until that point.  Do i
> > > need
> > > > to override the WaterMarkFn here? Since i am not providing any
> > > timeStampFn
> > > > , i am assuming that timestamps will be assigned as in when message
> > > arrives
> > > > i.e. ingestion time.  What is the default WaterMarkFn implementation?
> > Is
> > > > the Window not supposed to be fired based on Ingestion time?
> > > >
> > > >
> > > >
> > > >
> > > > Regards
> > > > Sumit Chawla
> > > >
> > >
> >
>


Re: TextIO.Read.Bound class

2016-08-25 Thread Gaurav Gupta
I will create an issue and then work on it too..

Thanks
Gaurav

On Thu, Aug 25, 2016 at 12:47 PM, Lukasz Cwik 
wrote:

> Yes, that makes sense, feel free to create an issue or create a PR
> resolving this discrepancy.
> Taking a pass over the existing IO transforms would also be helpful.
>
> On Thu, Aug 25, 2016 at 11:42 AM, Gaurav Gupta 
> wrote:
>
> > Hi All,
> >
> > I am new to Apache beam and I was going through the word count example. I
> > found that TextIO.Read.Bound is used to read file.
> >
> > Should TextIO.Read.Bound not extend PTransform
> > instead of PTransform similar to KafkaIO.Read
> and
> > JMSIO.Read that extend PTransform?
> >
> > Thanks
> > Gaurav
> >
>


Re: TextIO.Read.Bound class

2016-08-25 Thread Lukasz Cwik
Yes, that makes sense, feel free to create an issue or create a PR
resolving this discrepancy.
Taking a pass over the existing IO transforms would also be helpful.

On Thu, Aug 25, 2016 at 11:42 AM, Gaurav Gupta 
wrote:

> Hi All,
>
> I am new to Apache beam and I was going through the word count example. I
> found that TextIO.Read.Bound is used to read file.
>
> Should TextIO.Read.Bound not extend PTransform
> instead of PTransform similar to KafkaIO.Read  and
> JMSIO.Read that extend PTransform?
>
> Thanks
> Gaurav
>


TextIO.Read.Bound class

2016-08-25 Thread Gaurav Gupta
Hi All,

I am new to Apache beam and I was going through the word count example. I
found that TextIO.Read.Bound is used to read file.

Should TextIO.Read.Bound not extend PTransform
instead of PTransform similar to KafkaIO.Read  and
JMSIO.Read that extend PTransform?

Thanks
Gaurav


Re: KafkaIO Windowing Fn

2016-08-25 Thread Chawla,Sumit
Thanks Raghu.

I don't have much control over changing KafkaIO properties.  I added
KafkaIO code for completing the example.  Are there any changes that can be
done to Windowing to achieve the same behavior?

Regards
Sumit Chawla


On Wed, Aug 24, 2016 at 5:06 PM, Raghu Angadi 
wrote:

> The default implementation returns processing timestamp of the last record
> (in effect. more accurately it returns same as getTimestamp(), which might
> overridden by user).
>
> As a work around, yes, you can provide your own watermarkFn that
> essentially returns Now() or Now()-1sec. (usage in javadoc
>  sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/
> kafka/KafkaIO.java#L138>
> )
>
> I think default watermark should be smarter. it should advance to current
> time if there aren't any records to read from Kafka. Could you file a jira?
>
> thanks,
> Raghu.
>
> On Wed, Aug 24, 2016 at 2:10 PM, Chawla,Sumit 
> wrote:
>
> > Hi All
> >
> >
> > I am trying to do some simple batch processing on KafkaIO records.  My
> beam
> > pipeline looks like following:
> >
> > pipeline.apply(KafkaIO.read()
> > .withTopics(ImmutableList.of(s"mytopic"))
> > .withBootstrapServers("localhost:9200")
> > .apply("ExtractMessage", ParDo.of(new ExtractKVMessage())) // Emits a
> > KV
> >
> > .apply("WindowBy10Sec", Window. > JSONObject>>into(FixedWindows.of(Duration.standardSeconds(
> > 10))).withAllowedLateness(Duration.standardSeconds(1)))
> >
> > .apply("GroupByKey", GroupByKey.create())
> >
> > .apply("Sink", ParDo.of(new MySink())
> >
> >
> > My Kafka Source already has some messages 1000+, and new messages arrive
> > every few minutes.
> >
> > When i start my pipeline,  i can see that it reads all the 1000+ messages
> > from Kafka.  However, Window does not fire untill a new message arrives
> in
> > Kafka.  And Sink does not receive any message until that point.  Do i
> need
> > to override the WaterMarkFn here? Since i am not providing any
> timeStampFn
> > , i am assuming that timestamps will be assigned as in when message
> arrives
> > i.e. ingestion time.  What is the default WaterMarkFn implementation? Is
> > the Window not supposed to be fired based on Ingestion time?
> >
> >
> >
> >
> > Regards
> > Sumit Chawla
> >
>


Re: Configuring IntelliJ to enforce checkstyle rules

2016-08-25 Thread Jean-Baptiste Onofré

Good idea.

Regards
JB

On 08/25/2016 05:51 PM, Seetharam Venkatesh wrote:

You can also export idea settings and interested developers can import it
instead.

Adding check style slows down idea significantly in my experience but ymmv.

On Wed, Aug 24, 2016 at 8:29 AM Kenneth Knowles 
wrote:


Nice step-by-step.

+1 to adding tips for particular IDEs in the contribution guide.

On Wed, Aug 24, 2016 at 7:48 AM, Jean-Baptiste Onofré 
wrote:


Hi Stas,

Thanks for sharing !

As discussed with Amit on Hangout (and indirectly with you ;)), it's what
I'm using in my config.

Some stuff that I added on IntelliJ:
- in Editor -> Code Style -> Java -> Tabs and Indents, I disabled "Use

tab

character", and defined 2 for "Tab Size" and "Indent", and 4 for
"Continuation indent".
- in Editor -> Code Style -> Java -> Imports, I changed the layout to
match the checkstyle definition (static first, ,
org.apache.beam.*, , com.google.*, ...)
- in Editor -> Code Style -> Java -> Wrapping and Braces, I enabled
"Ensure right margin is not exceeded"

I also enabled checkstyle in code inspection, and the checkstyle button
(next to the terminal button in the button bar).

Related to the discussion about checkstyle, I think it makes sense to add
a section about "Configuring IDE" in the contribution guide.

WDYT ?

Regards
JB


On 08/24/2016 04:35 PM, Stas Levin wrote:


Hi guys,

Having IntelliJ enforce Beam's Checkstyle rules turned out to be very
useful for me, so I figured I'd share the steps just in case.

   1. Install the Checkstyle plugin
  1. Select the *Plugins* menu from the *Preferences* (Cmd+",")
  2. Click "*Browse Repositories*"
  3. Type "*Checkstyle-IDEA*" in the search box and select the

search

  result
  4. Click "*install*"
  5. *Restart* Intellij
   2. Activate Beam's checkstyle rules
  1. Select the newly added "*Checkstyle*" menu from the Preferences
  (Cmd+",") menu
  2. Hit the "*+*" icon to add a checkstyle file, *select Beam's
  checkstyle.xm*l
  "..incubator-beam/sdks/java/build-tools/src/main/resources/
beam/checkstyle.xml"
  3. Check the "*Active*" checkbox to activate
   3. Done!

Hope you find this useful.

Regards,
Stas



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Configuring IntelliJ to enforce checkstyle rules

2016-08-25 Thread Seetharam Venkatesh
You can also export idea settings and interested developers can import it
instead.

Adding check style slows down idea significantly in my experience but ymmv.

On Wed, Aug 24, 2016 at 8:29 AM Kenneth Knowles 
wrote:

> Nice step-by-step.
>
> +1 to adding tips for particular IDEs in the contribution guide.
>
> On Wed, Aug 24, 2016 at 7:48 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi Stas,
> >
> > Thanks for sharing !
> >
> > As discussed with Amit on Hangout (and indirectly with you ;)), it's what
> > I'm using in my config.
> >
> > Some stuff that I added on IntelliJ:
> > - in Editor -> Code Style -> Java -> Tabs and Indents, I disabled "Use
> tab
> > character", and defined 2 for "Tab Size" and "Indent", and 4 for
> > "Continuation indent".
> > - in Editor -> Code Style -> Java -> Imports, I changed the layout to
> > match the checkstyle definition (static first, ,
> > org.apache.beam.*, , com.google.*, ...)
> > - in Editor -> Code Style -> Java -> Wrapping and Braces, I enabled
> > "Ensure right margin is not exceeded"
> >
> > I also enabled checkstyle in code inspection, and the checkstyle button
> > (next to the terminal button in the button bar).
> >
> > Related to the discussion about checkstyle, I think it makes sense to add
> > a section about "Configuring IDE" in the contribution guide.
> >
> > WDYT ?
> >
> > Regards
> > JB
> >
> >
> > On 08/24/2016 04:35 PM, Stas Levin wrote:
> >
> >> Hi guys,
> >>
> >> Having IntelliJ enforce Beam's Checkstyle rules turned out to be very
> >> useful for me, so I figured I'd share the steps just in case.
> >>
> >>1. Install the Checkstyle plugin
> >>   1. Select the *Plugins* menu from the *Preferences* (Cmd+",")
> >>   2. Click "*Browse Repositories*"
> >>   3. Type "*Checkstyle-IDEA*" in the search box and select the
> search
> >>   result
> >>   4. Click "*install*"
> >>   5. *Restart* Intellij
> >>2. Activate Beam's checkstyle rules
> >>   1. Select the newly added "*Checkstyle*" menu from the Preferences
> >>   (Cmd+",") menu
> >>   2. Hit the "*+*" icon to add a checkstyle file, *select Beam's
> >>   checkstyle.xm*l
> >>   "..incubator-beam/sdks/java/build-tools/src/main/resources/
> >> beam/checkstyle.xml"
> >>   3. Check the "*Active*" checkbox to activate
> >>3. Done!
> >>
> >> Hope you find this useful.
> >>
> >> Regards,
> >> Stas
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>