On Tue, Mar 21, 2017 at 7:26 PM Mingmin Xu <mingm...@gmail.com> wrote:
> Move discuss to dev-list > > Savepoint in Flink, also checkpoint in Spark, should be good enough to > handle this case. > > When people don't enable these features, for example only need at-most-once > The Spark runner forces checkpointing on any streaming (Beam) application, mostly because it uses mapWithState for reading from UnboundedSource and updateStateByKey form GroupByKey - so by design, Spark runner is at-least-once. Generally, I always thought that applications that require at-most-once are more focused on processing time only, as they only care about whatever get's ingested into the pipeline at a specific time and don't care (up to the point of losing data) about correctness. I would be happy to hear more about your use case. > semantic, each unbounded IO should try its best to restore from last > offset, although CheckpointMark is null. Any ideas? > > Mingmin > > On Tue, Mar 21, 2017 at 9:39 AM, Dan Halperin <dhalp...@apache.org> wrote: > > > hey, > > > > The native Beam UnboundedSource API supports resuming from checkpoint -- > > that specifically happens here > > < > https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L674> > when > > the KafkaCheckpointMark is non-null. > > > > The FlinkRunner should be providing the KafkaCheckpointMark from the most > > recent savepoint upon restore. > > > > There shouldn't be any "special" Flink runner support needed, nor is the > > State API involved. > > > > Dan > > > > On Tue, Mar 21, 2017 at 9:01 AM, Jean-Baptiste Onofré <j...@nanthrax.net> > > wrote: > > > >> Would not it be Flink runner specific ? > >> > >> Maybe the State API could do the same in a runner agnostic way (just > >> thinking loud) ? > >> > >> Regards > >> JB > >> > >> On 03/21/2017 04:56 PM, Mingmin Xu wrote: > >> > >>> From KafkaIO itself, looks like it either start_from_beginning or > >>> start_from_latest. It's designed to leverage > >>> `UnboundedSource.CheckpointMark` > >>> during initialization, but so far I don't see it's provided by runners. > >>> At the > >>> moment Flink savepoints is a good option, created a JIRA(BEAM-1775 > >>> <https://issues.apache.org/jira/browse/BEAM-1775>) to handle it in > >>> KafkaIO. > >>> > >>> Mingmin > >>> > >>> On Tue, Mar 21, 2017 at 3:40 AM, Aljoscha Krettek <aljos...@apache.org > >>> <mailto:aljos...@apache.org>> wrote: > >>> > >>> Hi, > >>> Are you using Flink savepoints [1] when restoring your application? > >>> If you > >>> use this the Kafka offset should be stored in state and it should > >>> restart > >>> from the correct position. > >>> > >>> Best, > >>> Aljoscha > >>> > >>> [1] > >>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/ > >>> setup/savepoints.html > >>> <https://ci.apache.org/projects/flink/flink-docs-release-1.3 > >>> /setup/savepoints.html> > >>> > On 21 Mar 2017, at 01:50, Jins George <jins.geo...@aeris.net > >>> <mailto:jins.geo...@aeris.net>> wrote: > >>> > > >>> > Hello, > >>> > > >>> > I am writing a Beam pipeline(streaming) with Flink runner to > >>> consume data > >>> from Kafka and apply some transformations and persist to Hbase. > >>> > > >>> > If I restart the application ( due to failure/manual restart), > >>> consumer > >>> does not resume from the offset where it was prior to restart. It > >>> always > >>> resume from the latest offset. > >>> > > >>> > If I enable Flink checkpionting with hdfs state back-end, system > >>> appears > >>> to be resuming from the earliest offset > >>> > > >>> > Is there a recommended way to resume from the offset where it was > >>> stopped ? > >>> > > >>> > Thanks, > >>> > Jins George > >>> > >>> > >>> > >>> > >>> -- > >>> ---- > >>> Mingmin > >>> > >> > >> -- > >> Jean-Baptiste Onofré > >> jbono...@apache.org > >> http://blog.nanthrax.net > >> Talend - http://www.talend.com > >> > > > > > > > -- > ---- > Mingmin >