+1 This sounds very good and there is a clear implementation path!
> On 24. Jun 2017, at 20:55, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
> Fair enough ;)
>
> Let me review the different Jira and provide some feedback.
>
> Regards
> JB
>
> On Jun 24, 2017, 20:54, at 20:54, Eugene Kirpichov <kirpic...@google.com.INVALID> wrote:
>> Hi JB,
>> I haven't yet thought about how this work can be parallelized. For now I'd
>> like to just get feedback on the approach :)
>> But glad that you're willing to help out - let's discuss this too a bit later!
>>
>> On Sat, Jun 24, 2017 at 11:51 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>>> Thanks Eugene
>>>
>>> I will pick up some.
>>>
>>> Regards
>>> JB
>>>
>>> On Jun 24, 2017, 20:00, at 20:00, Eugene Kirpichov <kirpic...@google.com.INVALID> wrote:
>>>> Filed JIRAs for the proposed features and linked with the doc:
>>>> https://issues.apache.org/jira/browse/BEAM-2511 TextIO should support reading a PCollection of filenames
>>>> https://issues.apache.org/jira/browse/BEAM-2512 TextIO should support watching for new files
>>>> https://issues.apache.org/jira/browse/BEAM-2513 TextIO should support watching files for new entries
>>>>
>>>> On Fri, Jun 23, 2017 at 4:32 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've written up a proposal for incrementally delivering a bunch of useful
>>>>> new features in TextIO based on Splittable DoFn. It's applicable to other
>>>>> file-based connectors, TextIO is just one good example. Let me know what
>>>>> you think!
>>>>>
>>>>> https://s.apache.org/textio-sdf
>>>>>
>>>>> Copy of abstract:
>>>>>
>>>>> Users have often expressed interest in several new features for reading
>>>>> files - in particular, incremental reading of log files (streaming of new
>>>>> files matching a pattern and new entries in each file) and reading a
>>>>> PCollection of filenames (in particular, an unbounded collection arriving
>>>>> from a stream such as PubSub or Kafka).
>>>>>
>>>>> Splittable DoFn <http://s.apache.org/splittable-do-fn> (SDF) enables
>>>>> these features. This document proposes an API for them, using the example
>>>>> of TextIO, and proposes a plan for delivering them subject to
>>>>> availability of SDF in different runners. Some availability constraints are
>>>>> circumvented by Running Splittable DoFn via Source API
>>>>> <http://s.apache.org/sdf-via-source>.
>>>>>
>>>>> TL;DR Read a collection of filepatterns arriving on PubSub via
>>>>> files.apply(TextIO.readEach()). Tail a filepattern via
>>>>> TextIO.read().watchForNewFiles().watchFilesForNewEntries(). Coming to a
>>>>> Beam SDK near you in small pieces.
>>>>>
>>>>> I think I'm gonna start working on the first steps of the proposed plan,
>>>>> in parallel with this discussion, because I'm excited :)
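
For anyone skimming the thread: the "tail a filepattern" behavior in the quoted TL;DR (watch a pattern for new files, and watch each file for new entries) can be pictured as a simple polling loop. Below is a minimal sketch in plain Python, standard library only. It does not use Beam or Splittable DoFn and is not the proposed TextIO implementation; the names `poll_new_files`, `poll_new_entries` and `tail_once` are hypothetical, chosen only for illustration.

```python
import glob
import os
import tempfile


def poll_new_files(pattern, seen):
    """Return files matching `pattern` that earlier polls have not seen.

    `seen` is a set owned by the caller and updated in place.
    """
    new_files = sorted(set(glob.glob(pattern)) - seen)
    seen.update(new_files)
    return new_files


def poll_new_entries(path, offsets):
    """Return complete lines appended to `path` since the previous poll.

    `offsets` maps each path to the byte offset already consumed.
    """
    start = offsets.get(path, 0)
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read()
    # Consume only up to the last newline, so a half-written line
    # is left in place and picked up by a later poll.
    end = data.rfind(b"\n") + 1
    offsets[path] = start + end
    return data[:end].decode("utf-8").split("\n")[:-1]


def tail_once(pattern, seen, offsets):
    """One polling round: discover new files, then read new lines from all known files."""
    poll_new_files(pattern, seen)
    entries = []
    for path in sorted(seen):
        entries.extend((path, line) for line in poll_new_entries(path, offsets))
    return entries
```

Calling `tail_once(pattern, seen, offsets)` repeatedly with the same `seen` set and `offsets` dict yields only what appeared since the previous call. Keeping that state explicit is a loose analogy to the per-element progress an SDF-based reader would need to track; the real design is in the linked doc.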