Thanks all. The first PR is out for review:
https://github.com/apache/beam/pull/3443

The next piece of work (watching for new files) is in progress, based on
https://github.com/apache/beam/pull/3360

On Tue, Jun 27, 2017 at 11:22 AM Kenneth Knowles <k...@google.com.invalid> wrote:

> +1
>
> This is a really nice doc and plan.
>
> On Tue, Jun 27, 2017 at 1:49 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
>
> > +1
> >
> > This sounds very good and there is a clear implementation path!
> >
> > > On 24. Jun 2017, at 20:55, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > >
> > > Fair enough ;)
> > >
> > > Let me review the different JIRAs and provide some feedback.
> > >
> > > Regards
> > > JB
> > >
> > > On Jun 24, 2017, 20:54, at 20:54, Eugene Kirpichov <kirpic...@google.com.INVALID> wrote:
> > >> Hi JB,
> > >> I haven't yet thought about how this work can be parallelized. For now
> > >> I'd like to just get feedback on the approach :)
> > >> But glad that you're willing to help out - let's discuss this too a bit
> > >> later!
> > >>
> > >> On Sat, Jun 24, 2017 at 11:51 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > >>
> > >>> Thanks Eugene
> > >>>
> > >>> I will pick up some.
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On Jun 24, 2017, 20:00, at 20:00, Eugene Kirpichov <kirpic...@google.com.INVALID> wrote:
> > >>>> Filed JIRAs for the proposed features and linked with the doc:
> > >>>> https://issues.apache.org/jira/browse/BEAM-2511 TextIO should support
> > >>>> reading a PCollection of filenames
> > >>>> https://issues.apache.org/jira/browse/BEAM-2512 TextIO should support
> > >>>> watching for new files
> > >>>> https://issues.apache.org/jira/browse/BEAM-2513 TextIO should support
> > >>>> watching files for new entries
> > >>>>
> > >>>> On Fri, Jun 23, 2017 at 4:32 PM Eugene Kirpichov <kirpic...@google.com> wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I've written up a proposal for incrementally delivering a bunch of
> > >>>>> useful new features in TextIO based on Splittable DoFn. It's
> > >>>>> applicable to other file-based connectors, TextIO is just one good
> > >>>>> example. Let me know what you think!
> > >>>>>
> > >>>>> https://s.apache.org/textio-sdf
> > >>>>>
> > >>>>> Copy of abstract:
> > >>>>>
> > >>>>> Users have often expressed interest in several new features for
> > >>>>> reading files - in particular, incremental reading of log files
> > >>>>> (streaming of new files matching a pattern and new entries in each
> > >>>>> file) and reading a PCollection of filenames (in particular, an
> > >>>>> unbounded collection arriving from a stream such as PubSub or Kafka).
> > >>>>>
> > >>>>> Splittable DoFn <http://s.apache.org/splittable-do-fn> (SDF) enables
> > >>>>> these features. This document proposes an API for them, using the
> > >>>>> example of TextIO, and proposes a plan for delivering them subject to
> > >>>>> availability of SDF in different runners. Some availability
> > >>>>> constraints are circumvented by Running Splittable DoFn via Source API
> > >>>>> <http://s.apache.org/sdf-via-source>.
> > >>>>>
> > >>>>> TL;DR Read a collection of filepatterns arriving on PubSub via
> > >>>>> files.apply(TextIO.readEach()). Tail a filepattern via
> > >>>>> TextIO.read().watchForNewFiles().watchFilesForNewEntries(). Coming to
> > >>>>> a Beam SDK near you in small pieces.
> > >>>>>
> > >>>>> I think I'm gonna start working on the first steps of the proposed
> > >>>>> plan, in parallel with this discussion, because I'm excited :)