BTW - I am worried about SDF storing everything in a single tag for watch. The problem is that streaming pipeline can run "forever." So someone watching a GCS bucket "forever" will eventually crash due to the value getting too large. Is there any reasonable way to garbage collect this state?
On Tue, Jul 11, 2017 at 9:08 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > First PR has been submitted - enjoy TextIO.readAll() which reads a > PCollection of filenames! > I've started working on the SDF-based Watch transform > http://s.apache.org/beam-watch-transform, and after that will be able to > implement the incremental features in TextIO. > > On Tue, Jun 27, 2017 at 1:55 PM Eugene Kirpichov <kirpic...@google.com> > wrote: > > > Thanks all. The first PR is out for review: > > https://github.com/apache/beam/pull/3443 > > Next work (watching for new files) is in progress, based on > > https://github.com/apache/beam/pull/3360 > > > > On Tue, Jun 27, 2017 at 11:22 AM Kenneth Knowles <k...@google.com.invalid > > > > wrote: > > > >> +1 > >> > >> This is a really nice doc and plan. > >> > >> On Tue, Jun 27, 2017 at 1:49 AM, Aljoscha Krettek <aljos...@apache.org> > >> wrote: > >> > >> > +1 > >> > > >> > This sounds very good and there is a clear implementation path! > >> > > >> > > On 24. Jun 2017, at 20:55, Jean-Baptiste Onofré <j...@nanthrax.net> > >> wrote: > >> > > > >> > > Fair enough ;) > >> > > > >> > > Let me review the different Jira and provide some feedback. > >> > > > >> > > Regards > >> > > JB > >> > > > >> > > On Jun 24, 2017, 20:54, at 20:54, Eugene Kirpichov > >> > <kirpic...@google.com.INVALID> wrote: > >> > >> Hi JB, > >> > >> I haven't yet thought about how this work can be parallelized. For > >> now > >> > >> I'd > >> > >> like to just get feedback on the approach :) > >> > >> But glad that you're willing to help out - let's discuss this too a > >> bit > >> > >> later! > >> > >> > >> > >> On Sat, Jun 24, 2017 at 11:51 AM Jean-Baptiste Onofré < > >> j...@nanthrax.net> > >> > >> wrote: > >> > >> > >> > >>> Thanks Eugene > >> > >>> > >> > >>> I will pick up some. > >> > >>> > >> > >>> Regards > >> > >>> JB > >> > >>> > >> > >>> On Jun 24, 2017, 20:00, at 20:00, Eugene Kirpichov > >> > >>> <kirpic...@google.com.INVALID> wrote: > >> > >>>> Filed JIRAs for the proposed features and linked with the doc: > >> > >>>> https://issues.apache.org/jira/browse/BEAM-2511 TextIO should > >> > >> support > >> > >>>> reading a PCollection of filenames > >> > >>>> https://issues.apache.org/jira/browse/BEAM-2512 TextIO should > >> > >> support > >> > >>>> watching for new files > >> > >>>> https://issues.apache.org/jira/browse/BEAM-2513 TextIO should > >> > >> support > >> > >>>> watching files for new entries > >> > >>>> > >> > >>>> On Fri, Jun 23, 2017 at 4:32 PM Eugene Kirpichov > >> > >> <kirpic...@google.com> > >> > >>>> wrote: > >> > >>>> > >> > >>>>> Hi all, > >> > >>>>> > >> > >>>>> I've written up a proposal for incrementally delivering a bunch > of > >> > >>>> useful > >> > >>>>> new features in TextIO based on Splittable DoFn. It's applicable > >> > >> to > >> > >>>> other > >> > >>>>> file-based connectors, TextIO is just one good example. Let me > >> > >> know > >> > >>>> what > >> > >>>>> you think! > >> > >>>>> > >> > >>>>> https://s.apache.org/textio-sdf > >> > >>>>> > >> > >>>>> Copy of abstract: > >> > >>>>> > >> > >>>>> Users have often expressed interest in several new features for > >> > >>>> reading > >> > >>>>> files - in particular, incremental reading of log files > (streaming > >> > >> of > >> > >>>> new > >> > >>>>> files matching a pattern and new entries in each file) and > reading > >> > >> a > >> > >>>>> PCollection of filenames (in particular, an unbounded collection > >> > >>>> arriving > >> > >>>>> from a stream such as PubSub or Kafka). > >> > >>>>> > >> > >>>>> Splittable DoFn <http://s.apache.org/splittable-do-fn> (SDF) > >> > >> enables > >> > >>>>> these features. This document proposes an API for them, using > the > >> > >>>> example > >> > >>>>> of TextIO, and proposes and a plan for delivering them subject > to > >> > >>>>> availability of SDF in different runners. Some availability > >> > >>>> constraints are > >> > >>>>> circumvented by Running Splittable DoFn via Source API > >> > >>>>> <http://s.apache.org/sdf-via-source>. > >> > >>>>> > >> > >>>>> TL;DR Read a collection of filepatterns arriving on PubSub via > >> > >>>>> files.apply(TextIO.readEach()). Tail a filepattern via > >> > >>>>> TextIO.read().watchForNewFiles().watchFilesForNewEntries(). > Coming > >> > >> to > >> > >>>> a > >> > >>>>> Beam SDK near you in small pieces. > >> > >>>>> > >> > >>>>> I think I'm gonna start working on the first steps of the > proposed > >> > >>>> plan, > >> > >>>>> in parallel with this discussion, because I'm excited :) > >> > >>>>> > >> > >>> > >> > > >> > > >> > > >