BTW - I am worried about SDF storing everything in a single tag for watch.
The problem is that streaming pipeline can run "forever." So someone
watching a GCS bucket "forever" will eventually crash due to the value
getting too large. Is there any reasonable way to garbage collect this
state?

On Tue, Jul 11, 2017 at 9:08 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> First PR has been submitted - enjoy TextIO.readAll() which reads a
> PCollection of filenames!
> I've started working on the SDF-based Watch transform
> http://s.apache.org/beam-watch-transform, and after that will be able to
> implement the incremental features in TextIO.
>
> On Tue, Jun 27, 2017 at 1:55 PM Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
> > Thanks all. The first PR is out for review:
> > https://github.com/apache/beam/pull/3443
> > Next work (watching for new files) is in progress, based on
> > https://github.com/apache/beam/pull/3360
> >
> > On Tue, Jun 27, 2017 at 11:22 AM Kenneth Knowles <k...@google.com.invalid
> >
> > wrote:
> >
> >> +1
> >>
> >> This is a really nice doc and plan.
> >>
> >> On Tue, Jun 27, 2017 at 1:49 AM, Aljoscha Krettek <aljos...@apache.org>
> >> wrote:
> >>
> >> > +1
> >> >
> >> > This sounds very good and there is a clear implementation path!
> >> >
> >> > > On 24. Jun 2017, at 20:55, Jean-Baptiste Onofré <j...@nanthrax.net>
> >> wrote:
> >> > >
> >> > > Fair enough ;)
> >> > >
> >> > > Let me review the different Jira and provide some feedback.
> >> > >
> >> > > Regards
> >> > > JB
> >> > >
> >> > > On Jun 24, 2017, 20:54, at 20:54, Eugene Kirpichov
> >> > <kirpic...@google.com.INVALID> wrote:
> >> > >> Hi JB,
> >> > >> I haven't yet thought about how this work can be parallelized. For
> >> now
> >> > >> I'd
> >> > >> like to just get feedback on the approach :)
> >> > >> But glad that you're willing to help out - let's discuss this too a
> >> bit
> >> > >> later!
> >> > >>
> >> > >> On Sat, Jun 24, 2017 at 11:51 AM Jean-Baptiste Onofré <
> >> j...@nanthrax.net>
> >> > >> wrote:
> >> > >>
> >> > >>> Thanks Eugene
> >> > >>>
> >> > >>> I will pick up some.
> >> > >>>
> >> > >>> Regards
> >> > >>> JB
> >> > >>>
> >> > >>> On Jun 24, 2017, 20:00, at 20:00, Eugene Kirpichov
> >> > >>> <kirpic...@google.com.INVALID> wrote:
> >> > >>>> Filed JIRAs for the proposed features and linked with the doc:
> >> > >>>> https://issues.apache.org/jira/browse/BEAM-2511 TextIO should
> >> > >> support
> >> > >>>> reading a PCollection of filenames
> >> > >>>> https://issues.apache.org/jira/browse/BEAM-2512 TextIO should
> >> > >> support
> >> > >>>> watching for new files
> >> > >>>> https://issues.apache.org/jira/browse/BEAM-2513 TextIO should
> >> > >> support
> >> > >>>> watching files for new entries
> >> > >>>>
> >> > >>>> On Fri, Jun 23, 2017 at 4:32 PM Eugene Kirpichov
> >> > >> <kirpic...@google.com>
> >> > >>>> wrote:
> >> > >>>>
> >> > >>>>> Hi all,
> >> > >>>>>
> >> > >>>>> I've written up a proposal for incrementally delivering a bunch
> of
> >> > >>>> useful
> >> > >>>>> new features in TextIO based on Splittable DoFn. It's applicable
> >> > >> to
> >> > >>>> other
> >> > >>>>> file-based connectors, TextIO is just one good example. Let me
> >> > >> know
> >> > >>>> what
> >> > >>>>> you think!
> >> > >>>>>
> >> > >>>>> https://s.apache.org/textio-sdf
> >> > >>>>>
> >> > >>>>> Copy of abstract:
> >> > >>>>>
> >> > >>>>> Users have often expressed interest in several new features for
> >> > >>>> reading
> >> > >>>>> files - in particular, incremental reading of log files
> (streaming
> >> > >> of
> >> > >>>> new
> >> > >>>>> files matching a pattern and new entries in each file) and
> reading
> >> > >> a
> >> > >>>>> PCollection of filenames (in particular, an unbounded collection
> >> > >>>> arriving
> >> > >>>>> from a stream such as PubSub or Kafka).
> >> > >>>>>
> >> > >>>>> Splittable DoFn <http://s.apache.org/splittable-do-fn> (SDF)
> >> > >> enables
> >> > >>>>> these features. This document proposes an API for them, using
> the
> >> > >>>> example
> >> > >>>>> of TextIO, and proposes and a plan for delivering them subject
> to
> >> > >>>>> availability of SDF in different runners. Some availability
> >> > >>>> constraints are
> >> > >>>>> circumvented by Running Splittable DoFn via Source API
> >> > >>>>> <http://s.apache.org/sdf-via-source>.
> >> > >>>>>
> >> > >>>>> TL;DR Read a collection of filepatterns arriving on PubSub via
> >> > >>>>> files.apply(TextIO.readEach()). Tail a filepattern via
> >> > >>>>> TextIO.read().watchForNewFiles().watchFilesForNewEntries().
> Coming
> >> > >> to
> >> > >>>> a
> >> > >>>>> Beam SDK near you in small pieces.
> >> > >>>>>
> >> > >>>>> I think I'm gonna start working on the first steps of the
> proposed
> >> > >>>> plan,
> >> > >>>>> in parallel with this discussion, because I'm excited :)
> >> > >>>>>
> >> > >>>
> >> >
> >> >
> >>
> >
>

Reply via email to