In the current version, the transform is intended to watch a set that is
continuously growing; do you mean a GCS bucket that eventually contains
more files than can fit in a state tag?

I agree that this will eventually become an issue; I can see a few
possible solutions:
- I suspect many such sets are highly compressible, so we can use a coder
that compresses things and get some headroom (rough sketch after this
list).
- When an element disappears from a set, we can remove it from the state
(without emitting anything into the transform's output - just for GC
purposes). Of course this assumes that elements actually disappear from the
set (e.g. get removed from the GCS bucket).
- There might be a way to shard the set using a GBK. I'm not quite sure
what this would look like in the transform, in particular what the
termination condition would be - polling would need to happen before the
GBK, but termination conditions such as "no new elements observed" depend
on information in shards downstream of the GBK.
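
To illustrate the first idea, here is a rough sketch of what such a
compressing coder could look like. To be clear about assumptions:
DeflatingCoder is a made-up name (Beam doesn't ship anything like it, as
far as I know); the only real APIs used are Beam's CustomCoder/Coder and
java.util.zip.

    // Hypothetical coder that deflate-compresses whatever the inner coder
    // writes. Safe for a single whole value such as a state cell, where
    // nothing follows the compressed payload in the stream (the inflater
    // may buffer past the end of the compressed data).
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.DeflaterOutputStream;
    import java.util.zip.InflaterInputStream;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.CustomCoder;

    public class DeflatingCoder<T> extends CustomCoder<T> {
      private final Coder<T> inner;

      public DeflatingCoder(Coder<T> inner) {
        this.inner = inner;
      }

      @Override
      public void encode(T value, OutputStream out) throws IOException {
        DeflaterOutputStream deflating = new DeflaterOutputStream(out);
        inner.encode(value, deflating);
        // Flush the deflate trailer without closing the underlying stream.
        deflating.finish();
      }

      @Override
      public T decode(InputStream in) throws IOException {
        return inner.decode(new InflaterInputStream(in));
      }
    }

How much headroom this buys depends on how repetitive the set's encoding
is - filenames under a common prefix should compress quite well.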

On Tue, Jul 11, 2017 at 9:23 PM Reuven Lax <re...@google.com.invalid> wrote:

> BTW - I am worried about SDF storing everything in a single tag for watch.
> The problem is that a streaming pipeline can run "forever." So a pipeline
> watching a GCS bucket "forever" will eventually crash due to the value
> getting too large. Is there any reasonable way to garbage collect this
> state?
>
> On Tue, Jul 11, 2017 at 9:08 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > First PR has been submitted - enjoy TextIO.readAll() which reads a
> > PCollection of filenames!
> > I've started working on the SDF-based Watch transform
> > http://s.apache.org/beam-watch-transform, and after that will be able to
> > implement the incremental features in TextIO.
> >
> > On Tue, Jun 27, 2017 at 1:55 PM Eugene Kirpichov <kirpic...@google.com>
> > wrote:
> >
> > > Thanks all. The first PR is out for review:
> > > https://github.com/apache/beam/pull/3443
> > > Next work (watching for new files) is in progress, based on
> > > https://github.com/apache/beam/pull/3360
> > >
> > > On Tue, Jun 27, 2017 at 11:22 AM Kenneth Knowles
> > > <k...@google.com.invalid> wrote:
> > >
> > >> +1
> > >>
> > >> This is a really nice doc and plan.
> > >>
> > >> On Tue, Jun 27, 2017 at 1:49 AM, Aljoscha Krettek
> > >> <aljos...@apache.org> wrote:
> > >>
> > >> > +1
> > >> >
> > >> > This sounds very good and there is a clear implementation path!
> > >> >
> > >> > > On 24. Jun 2017, at 20:55, Jean-Baptiste Onofré <j...@nanthrax.net>
> > >> > > wrote:
> > >> > >
> > >> > > Fair enough ;)
> > >> > >
> > >> > > Let me review the different JIRAs and provide some feedback.
> > >> > >
> > >> > > Regards
> > >> > > JB
> > >> > >
> > >> > > On Jun 24, 2017, at 20:54, Eugene Kirpichov
> > >> > > <kirpic...@google.com.INVALID> wrote:
> > >> > >> Hi JB,
> > >> > >> I haven't yet thought about how this work can be parallelized.
> > >> > >> For now I'd like to just get feedback on the approach :)
> > >> > >> But glad that you're willing to help out - let's discuss this too
> > >> > >> a bit later!
> > >> > >>
> > >> > >> On Sat, Jun 24, 2017 at 11:51 AM Jean-Baptiste Onofré
> > >> > >> <j...@nanthrax.net> wrote:
> > >> > >>
> > >> > >>> Thanks Eugene
> > >> > >>>
> > >> > >>> I will pick up some.
> > >> > >>>
> > >> > >>> Regards
> > >> > >>> JB
> > >> > >>>
> > >> > >>> On Jun 24, 2017, at 20:00, Eugene Kirpichov
> > >> > >>> <kirpic...@google.com.INVALID> wrote:
> > >> > >>>> Filed JIRAs for the proposed features and linked with the doc:
> > >> > >>>> https://issues.apache.org/jira/browse/BEAM-2511 TextIO should
> > >> > >>>> support reading a PCollection of filenames
> > >> > >>>> https://issues.apache.org/jira/browse/BEAM-2512 TextIO should
> > >> > >>>> support watching for new files
> > >> > >>>> https://issues.apache.org/jira/browse/BEAM-2513 TextIO should
> > >> > >>>> support watching files for new entries
> > >> > >>>>
> > >> > >>>> On Fri, Jun 23, 2017 at 4:32 PM Eugene Kirpichov
> > >> > >>>> <kirpic...@google.com> wrote:
> > >> > >>>>
> > >> > >>>>> Hi all,
> > >> > >>>>>
> > >> > >>>>> I've written up a proposal for incrementally delivering a bunch
> > >> > >>>>> of useful new features in TextIO based on Splittable DoFn. It's
> > >> > >>>>> applicable to other file-based connectors, TextIO is just one
> > >> > >>>>> good example. Let me know what you think!
> > >> > >>>>>
> > >> > >>>>> https://s.apache.org/textio-sdf
> > >> > >>>>>
> > >> > >>>>> Copy of abstract:
> > >> > >>>>>
> > >> > >>>>> Users have often expressed interest in several new features for
> > >> > >>>>> reading files - in particular, incremental reading of log files
> > >> > >>>>> (streaming of new files matching a pattern and new entries in
> > >> > >>>>> each file) and reading a PCollection of filenames (in
> > >> > >>>>> particular, an unbounded collection arriving from a stream such
> > >> > >>>>> as PubSub or Kafka).
> > >> > >>>>>
> > >> > >>>>> Splittable DoFn <http://s.apache.org/splittable-do-fn> (SDF)
> > >> > >>>>> enables these features. This document proposes an API for them,
> > >> > >>>>> using the example of TextIO, and proposes a plan for delivering
> > >> > >>>>> them subject to availability of SDF in different runners. Some
> > >> > >>>>> availability constraints are circumvented by Running Splittable
> > >> > >>>>> DoFn via Source API <http://s.apache.org/sdf-via-source>.
> > >> > >>>>>
> > >> > >>>>> TL;DR Read a collection of filepatterns arriving on PubSub via
> > >> > >>>>> files.apply(TextIO.readEach()). Tail a filepattern via
> > >> > >>>>> TextIO.read().watchForNewFiles().watchFilesForNewEntries().
> > >> > >>>>> Coming to a Beam SDK near you in small pieces.
> > >> > >>>>>
> > >> > >>>>> I think I'm gonna start working on the first steps of the
> > >> > >>>>> proposed plan, in parallel with this discussion, because I'm
> > >> > >>>>> excited :)
> > >> > >>>>>
> > >> > >>>
> > >> >
> > >> >
> > >>
> > >
> >
>
