Hi JB,
I haven't yet thought about how this work can be parallelized. For now I'd
like to just get feedback on the approach :)
But glad that you're willing to help out - let's discuss this too a bit
later!

On Sat, Jun 24, 2017 at 11:51 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Thanks Eugene
>
> I will pick up some.
>
> Regards
> JB
>
> On Jun 24, 2017, 20:00, at 20:00, Eugene Kirpichov
> <kirpic...@google.com.INVALID> wrote:
> >Filed JIRAs for the proposed features and linked with the doc:
> >https://issues.apache.org/jira/browse/BEAM-2511 TextIO should support
> >reading a PCollection of filenames
> >https://issues.apache.org/jira/browse/BEAM-2512 TextIO should support
> >watching for new files
> >https://issues.apache.org/jira/browse/BEAM-2513 TextIO should support
> >watching files for new entries
> >
> >On Fri, Jun 23, 2017 at 4:32 PM Eugene Kirpichov <kirpic...@google.com>
> >wrote:
> >
> >> Hi all,
> >>
> >> I've written up a proposal for incrementally delivering a bunch of
> >useful
> >> new features in TextIO based on Splittable DoFn. It's applicable to
> >other
> >> file-based connectors, TextIO is just one good example. Let me know
> >what
> >> you think!
> >>
> >> https://s.apache.org/textio-sdf
> >>
> >> Copy of abstract:
> >>
> >> Users have often expressed interest in several new features for
> >reading
> >> files - in particular, incremental reading of log files (streaming of
> >new
> >> files matching a pattern and new entries in each file) and reading a
> >> PCollection of filenames (in particular, an unbounded collection
> >arriving
> >> from a stream such as PubSub or Kafka).
> >>
> >> Splittable DoFn <http://s.apache.org/splittable-do-fn> (SDF) enables
> >> these features. This document proposes an API for them, using the
> >example
> >> of TextIO, and proposes and a plan for delivering them subject to
> >> availability of SDF in different runners. Some availability
> >constraints are
> >> circumvented by Running Splittable DoFn via Source API
> >> <http://s.apache.org/sdf-via-source>.
> >>
> >> TL;DR Read a collection of filepatterns arriving on PubSub via
> >> files.apply(TextIO.readEach()). Tail a filepattern via
> >> TextIO.read().watchForNewFiles().watchFilesForNewEntries(). Coming to
> >a
> >> Beam SDK near you in small pieces.
> >>
> >> I think I'm gonna start working on the first steps of the proposed
> >plan,
> >> in parallel with this discussion, because I'm excited :)
> >>
>

Reply via email to