Hi all,

I've written up a proposal for incrementally delivering a bunch of useful
new features in TextIO based on Splittable DoFn. It applies to other
file-based connectors too; TextIO is just one good example. Let me know what
you think!

https://s.apache.org/textio-sdf

Copy of abstract:

Users have often expressed interest in several new features for reading
files - in particular, incremental reading of log files (streaming of new
files matching a pattern and new entries in each file) and reading a
PCollection of filenames (in particular, an unbounded collection arriving
from a stream such as PubSub or Kafka).

Splittable DoFn <http://s.apache.org/splittable-do-fn> (SDF) enables these
features. This document proposes an API for them, using the example of
TextIO, and a plan for delivering them subject to availability
of SDF in different runners. Some availability constraints are circumvented
by Running Splittable DoFn via Source API
<http://s.apache.org/sdf-via-source>.

TL;DR Read a collection of filepatterns arriving on PubSub via
files.apply(TextIO.readEach()). Tail a filepattern via
TextIO.read().watchForNewFiles().watchFilesForNewEntries(). Coming to a
Beam SDK near you in small pieces.
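
To make "tailing a filepattern" concrete, here is a minimal sketch in plain
Python of the semantics described above: each poll picks up both newly matched
files and new entries appended to files already seen. It is not Beam code and
not the proposed implementation; the function name poll_new_entries and the
offsets dictionary are hypothetical, chosen only for this illustration.

```python
import glob


def poll_new_entries(pattern, offsets):
    """Return (path, line) pairs appended to files matching `pattern`
    since the previous poll.

    `offsets` (hypothetical bookkeeping for this sketch) maps each path to
    the number of bytes already consumed. It is updated in place, so a file
    that matches the pattern for the first time is read from offset 0.
    """
    new_entries = []
    for path in sorted(glob.glob(pattern)):
        start = offsets.get(path, 0)
        with open(path, "rb") as f:
            f.seek(start)
            data = f.read()
        # Consume only complete lines; a partial trailing line is left
        # for a later poll, once its newline has been written.
        end = data.rfind(b"\n") + 1
        for line in data[:end].splitlines():
            new_entries.append((path, line.decode("utf-8")))
        offsets[path] = start + end
    return new_entries
```

Calling poll_new_entries repeatedly with the same offsets dictionary yields
each complete line once. A real connector would also have to handle things
this sketch ignores, such as files being truncated or deleted.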

I think I'm gonna start working on the first steps of the proposed plan, in
parallel with this discussion, because I'm excited :)
