Hi JB, I haven't yet thought about how this work can be parallelized. For now I'd like to just get feedback on the approach :) But glad that you're willing to help out - let's discuss this too a bit later!

On Sat, Jun 24, 2017 at 11:51 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Thanks Eugene
>
> I will pick up some.
>
> Regards
> JB
>
> On Jun 24, 2017, 20:00, at 20:00, Eugene Kirpichov <kirpic...@google.com.INVALID> wrote:
> >Filed JIRAs for the proposed features and linked with the doc:
> >https://issues.apache.org/jira/browse/BEAM-2511 TextIO should support reading a PCollection of filenames
> >https://issues.apache.org/jira/browse/BEAM-2512 TextIO should support watching for new files
> >https://issues.apache.org/jira/browse/BEAM-2513 TextIO should support watching files for new entries
> >
> >On Fri, Jun 23, 2017 at 4:32 PM Eugene Kirpichov <kirpic...@google.com> wrote:
> >
> >> Hi all,
> >>
> >> I've written up a proposal for incrementally delivering a bunch of useful new features in TextIO based on Splittable DoFn. It's applicable to other file-based connectors, TextIO is just one good example. Let me know what you think!
> >>
> >> https://s.apache.org/textio-sdf
> >>
> >> Copy of abstract:
> >>
> >> Users have often expressed interest in several new features for reading files - in particular, incremental reading of log files (streaming of new files matching a pattern and new entries in each file) and reading a PCollection of filenames (in particular, an unbounded collection arriving from a stream such as PubSub or Kafka).
> >>
> >> Splittable DoFn <http://s.apache.org/splittable-do-fn> (SDF) enables these features. This document proposes an API for them, using the example of TextIO, and proposes a plan for delivering them subject to availability of SDF in different runners. Some availability constraints are circumvented by Running Splittable DoFn via Source API <http://s.apache.org/sdf-via-source>.
> >>
> >> TL;DR Read a collection of filepatterns arriving on PubSub via files.apply(TextIO.readEach()). Tail a filepattern via TextIO.read().watchForNewFiles().watchFilesForNewEntries(). Coming to a Beam SDK near you in small pieces.
> >>
> >> I think I'm gonna start working on the first steps of the proposed plan, in parallel with this discussion, because I'm excited :)