Fair enough ;) Let me review the different Jira and provide some feedback.
Regards JB On Jun 24, 2017, 20:54, at 20:54, Eugene Kirpichov <kirpic...@google.com.INVALID> wrote: >Hi JB, >I haven't yet thought about how this work can be parallelized. For now >I'd >like to just get feedback on the approach :) >But glad that you're willing to help out - let's discuss this too a bit >later! > >On Sat, Jun 24, 2017 at 11:51 AM Jean-Baptiste Onofré <j...@nanthrax.net> >wrote: > >> Thanks Eugene >> >> I will pick up some. >> >> Regards >> JB >> >> On Jun 24, 2017, 20:00, at 20:00, Eugene Kirpichov >> <kirpic...@google.com.INVALID> wrote: >> >Filed JIRAs for the proposed features and linked with the doc: >> >https://issues.apache.org/jira/browse/BEAM-2511 TextIO should >support >> >reading a PCollection of filenames >> >https://issues.apache.org/jira/browse/BEAM-2512 TextIO should >support >> >watching for new files >> >https://issues.apache.org/jira/browse/BEAM-2513 TextIO should >support >> >watching files for new entries >> > >> >On Fri, Jun 23, 2017 at 4:32 PM Eugene Kirpichov ><kirpic...@google.com> >> >wrote: >> > >> >> Hi all, >> >> >> >> I've written up a proposal for incrementally delivering a bunch of >> >useful >> >> new features in TextIO based on Splittable DoFn. It's applicable >to >> >other >> >> file-based connectors, TextIO is just one good example. Let me >know >> >what >> >> you think! >> >> >> >> https://s.apache.org/textio-sdf >> >> >> >> Copy of abstract: >> >> >> >> Users have often expressed interest in several new features for >> >reading >> >> files - in particular, incremental reading of log files (streaming >of >> >new >> >> files matching a pattern and new entries in each file) and reading >a >> >> PCollection of filenames (in particular, an unbounded collection >> >arriving >> >> from a stream such as PubSub or Kafka). >> >> >> >> Splittable DoFn <http://s.apache.org/splittable-do-fn> (SDF) >enables >> >> these features. This document proposes an API for them, using the >> >example >> >> of TextIO, and proposes and a plan for delivering them subject to >> >> availability of SDF in different runners. Some availability >> >constraints are >> >> circumvented by Running Splittable DoFn via Source API >> >> <http://s.apache.org/sdf-via-source>. >> >> >> >> TL;DR Read a collection of filepatterns arriving on PubSub via >> >> files.apply(TextIO.readEach()). Tail a filepattern via >> >> TextIO.read().watchForNewFiles().watchFilesForNewEntries(). Coming >to >> >a >> >> Beam SDK near you in small pieces. >> >> >> >> I think I'm gonna start working on the first steps of the proposed >> >plan, >> >> in parallel with this discussion, because I'm excited :) >> >> >>