Hi! Glad the refactoring is happening, thanks! It was auto-assigned to Reuven as the formal owner of the component. I reassigned it to you.

On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sberyoz...@gmail.com> wrote:

> Hi
>
> I started looking at
> https://issues.apache.org/jira/browse/BEAM-2994
>
> and pushed some initial code to my tikaio branch introducing ParseResult
> and updating the tests but keeping the BoundedSource/Reader, dropping
> the asynchronous parsing code, and a few other bits.
>
> Just noticed it is assigned to Reuven - does it mean Reuven is looking
> into it too or was it auto-assigned ?
>
> I don't mind, would it make sense for me to do an 'interim' PR on
> what I've done so far before completely removing BoundedSource/Reader
> based code ?
>

Yes :)

> I have another question anyway,
>
> > E.g. TikaIO could:
> > - take as input a PCollection<ReadableFile>
> > - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
> >   is a class with properties { String content, Metadata metadata }
> > - be configured by: a Parser (it implements Serializable so can be
> >   specified at pipeline construction time) and a ContentHandler whose
> >   toString() will go into "content". ContentHandler does not implement
> >   Serializable, so you can not specify it at construction time - however,
> >   you can let the user specify either its class (if it's a simple handler
> >   like a BodyContentHandler) or specify a lambda for creating the handler
> >   (SerializableFunction<Void, ContentHandler>), and potentially you can
> >   have a simpler facade for Tika.parseAsString() - e.g. call it
> >   TikaIO.parseAllAsStrings().
> >
> > Example usage would look like:
> >
> >   PCollection<KV<String, ParseResult>> parseResults =
> >       p.apply(FileIO.match().filepattern(...))
> >        .apply(FileIO.readMatches())
> >        .apply(TikaIO.parseAllAsStrings())
> >
> > or:
> >
> >        .apply(TikaIO.parseAll()
> >            .withParser(new AutoDetectParser())
> >            .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))
> >
> > You could also have shorthands for letting the user avoid using FileIO
> > directly in simple cases, for example:
> >
> >   p.apply(TikaIO.parseAsStrings().from(filepattern))
> >
> > This would of course be implemented as a ParDo or even MapElements, and
> > you'll be able to share the code between parseAll and regular parse.
>
> I'd like to understand how to do
>
> TikaIO.parse().from(filepattern)
>
> Right now I have TikaIO.Read extending
> PTransform<PBegin, PCollection<ParseResult>>
>
> and then the boilerplate code which builds Read when I do something like
>
> TikaIO.read().from(filepattern).
>
> What is the convention for supporting something like
> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see
> some example ?
>

There are a number of IOs that don't use Source - e.g. DatastoreIO and JdbcIO. TextIO.readMatches() might be an even better transform to mimic. Note that in TikaIO you probably won't need a fusion break after the ParDo since there's 1 result per input file.

> Many thanks, Sergey
>
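
To illustrate the shape of the convention discussed above (a from(filepattern) shorthand that only holds configuration and expands into "match files, then run the per-file step", with no Source involved), here is a minimal plain-JDK sketch. It is deliberately not Beam or Tika code: ParseSketch, ParseAll, Parse, from(dir, glob) and the expand() methods here are hypothetical names for illustration, a List<Path> stands in for PCollection<ReadableFile>, the glob matching stands in for FileIO.match() plus FileIO.readMatches(), and reading the file as UTF-8 text stands in for the actual parser. In a real Beam transform the same delegation would presumably live in the expand() method of a PTransform<PBegin, ...>, with the per-file step as a ParDo.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Plain-JDK model (hypothetical names) of "the shorthand expands into the per-file step". */
final class ParseSketch {
  private ParseSketch() {}

  static ParseAll parseAll() { return new ParseAll(); }

  static Parse parse() { return new Parse(null, null); }

  /** Per-file step: the analogue of a ParDo that emits exactly one result per input file. */
  static final class ParseAll {
    List<Map.Entry<String, String>> expand(List<Path> files) {
      List<Map.Entry<String, String>> results = new ArrayList<>();
      for (Path file : files) {
        try {
          // Stand-in for the real parser: read the whole file as UTF-8 text.
          String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
          results.add(new SimpleImmutableEntry<>(file.toString(), content));
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      }
      return results;
    }
  }

  /** Shorthand: stores only configuration and delegates all parsing to ParseAll. */
  static final class Parse {
    private final Path dir;
    private final String glob;

    private Parse(Path dir, String glob) {
      this.dir = dir;
      this.glob = glob;
    }

    /** Returns a new configured instance instead of mutating this one. */
    Parse from(Path dir, String glob) { return new Parse(dir, glob); }

    List<Map.Entry<String, String>> expand() {
      if (dir == null || glob == null) {
        throw new IllegalStateException("from(dir, glob) must be called first");
      }
      List<Path> matched = new ArrayList<>();
      // Stand-in for the file-matching step: list directory entries that match the glob.
      try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
        for (Path p : stream) {
          matched.add(p);
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      Collections.sort(matched); // deterministic order for the demo
      // The shorthand contains no parsing logic of its own.
      return parseAll().expand(matched);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("parse-sketch");
    Files.write(dir.resolve("a.txt"), "hello".getBytes(StandardCharsets.UTF_8));
    for (Map.Entry<String, String> kv : parse().from(dir, "*.txt").expand()) {
      System.out.println(kv.getKey() + " -> " + kv.getValue());
    }
  }
}
```

The point of the split is that the per-file logic exists once, in ParseAll, so the shorthand and the "all files" variant cannot drift apart, and each input file yields exactly one output element.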