Hi! Glad the refactoring is happening, thanks!
It was auto-assigned to Reuven as formal owner of the component. I
reassigned it to you.

On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:

> Hi
>
> I started looking at
> https://issues.apache.org/jira/browse/BEAM-2994
>
> and pushed some initial code to my tikaio branch introducing ParseResult
> and updating the tests but keeping the BounderSource/Reader, dropping
> the asynchronous parsing code, and few other bits.
>
> Just noticed it is assigned to Reuven - does it mean Reuven is looking
> into it too or was it auto-assigned ?
>
> I don't mind, would it make sense for me to do an 'interim' PR on
> what've done so far before completely removing BoundedSource/Reader
> based code ?
>
Yes :)


>
> I have another question anyway,
>
>
> > E.g. TikaIO could:
> > - take as input a PCollection<ReadableFile>
> > - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
> > is a class with properties { String content, Metadata metadata }
> > - be configured by: a Parser (it implements Serializable so can be
> > specified at pipeline construction time) and a ContentHandler whose
> > toString() will go into "content". ContentHandler does not implement
> > Serializable, so you can not specify it at construction time - however,
> you
> > can let the user specify either its class (if it's a simple handler like
> a
> > BodyContentHandler) or specify a lambda for creating the handler
> > (SerializableFunction<Void, ContentHandler>), and potentially you can
> have
> > a simpler facade for Tika.parseAsString() - e.g. call it
> > TikaIO.parseAllAsStrings().
> >
> > Example usage would look like:
> >
> >    PCollection<KV<String, ParseResult>> parseResults =
> > p.apply(FileIO.match().filepattern(...))
> >      .apply(FileIO.readMatches())
> >      .apply(TikaIO.parseAllAsStrings())
> >
> > or:
> >
> >      .apply(TikaIO.parseAll()
> >          .withParser(new AutoDetectParser())
> >          .withContentHandler(() -> new BodyContentHandler(new
> > ToXMLContentHandler())))
> >
> > You could also have shorthands for letting the user avoid using FileIO
> > directly in simple cases, for example:
> >      p.apply(TikaIO.parseAsStrings().from(filepattern))
> >
> > This would of course be implemented as a ParDo or even MapElements, and
> > you'll be able to share the code between parseAll and regular parse.
> >
> I'd like to understand how to do
>
> TikaIO.parse().from(filepattern)
>
> Right now I have TikaIO.Read extending
> PTransform<PBegin, PCollection<ParseResult>
>
> and then the boilerplate code which builds Read when I do something like
>
> TikaIO.read().from(filepattern).
>
> What is the convention for supporting something like
> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see
> some example ?
>
There are a number of IOs that don't use Source - e.g. DatastoreIO and
JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
Note that in TikaIO you probably won't need a fusion break after the ParDo
since there's 1 result per input file.


>
> Many thanks, Sergey
>

Reply via email to