Hi

I started looking at
https://issues.apache.org/jira/browse/BEAM-2994

and pushed some initial code to my tikaio branch: it introduces ParseResult, updates the tests, and drops the asynchronous parsing code and a few other bits, but keeps the BoundedSource/Reader for now.

I just noticed it is assigned to Reuven. Does that mean Reuven is looking into it too, or was it auto-assigned?

I don't mind either way. Would it make sense for me to do an 'interim' PR with what I've done so far, before completely removing the BoundedSource/Reader based code?

I have another question anyway, regarding this suggestion:

E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
is a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be
specified at pipeline construction time) and a ContentHandler whose
toString() will go into "content". ContentHandler does not implement
Serializable, so you cannot specify it at construction time - however, you
can let the user specify either its class (if it's a simple handler like a
BodyContentHandler) or specify a lambda for creating the handler
(SerializableFunction<Void, ContentHandler>), and potentially you can have
a simpler facade for Tika.parseAsString() - e.g. call it
TikaIO.parseAllAsStrings().
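
To illustrate the serializable-factory point in the list above, here is a minimal sketch in plain Java, with no Beam or Tika dependency. TextHandler, HandlerFactory and ParseConfig are hypothetical stand-ins (for ContentHandler, the SerializableFunction and the configured transform respectively), not real Beam or Tika types. The idea is that only the factory is serialized, and the handler itself is created after deserialization:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Supplier;

public class HandlerFactorySketch {

    /** Hypothetical stand-in for a ContentHandler: deliberately NOT Serializable. */
    static final class TextHandler {
        private final StringBuilder buffer = new StringBuilder();

        void characters(String chunk) {
            buffer.append(chunk);
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    /** A Supplier that is also Serializable, so a lambda assigned to it can be serialized. */
    interface HandlerFactory extends Supplier<TextHandler>, Serializable {}

    /** Hypothetical stand-in for the configured transform: it stores the factory, never a handler. */
    static final class ParseConfig implements Serializable {
        private static final long serialVersionUID = 1L;
        private final HandlerFactory handlerFactory;

        ParseConfig(HandlerFactory handlerFactory) {
            this.handlerFactory = handlerFactory;
        }

        /** Creates a fresh handler per input, feeds it the text, and returns its toString(). */
        String parse(String input) {
            TextHandler handler = handlerFactory.get();
            handler.characters(input);
            return handler.toString();
        }
    }

    /** Serializes and deserializes the config, roughly what happens when it is shipped to a worker. */
    static ParseConfig roundTrip(ParseConfig config) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(config);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (ParseConfig) in.readObject();
        }
    }

    /** Builds a config with a lambda factory, round-trips it, and parses the given text. */
    static String parseAfterRoundTrip(String input) throws IOException, ClassNotFoundException {
        ParseConfig config = new ParseConfig(() -> new TextHandler());
        return roundTrip(config).parse(input);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseAfterRoundTrip("hello"));
    }
}
```

Storing a TextHandler field directly in ParseConfig would make writeObject fail with NotSerializableException, which is why the sketch keeps only the factory.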

Example usage would look like:

   PCollection<KV<String, ParseResult>> parseResults =
       p.apply(FileIO.match().filepattern(...))
        .apply(FileIO.readMatches())
        .apply(TikaIO.parseAllAsStrings())

or:

     .apply(TikaIO.parseAll()
         .withParser(new AutoDetectParser())
         .withContentHandler(() -> new BodyContentHandler(
             new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO
directly in simple cases, for example:
     p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and
you'll be able to share the code between parseAll and regular parse.

I'd like to understand how to do

TikaIO.parse().from(filepattern)

Right now I have TikaIO.Read extending
PTransform<PBegin, PCollection<ParseResult>>

and then the boilerplate code which builds Read when I do something like

TikaIO.read().from(filepattern).

What is the convention for implementing something like
TikaIO.parse().from(filepattern) as a ParDo? Could I see an example?

Many thanks, Sergey
