Re: TikaIO Refactoring

Sergey Beryozkin Thu, 05 Oct 2017 10:12:56 -0700

Forgot to remove trim(), but will do it too

Sergey
On 05/10/17 18:09, Sergey Beryozkin wrote:

Hi Eugene

I've done an initial commit to do with removing TikaSource, more work isneeded and I see 3 tasks remaining:1) provide a shortcut which can let users avoid using FileIO directly,as you suggested earlier, at the moment I do:

https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java#L99


but would love to be able to type something like this in the simple cases

PCollection<ParseResult> output =
         p.apply(TikaIO.parseAll().from(filePattern));

(note I hope to convince you to keep it as parseAll() as opposed toparseAllToString() :-) but it is a minor and separate issue).

What I don't understand here is how to do this shortcut without aPipeline instance, i.e, with explicit FileIO use it looks easy, onecreates a pipeline and then one applies to it FileIO and then connectsTikaIO via another apply(), but how to implementTikaIO.parseAll().from(filePattern) such that TikaIO links to FileIOinternally without .apply ?


2) Optimize ParseResult coder as you noted in the review

3) Finish it all with finalizing the configuration options (and enablingand enhancing display tests)

Have a look please, I wonder if it makes sense to merge to the masternow for me to do a follow up (and hopefully final) PR next


Cheers, Sergey

On 28/09/17 17:09, Eugene Kirpichov wrote:

Hi! Glad the refactoring is happening, thanks!
It was auto-assigned to Reuven as formal owner of the component. I
reassigned it to you.

On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <[email protected]>
wrote:

Hi

I started looking at
https://issues.apache.org/jira/browse/BEAM-2994

and pushed some initial code to my tikaio branch introducing ParseResult
and updating the tests but keeping the BounderSource/Reader, dropping
the asynchronous parsing code, and few other bits.

Just noticed it is assigned to Reuven - does it mean Reuven is looking
into it too or was it auto-assigned ?

I don't mind, would it make sense for me to do an 'interim' PR on
what've done so far before completely removing BoundedSource/Reader
based code ?

Yes :)


I have another question anyway,

E.g. TikaIO could:
- take as input a PCollection<ReadableFile>

- return a PCollection<KV<String, TikaIO.ParseResult>>, whereParseResult

is a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be
specified at pipeline construction time) and a ContentHandler whose
toString() will go into "content". ContentHandler does not implement
Serializable, so you can not specify it at construction time - however,

you

can let the user specify either its class (if it's a simple handlerlike

BodyContentHandler) or specify a lambda for creating the handler
(SerializableFunction<Void, ContentHandler>), and potentially you can

have

a simpler facade for Tika.parseAsString() - e.g. call it
TikaIO.parseAllAsStrings().

Example usage would look like:

    PCollection<KV<String, ParseResult>> parseResults =
p.apply(FileIO.match().filepattern(...))
      .apply(FileIO.readMatches())
      .apply(TikaIO.parseAllAsStrings())

or:

      .apply(TikaIO.parseAll()
          .withParser(new AutoDetectParser())
          .withContentHandler(() -> new BodyContentHandler(new
ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO
directly in simple cases, for example:
      p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and
you'll be able to share the code between parseAll and regular parse.

I'd like to understand how to do

TikaIO.parse().from(filepattern)

Right now I have TikaIO.Read extending
PTransform<PBegin, PCollection<ParseResult>

and then the boilerplate code which builds Read when I do something like

TikaIO.read().from(filepattern).

What is the convention for supporting something like
TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see
some example ?

There are a number of IOs that don't use Source - e.g. DatastoreIO and
JdbcIO. TextIO.readMatches() might be an even better transform to mimic.

Note that in TikaIO you probably won't need a fusion break after theParDo

since there's 1 result per input file.


Many thanks, Sergey

Re: TikaIO Refactoring

Reply via email to