Thanks for the review, please see the last comment:

https://github.com/apache/beam/pull/3835#issuecomment-333502388

(Sorry for the possible duplication - I'm not sure GitHub will propagate it, as I cannot see a comment there that I left on Saturday.)

Cheers, Sergey
On 29/09/17 10:21, Sergey Beryozkin wrote:
Hi
On 28/09/17 17:09, Eugene Kirpichov wrote:
Hi! Glad the refactoring is happening, thanks!

Thanks for getting me focused on having TikaIO support the simpler (and practical) cases first :-)
It was auto-assigned to Reuven as formal owner of the component. I
reassigned it to you.
OK, thanks...

On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:

Hi

I started looking at
https://issues.apache.org/jira/browse/BEAM-2994

and pushed some initial code to my tikaio branch, introducing ParseResult
and updating the tests, but keeping the BoundedSource/Reader, dropping
the asynchronous parsing code, and a few other bits.

I just noticed it is assigned to Reuven - does that mean Reuven is looking
into it too, or was it auto-assigned?

I don't mind, but would it make sense for me to do an 'interim' PR on
what I've done so far before completely removing the BoundedSource/Reader-based
code?

Yes :)

I committed to my branch yesterday, and it made its way into the pending PR (which I had forgotten about), where I had only tweaked a couple of doc typos, so I renamed that PR:

https://github.com/apache/beam/pull/3835

(The build failures are apparently due to the build timeouts)

As I mentioned, in this PR I updated the existing TikaIO test to work with ParseResult, which at the moment has a file location as its property. Only a file name could easily be saved, but I thought it might be important to know where on the network the file is - maybe to copy it afterwards if needed, etc. I'd also have no problem with having it typed as a K key; I was only trying to make it a bit simpler at the start.

I'll deal with the new configurations after the switch. TikaConfig would most likely still need to be supported, but I recall you mentioned that the way it's done now will make it work only with the direct runner. I guess I can load it as a URL resource... The other bits, like providing custom content handlers, parsers, input metadata, maybe setting the max size of the files, etc., can all be added after the switch.

Note I haven't dealt with a number of your comments on the original code which could still be addressed in the current code - given that most of that code will go with the next PR anyway.

Please review or merge if it looks like it is a step in the right direction...



I have another question anyway,


E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
is a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be
specified at pipeline construction time) and a ContentHandler whose
toString() will go into "content". ContentHandler does not implement
Serializable, so you cannot specify it at construction time - however,
you can let the user specify either its class (if it's a simple handler
like a BodyContentHandler) or specify a lambda for creating the handler
(SerializableFunction<Void, ContentHandler>), and potentially you can
have a simpler facade for Tika.parseAsString() - e.g. call it
TikaIO.parseAllAsStrings().
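
For illustration, a minimal sketch of what such a ParseResult value class
could look like (field and accessor names here are assumptions, not a final
API):

  import java.io.Serializable;
  import org.apache.tika.metadata.Metadata;

  /** Hypothetical value class holding the result of parsing one file. */
  public class ParseResult implements Serializable {
    private final String content;    // e.g. the ContentHandler's toString()
    private final Metadata metadata; // Tika metadata extracted from the file

    public ParseResult(String content, Metadata metadata) {
      this.content = content;
      this.metadata = metadata;
    }

    public String getContent() { return content; }
    public Metadata getMetadata() { return metadata; }
  }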

Example usage would look like:

    PCollection<KV<String, ParseResult>> parseResults =
        p.apply(FileIO.match().filepattern(...))
         .apply(FileIO.readMatches())
         .apply(TikaIO.parseAllAsStrings())

or:

      .apply(TikaIO.parseAll()
          .withParser(new AutoDetectParser())
          .withContentHandler(
              () -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO
directly in simple cases, for example:
      p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and
you'll be able to share the code between parseAll and regular parse.
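
For illustration, the shared ParDo could be sketched roughly like this
(class and method names are assumptions, not actual TikaIO code):

  import java.io.InputStream;
  import java.nio.channels.Channels;
  import org.apache.beam.sdk.io.FileIO;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.sax.BodyContentHandler;

  // Hypothetical DoFn that both parseAll() and a filepattern-based parse()
  // could share: it opens each matched file, runs it through Tika, and emits
  // the file's resource id keyed to the parse result.
  static class ParseToStringFn
      extends DoFn<FileIO.ReadableFile, KV<String, ParseResult>> {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
      FileIO.ReadableFile file = c.element();
      try (InputStream stream = Channels.newInputStream(file.open())) {
        Metadata metadata = new Metadata();
        // Note: the default BodyContentHandler has a write limit that a real
        // implementation would likely need to make configurable.
        BodyContentHandler handler = new BodyContentHandler();
        new AutoDetectParser().parse(stream, handler, metadata);
        c.output(KV.of(file.getMetadata().resourceId().toString(),
            new ParseResult(handler.toString(), metadata)));
      }
    }
  }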

I'd like to understand how to do

TikaIO.parse().from(filepattern)

Right now I have TikaIO.Read extending
PTransform<PBegin, PCollection<ParseResult>>

and then the boilerplate code which builds Read when I do something like

TikaIO.read().from(filepattern).

What is the convention for supporting something like
TikaIO.parse().from(filepattern) being implemented as a ParDo? Can I see
an example?

There are a number of IOs that don't use Source - e.g. DatastoreIO and
JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
Note that in TikaIO you probably won't need a fusion break after the ParDo
since there's 1 result per input file.
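
For illustration, a hypothetical Source-free expansion along those lines,
reusing the ParseToStringFn DoFn sketched above (again, names are assumptions,
not the existing TikaIO code):

  import org.apache.beam.sdk.io.FileIO;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PBegin;
  import org.apache.beam.sdk.values.PCollection;

  // Sketch of parse().from(filepattern) expanding into FileIO.match() /
  // readMatches() plus the shared ParDo, with no BoundedSource involved.
  public static class Parse
      extends PTransform<PBegin, PCollection<KV<String, ParseResult>>> {
    private final String filepattern;

    Parse(String filepattern) {
      this.filepattern = filepattern;
    }

    @Override
    public PCollection<KV<String, ParseResult>> expand(PBegin input) {
      return input
          .apply(FileIO.match().filepattern(filepattern))
          .apply(FileIO.readMatches())
          .apply(ParDo.of(new ParseToStringFn()));
    }
  }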


OK, I'll have a look

Cheers, Sergey



Many thanks, Sergey

