I'm starting to move toward

class TikaIO {
  public static ParseAllToString parseAllToString() {..}

  static class ParseAllToString
      extends PTransform<PCollection<ReadableFile>, PCollection<ParseResult>> {
    ...configuration properties...

    @Override
    public PCollection<ParseResult> expand(PCollection<ReadableFile> input) {
      return input.apply(ParDo.of(new ParseToStringFn()));
    }

    static class ParseToStringFn extends DoFn<...> {...}
  }
}

as suggested by Eugene

The initial migration seems to work fine, except that ReadableFile, and in particular ReadableByteChannel, cannot be consumed by TikaInputStream yet (I'll open an enhancement request). Besides, it's better to let Tika do the unzipping if needed, given that a lot of effort has gone into Tika for detecting zip security issues...

So I'm typing it as

class ParseAllToString extends PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
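
For illustration only, here is a rough sketch (not the PR code) of what a Metadata-based ParseToStringFn could look like. The ParseResult(location, content, metadata) constructor is my assumption based on the discussion below, and TikaInputStream.get(Path) assumes the matched file resolves to a local path:

    // Sketch only: let Tika open the file itself, so its own detection
    // (including the zip handling mentioned above) sees the raw bytes.
    import java.io.InputStream;
    import java.nio.file.Paths;
    import org.apache.beam.sdk.io.fs.MatchResult;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.ToTextContentHandler;
    import org.xml.sax.ContentHandler;

    // Would be nested in ParseAllToString in practice.
    class ParseToStringFn extends DoFn<MatchResult.Metadata, ParseResult> {
      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        String location = c.element().resourceId().toString();
        org.apache.tika.metadata.Metadata tikaMetadata = new org.apache.tika.metadata.Metadata();
        ContentHandler handler = new ToTextContentHandler();
        try (InputStream stream = TikaInputStream.get(Paths.get(location))) {
          new AutoDetectParser().parse(stream, handler, tikaMetadata, new ParseContext());
          c.output(new ParseResult(location, handler.toString(), tikaMetadata));
        }
      }
    }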

Cheers, Sergey

On 02/10/17 12:03, Sergey Beryozkin wrote:
Thanks for the review, please see the last comment:

https://github.com/apache/beam/pull/3835#issuecomment-333502388

(Sorry for the possible duplication, but I'm not sure GitHub will propagate it, as I cannot see the comment there that I left on Saturday.)

Cheers, Sergey
On 29/09/17 10:21, Sergey Beryozkin wrote:
Hi
On 28/09/17 17:09, Eugene Kirpichov wrote:
Hi! Glad the refactoring is happening, thanks!

Thanks for getting me focused on having TikaIO support the simpler (and practical) cases first :-)
It was auto-assigned to Reuven as formal owner of the component. I
reassigned it to you.
OK, thanks...

On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:

Hi

I started looking at
https://issues.apache.org/jira/browse/BEAM-2994

and pushed some initial code to my tikaio branch, introducing ParseResult
and updating the tests while keeping the BoundedSource/Reader, dropping
the asynchronous parsing code, and a few other bits.

Just noticed it is assigned to Reuven - does that mean Reuven is looking
into it too, or was it auto-assigned?

I don't mind, but would it make sense for me to do an 'interim' PR on
what I've done so far before completely removing the BoundedSource/Reader
based code?

Yes :)

I committed to my branch yesterday, and it made its way into the pending PR (which I had forgotten about), where I had only tweaked a couple of doc typos, so I renamed that PR:

https://github.com/apache/beam/pull/3835

(The build failures are apparently due to the build timeouts)

As I mentioned, in this PR I updated the existing TikaIO test to work with ParseResult, which at the moment has a file location as one of its properties. A file name alone could easily be saved instead, but I thought it might be important to know where on the network the file is - maybe to copy it afterwards if needed, etc. I'd also have no problem with typing it as a K key; I was only trying to keep it a bit simpler at the start.
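
As a rough illustration only (the actual class in the PR may differ), ParseResult as discussed here would be a small value type along these lines, holding the file location, the extracted content, and the Tika metadata; the field and accessor names are my assumptions:

    // Sketch of a ParseResult value type (Tika's Metadata implements Serializable).
    public class ParseResult implements java.io.Serializable {
      private final String fileLocation;                         // where the file lives (path/URI)
      private final String content;                              // text produced by the ContentHandler
      private final org.apache.tika.metadata.Metadata metadata;  // Tika metadata for the file

      public ParseResult(String fileLocation, String content,
                         org.apache.tika.metadata.Metadata metadata) {
        this.fileLocation = fileLocation;
        this.content = content;
        this.metadata = metadata;
      }

      public String getFileLocation() { return fileLocation; }
      public String getContent() { return content; }
      public org.apache.tika.metadata.Metadata getMetadata() { return metadata; }
    }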

I'll deal with the new configuration options after the switch. TikaConfig would most likely still need to be supported, but I recall you mentioned that the way it's done now would make it work only with the direct runner. I guess I can load it as a URL resource... The other bits, like providing custom content handlers, parsers, input metadata, maybe setting the max size of the files, etc., can all be added after the switch.
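
Just to sketch the URL-resource idea (not what the PR currently does): the transform could keep only a serializable config location, e.g. a String URL, and build the non-serializable TikaConfig lazily on the worker, for instance in the DoFn's @Setup:

    // Sketch only: tikaConfigUrl would be a configuration property of the transform.
    private String tikaConfigUrl;             // e.g. an http(s) or file: URL; null means defaults
    private transient TikaConfig tikaConfig;  // rebuilt on each worker, never serialized

    @Setup
    public void setup() throws Exception {
      tikaConfig = (tikaConfigUrl == null)
          ? TikaConfig.getDefaultConfig()
          : new TikaConfig(new java.net.URL(tikaConfigUrl));
    }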

Note I haven't dealt with a number of your comments on the original code that could still be addressed in the current code, given that most of that code will go away with the next PR anyway.

Please review or merge if it looks like it is a step in the right direction...



I have another question anyway,


E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be specified at pipeline construction time) and a ContentHandler whose toString() will go into "content". ContentHandler does not implement Serializable, so you can not specify it at construction time - however, you can let the user specify either its class (if it's a simple handler like a BodyContentHandler) or specify a lambda for creating the handler (SerializableFunction<Void, ContentHandler>), and potentially you can have a simpler facade for Tika.parseAsString() - e.g. call it TikaIO.parseAllAsStrings().

Example usage would look like:

    PCollection<KV<String, ParseResult>> parseResults =
        p.apply(FileIO.match().filepattern(...))
         .apply(FileIO.readMatches())
         .apply(TikaIO.parseAllAsStrings());

or:

      .apply(TikaIO.parseAll()
          .withParser(new AutoDetectParser())
          .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO
directly in simple cases, for example:
      p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and
you'll be able to share the code between parseAll and regular parse.

I'd like to understand how to do

TikaIO.parse().from(filepattern)

Right now I have TikaIO.Read extending
PTransform<PBegin, PCollection<ParseResult>>

and then the boilerplate code which builds Read when I do something like

TikaIO.read().from(filepattern).

What is the convention for implementing something like
TikaIO.parse().from(filepattern) as a ParDo? Can I see
an example?

There are a number of IOs that don't use Source - e.g. DatastoreIO and
JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
Note that in TikaIO you probably won't need a fusion break after the ParDo
since there's 1 result per input file.
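
For what it's worth, the "no Source" convention those IOs follow is just a composite PTransform from PBegin whose expand() chains the existing transforms. A rough sketch, assuming the parseAll()/ParseResult names from this thread and a hypothetical filepattern property set via from(...):

    // Sketch only: TikaIO.parse().from(filepattern) as a composite over FileIO + parseAll().
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PBegin;
    import org.apache.beam.sdk.values.PCollection;

    // Would be a nested class of TikaIO in practice.
    public class Parse extends PTransform<PBegin, PCollection<ParseResult>> {
      private final String filepattern;  // assumed to be set by from(...)

      Parse(String filepattern) {
        this.filepattern = filepattern;
      }

      @Override
      public PCollection<ParseResult> expand(PBegin input) {
        return input
            .apply(FileIO.match().filepattern(filepattern))
            .apply(FileIO.readMatches())
            .apply(TikaIO.parseAll());
      }
    }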


OK, I'll have a look

Cheers, Sergey



Many thanks, Sergey

