Hi Eugene

Thanks, I've addressed some of the latest comments, and tried to justify why some are better left unaddressed (simplifying with Tika.parseToString(), removing TikaOptions - maybe later).

I'll focus on adding a shortcut per the suggestions below, then the improved ParseResult coder, then more configuration options, and will work through the review comments in between...

I'm traveling next week, so I'm not sure I'll have enough time to concentrate on this PR, but I will continue afterwards.

Cheers, Sergey
On 06/10/17 02:30, Eugene Kirpichov wrote:
On Thu, Oct 5, 2017 at 10:15 AM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:

Hi Eugene

I've done an initial commit towards removing TikaSource; more work is
needed and I see 3 tasks remaining:
1) provide a shortcut which lets users avoid using FileIO directly, as you
suggested earlier; at the moment I do:


https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java#L99

but would love to be able to type something like this in the simple cases

PCollection<ParseResult> output =
    p.apply(TikaIO.parseAll().from(filePattern));

Yup, makes sense.



(note I hope to convince you to keep it as parseAll() as opposed to
parseAllToString() :-) but it is a minor and separate issue).

What I don't understand here is how to do this shortcut without a Pipeline
instance. With explicit FileIO use it looks easy: one creates a pipeline,
applies FileIO to it, and then connects TikaIO via another apply(). But how
can TikaIO.parseAll().from(filePattern) be implemented so that TikaIO links
to FileIO internally, without .apply()?

It is a composite transform - it can construct arbitrarily complex stuff
inside expand(). TikaIO.parseAll()'s expand() method might look something
like:

ParseAll.expand(PCollection<String> filepatterns) {
   return filepatterns.apply(FileIO.matchAll())
       .apply(FileIO.readMatches().withCompression(UNCOMPRESSED))
       .apply(TikaIO.parseFiles());
}

ParseFiles.expand(PCollection<ReadableFile> files) {
   return files.apply(ParDo.of(... whatever you have there now ...));
}
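
And the from(filepattern) shortcut itself can be yet another composite, this
time rooted at PBegin, so no explicit Pipeline/FileIO wiring is needed on the
user's side. A rough sketch only - the class name, field and constructor here
are made up for illustration, while ParseResult and TikaIO.parseFiles() are
the ones from your branch:

import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;

// The shortcut needs no Pipeline instance: expand() receives the PBegin and
// builds the FileIO part of the graph internally.
class Parse extends PTransform<PBegin, PCollection<ParseResult>> {
  private final String filepattern; // what from(filepattern) would set

  Parse(String filepattern) {
    this.filepattern = filepattern;
  }

  @Override
  public PCollection<ParseResult> expand(PBegin input) {
    return input
        .apply(Create.of(filepattern))
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches().withCompression(Compression.UNCOMPRESSED))
        .apply(TikaIO.parseFiles());
  }
}

so TikaIO.parseAll().from(filePattern) would simply construct a transform
like this one.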



2) Optimize the ParseResult coder as you noted in the review (see the sketch
just below this list)

3) Finish it all by finalizing the configuration options (and enabling and
enhancing the display tests)
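
For 2), here is roughly the kind of hand-rolled coder I have in mind (just a
sketch - the ParseResult accessors and constructor are illustrative only, and
multi-valued Metadata entries are glossed over by keeping the first value):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.tika.metadata.Metadata;

// Encodes the content string and the Tika Metadata entries explicitly,
// instead of falling back to Java serialization.
class ParseResultCoder extends CustomCoder<ParseResult> {
  private static final StringUtf8Coder STRING_CODER = StringUtf8Coder.of();
  private static final VarIntCoder INT_CODER = VarIntCoder.of();

  @Override
  public void encode(ParseResult value, OutputStream out) throws IOException {
    STRING_CODER.encode(value.getContent(), out);
    String[] names = value.getMetadata().names();
    INT_CODER.encode(names.length, out);
    for (String name : names) {
      STRING_CODER.encode(name, out);
      STRING_CODER.encode(value.getMetadata().get(name), out);
    }
  }

  @Override
  public ParseResult decode(InputStream in) throws IOException {
    String content = STRING_CODER.decode(in);
    int size = INT_CODER.decode(in);
    Metadata metadata = new Metadata();
    for (int i = 0; i < size; i++) {
      metadata.set(STRING_CODER.decode(in), STRING_CODER.decode(in));
    }
    return new ParseResult(content, metadata);
  }
}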

Please have a look; I wonder if it makes sense to merge to master now, with
me doing a follow-up (and hopefully final) PR next.

Cheers, Sergey

On 28/09/17 17:09, Eugene Kirpichov wrote:
Hi! Glad the refactoring is happening, thanks!
It was auto-assigned to Reuven as formal owner of the component. I
reassigned it to you.

On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:

Hi

I started looking at
https://issues.apache.org/jira/browse/BEAM-2994

and pushed some initial code to my tikaio branch introducing ParseResult
and updating the tests, but keeping the BoundedSource/Reader, dropping
the asynchronous parsing code, and a few other bits.

I just noticed it is assigned to Reuven - does that mean Reuven is looking
into it too, or was it auto-assigned?

I don't mind; would it make sense for me to do an 'interim' PR on what I've
done so far, before completely removing the BoundedSource/Reader based code?

Yes :)



I have another question anyway,


E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is
a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be specified
at pipeline construction time) and a ContentHandler whose toString() will go
into "content". ContentHandler does not implement Serializable, so you
cannot specify it at construction time - however, you can let the user
specify either its class (if it's a simple handler like a BodyContentHandler)
or specify a lambda for creating the handler
(SerializableFunction<Void, ContentHandler>), and potentially you can have a
simpler facade for Tika.parseAsString() - e.g. call it
TikaIO.parseAllAsStrings().

Example usage would look like:

PCollection<KV<String, ParseResult>> parseResults =
    p.apply(FileIO.match().filepattern(...))
        .apply(FileIO.readMatches())
        .apply(TikaIO.parseAllAsStrings());

or:

    .apply(TikaIO.parseAll()
        .withParser(new AutoDetectParser())
        .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO
directly in simple cases, for example:
       p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and
you'll be able to share the code between parseAll and regular parse.
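
For illustration, the core of that ParDo could look roughly like this (a
sketch only - the ParseResult constructor and the way the filename key is
derived are assumptions, and error handling is omitted):

import java.io.InputStream;
import java.nio.channels.Channels;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;

class ParseToResultFn extends DoFn<FileIO.ReadableFile, KV<String, ParseResult>> {
  private final Parser parser;
  private final SerializableFunction<Void, ContentHandler> handlerFactory;

  ParseToResultFn(Parser parser,
      SerializableFunction<Void, ContentHandler> handlerFactory) {
    this.parser = parser;
    this.handlerFactory = handlerFactory;
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    FileIO.ReadableFile file = c.element();
    Metadata tikaMetadata = new Metadata();
    // ContentHandler is not Serializable, so a fresh one is created per
    // element from the user-supplied factory.
    ContentHandler handler = handlerFactory.apply(null);
    try (InputStream stream = Channels.newInputStream(file.open())) {
      parser.parse(stream, handler, tikaMetadata, new ParseContext());
    }
    String fileName = file.getMetadata().resourceId().toString();
    c.output(KV.of(fileName, new ParseResult(handler.toString(), tikaMetadata)));
  }
}

parseAllAsStrings() could then simply be a preset that plugs in an
AutoDetectParser and a BodyContentHandler factory.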

I'd like to understand how to do

TikaIO.parse().from(filepattern)

Right now I have TikaIO.Read extending
PTransform<PBegin, PCollection<ParseResult>>

and then the boilerplate code which builds Read when I do something like

TikaIO.read().from(filepattern).

What is the convention for supporting something like
TikaIO.parse().from(filepattern) implemented as a ParDo? Can I see an
example?

There are a number of IOs that don't use Source - e.g. DatastoreIO and
JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
Note that in TikaIO you probably won't need a fusion break after the
ParDo
since there's 1 result per input file.



Many thanks, Sergey



