Can TikaInputStream consume a regular InputStream? If so, you can apply it
to Channels.newInputStream(channel). If not, applying it to the filename
extracted from Metadata won't work either, because that filename can point
to a file that's not on the local disk.
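(For reference, the java.nio half of this is standard: a ReadableByteChannel can always be bridged back to a plain InputStream via Channels.newInputStream. Whether TikaInputStream then accepts that stream is the open question above. A minimal, stdlib-only sketch, with an in-memory channel standing in for what ReadableFile#open() would yield:)

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class ChannelToStream {
    // Bridge a ReadableByteChannel (what ReadableFile#open() yields in Beam)
    // back to the plain InputStream that Tika's parse methods expect.
    static InputStream toInputStream(ReadableByteChannel channel) {
        return Channels.newInputStream(channel);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello tika".getBytes("UTF-8");
        // In-memory channel standing in for a file opened via FileIO.
        ReadableByteChannel channel = Channels.newChannel(new ByteArrayInputStream(data));
        try (InputStream stream = toInputStream(channel)) {
            // If TikaInputStream can wrap a regular stream, the next step
            // would be handing `stream` to it - the question raised above.
            System.out.println(new String(stream.readAllBytes(), "UTF-8"));
        }
    }
}
```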

On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sberyoz...@gmail.com> wrote:

> I'm starting to move toward
>
> class TikaIO {
>    public static ParseAllToString parseAllToString() {..}
>    class ParseAllToString extends PTransform<PCollection<ReadableFile>,
> PCollection<ParseResult>> {
>      ...configuration properties...
>      expand {
>        return input.apply(ParDo.of(new ParseToStringFn))
>      }
>      class ParseToStringFn extends DoFn<...> {...}
>    }
> }
>
> as suggested by Eugene
>
> The initial migration seems to work fine, except that ReadableFile and,
> in particular, ReadableByteChannel cannot be consumed by
> TikaInputStream yet (I'll open an enhancement request). Besides, it's
> better to let Tika do the unzipping if needed, given that a lot of
> effort has gone into Tika's detection of zip security issues...
>
> So I'm typing it as
>
> class ParseAllToString extends
> PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
>
> Cheers, Sergey
>
> On 02/10/17 12:03, Sergey Beryozkin wrote:
> > Thanks for the review, please see the last comment:
> >
> > https://github.com/apache/beam/pull/3835#issuecomment-333502388
> >
> > (sorry for the possible duplication - but I'm not sure that GitHub will
> > propagate it, as I cannot see the comment there that I left on Saturday).
> >
> > Cheers, Sergey
> > On 29/09/17 10:21, Sergey Beryozkin wrote:
> >> Hi
> >> On 28/09/17 17:09, Eugene Kirpichov wrote:
> >>> Hi! Glad the refactoring is happening, thanks!
> >>
> >> Thanks for getting me focused on having TikaIO supporting the simpler
> >> (and practical) cases first :-)
> >>> It was auto-assigned to Reuven as formal owner of the component. I
> >>> reassigned it to you.
> >> OK, thanks...
> >>>
> >>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin
> >>> <sberyoz...@gmail.com> wrote:
> >>>
> >>>> Hi
> >>>>
> >>>> I started looking at
> >>>> https://issues.apache.org/jira/browse/BEAM-2994
> >>>>
> >>>> and pushed some initial code to my tikaio branch introducing
> >>>> ParseResult
> >>>> and updating the tests but keeping the BoundedSource/Reader, dropping
> >>>> the asynchronous parsing code, and few other bits.
> >>>>
> >>>> Just noticed it is assigned to Reuven - does it mean Reuven is looking
> >>>> into it too or was it auto-assigned ?
> >>>>
> >>>> I don't mind, but would it make sense for me to do an 'interim' PR on
> >>>> what I've done so far before completely removing the
> >>>> BoundedSource/Reader based code?
> >>>>
> >>> Yes :)
> >>>
> >> I did commit yesterday to my branch, and it made its way to the
> >> pending PR (which I forgot about) where I only tweaked a couple of doc
> >> typos, so I renamed that PR:
> >>
> >> https://github.com/apache/beam/pull/3835
> >>
> >> (The build failures are apparently due to the build timeouts)
> >>
> >> As I mentioned, in this PR I updated the existing TikaIO test to work
> >> with ParseResult, which at the moment has a file location as its
> >> property. A file name alone could easily be saved, but I thought it
> >> might be important to know where on the network the file is - maybe
> >> to copy it afterwards if needed, etc. I'd also have no problem with
> >> having it typed as a K key; I was only trying to make it a bit
> >> simpler at the start.
> >>
> >> I'll deal with the new configurations after the switch. TikaConfig
> >> would most likely still need to be supported, but I recall you
> >> mentioned the way it's done now will make it work only with the
> >> direct runner. I guess I can load it as a URL resource... The other
> >> bits, like providing custom content handlers, parsers, input
> >> metadata, maybe setting the max size of the files, etc, can all be
> >> added after the switch.
> >>
> >> Note I haven't dealt with a number of your comments on the original
> >> code; they can still be addressed in the current code, given that
> >> most of that code will go with the next PR anyway.
> >>
> >> Please review or merge if it looks like it is a step in the right
> >> direction...
> >>
> >>>
> >>>>
> >>>> I have another question anyway,
> >>>>
> >>>>
> >>>>> E.g. TikaIO could:
> >>>>> - take as input a PCollection<ReadableFile>
> >>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
> >>>>> ParseResult
> >>>>> is a class with properties { String content, Metadata metadata }
> >>>>> - be configured by: a Parser (it implements Serializable so can be
> >>>>> specified at pipeline construction time) and a ContentHandler whose
> >>>>> toString() will go into "content". ContentHandler does not implement
> >>>>> Serializable, so you cannot specify it at construction time -
> >>>>> however, you can let the user specify either its class (if it's a
> >>>>> simple handler like a BodyContentHandler) or specify a lambda for
> >>>>> creating the handler (SerializableFunction<Void, ContentHandler>),
> >>>>> and potentially you can have a simpler facade for
> >>>>> Tika.parseAsString() - e.g. call it TikaIO.parseAllAsStrings().
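(The "serializable factory for a non-serializable object" idea above can be shown with the stdlib alone. SerializableSupplier below is a hypothetical stand-in for Beam's SerializableFunction, and StringBuilder stands in for a non-Serializable ContentHandler - a sketch of the pattern, not Beam's actual API:)

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Supplier;

public class SerializableFactoryDemo {
    // Hypothetical stand-in for SerializableFunction<Void, ContentHandler>:
    // a factory interface that is itself Serializable.
    interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

    public static void main(String[] args) throws Exception {
        // The lambda (the factory) is serializable even though the object
        // it produces is not; StringBuilder stands in for a ContentHandler.
        SerializableSupplier<StringBuilder> factory = StringBuilder::new;

        // Round-trip the factory through Java serialization, roughly what
        // a runner does when shipping the pipeline to workers.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(factory);
        }
        ObjectInputStream ois =
            new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        @SuppressWarnings("unchecked")
        SerializableSupplier<StringBuilder> revived =
            (SerializableSupplier<StringBuilder>) ois.readObject();

        // Each worker then creates its own fresh handler from the factory.
        StringBuilder handler = revived.get();
        handler.append("content");
        System.out.println(handler);
    }
}
```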
> >>>>>
> >>>>> Example usage would look like:
> >>>>>
> >>>>>     PCollection<KV<String, ParseResult>> parseResults =
> >>>>> p.apply(FileIO.match().filepattern(...))
> >>>>>       .apply(FileIO.readMatches())
> >>>>>       .apply(TikaIO.parseAllAsStrings())
> >>>>>
> >>>>> or:
> >>>>>
> >>>>>       .apply(TikaIO.parseAll()
> >>>>>           .withParser(new AutoDetectParser())
> >>>>>           .withContentHandler(() -> new BodyContentHandler(new
> >>>>> ToXMLContentHandler())))
> >>>>>
> >>>>> You could also have shorthands for letting the user avoid using
> >>>>> FileIO directly in simple cases, for example:
> >>>>>       p.apply(TikaIO.parseAsStrings().from(filepattern))
> >>>>>
> >>>>> This would of course be implemented as a ParDo or even MapElements,
> >>>>> and
> >>>>> you'll be able to share the code between parseAll and regular parse.
> >>>>>
> >>>> I'd like to understand how to do
> >>>>
> >>>> TikaIO.parse().from(filepattern)
> >>>>
> >>>> Right now I have TikaIO.Read extending
> >>>> PTransform<PBegin, PCollection<ParseResult>
> >>>>
> >>>> and then the boilerplate code which builds Read when I do something
> >>>> like
> >>>>
> >>>> TikaIO.read().from(filepattern).
> >>>>
> >>>> What is the convention for supporting something like
> >>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I
> >>>> see
> >>>> some example ?
> >>>>
> >>> There are a number of IOs that don't use Source - e.g. DatastoreIO and
> >>> JdbcIO. TextIO.readMatches() might be an even better transform to
> >>> mimic. Note that in TikaIO you probably won't need a fusion break
> >>> after the ParDo, since there's one result per input file.
> >>>
> >>
> >> OK, I'll have a look
> >>
> >> Cheers, Sergey
> >>
> >>>
> >>>>
> >>>> Many thanks, Sergey
> >>>>
> >>>
>