[
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022676#comment-16022676
]
Sergey Beryozkin commented on BEAM-2328:
----------------------------------------
Apache Tika parsers report the parsed content via SAX events; see
https://tika.apache.org/1.14/.
I'm implementing a TikaReader that adapts the sequence of SAX events to the
streaming BoundedReader API by using an internal ExecutorService and a
ConcurrentLinkedQueue. Thus, when the Beam thread comes in and calls start() and
then advance(), it won't have to parse the given file content right away. A
good number of Tika parsers can report the data in chunks, so the proposed
TikaReader implementation should be quite efficient.
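
A minimal sketch of the push-to-pull adaptation described above, not the actual
TikaReader code. Assumptions: the class name SaxQueueReader is hypothetical, the
JDK's own SAX parser stands in for a Tika parser, and the start()/advance()/
getCurrent() methods only mirror the shape of Beam's BoundedReader without
extending it. A background thread pushes text chunks into the queue while the
caller pulls them one at a time.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: a pull-style reader over push-style SAX events.
final class SaxQueueReader implements AutoCloseable {
    private final Queue<String> chunks = new ConcurrentLinkedQueue<>();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final InputStream input;
    private Future<?> parseTask;
    private String current;

    SaxQueueReader(InputStream input) {
        this.input = input;
    }

    // Kicks off parsing on the internal thread, then waits for the first chunk.
    boolean start() throws Exception {
        parseTask = executor.submit(() -> {
            // Assumption: the JDK SAX parser stands in for a Tika parser here.
            SAXParserFactory.newInstance().newSAXParser().parse(input, new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    String text = new String(ch, start, length).trim();
                    if (!text.isEmpty()) {
                        chunks.add(text); // producer side: one entry per text event
                    }
                }
            });
            return null;
        });
        return advance();
    }

    // Moves to the next chunk; returns false once parsing is done and the queue is empty.
    boolean advance() throws Exception {
        while (true) {
            String next = chunks.poll();
            if (next == null && parseTask.isDone()) {
                parseTask.get();      // rethrows a parse failure, if any
                next = chunks.poll(); // pick up a chunk added just before completion
                if (next == null) {
                    return false;
                }
            }
            if (next != null) {
                current = next;
                return true;
            }
            Thread.sleep(5); // queue is non-blocking, so wait briefly and retry
        }
    }

    String getCurrent() {
        return current;
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<doc><p>first</p><p>second</p></doc>".getBytes(StandardCharsets.UTF_8);
        try (SaxQueueReader reader = new SaxQueueReader(new ByteArrayInputStream(xml))) {
            for (boolean more = reader.start(); more; more = reader.advance()) {
                System.out.println(reader.getCurrent());
            }
        }
    }
}
```

Because ConcurrentLinkedQueue never blocks, the consumer in this sketch polls
with a short sleep; a BlockingQueue would be one way to avoid that, at the cost
of departing from the queue type mentioned above.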
Unfortunately, I cannot extend the FileBasedSource/Reader helpers, given that Tika
parsers need full control of the InputStream. However, should
the PR be accepted, I would definitely see some scope for reusing some of the
currently private FileBasedSource/Reader helpers, for example the
composite reader which is used when multiple files are picked up.
Right now I have reasonably good starting code, IMHO, with TikaInputTest
covering reads of a PDF, a zipped PDF, an ODT and two ODT files, with the content
and, optionally, the parsed-out metadata also being streamed.
Some of the code I copied from FileBasedSource might be suboptimal when applied
to the Tika case. I hope that if the PR eventually gets accepted then, with the
help of Tika experts, there will no doubt be more improvements coming in.
Planning to work on creating a branch and a PR soon, cheers
> Introduce Apache Tika Input component
> -------------------------------------
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
> Issue Type: New Feature
> Components: sdk-ideas
> Reporter: Sergey Beryozkin
> Assignee: Davor Bonaci
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers extensive support for parsing
> a variety of file formats. It is used in many projects, including Lucene and
> Elasticsearch.
> Supporting a Tika Input (Read) at the Beam level would be of major interest
> to many users.
> A PR is to follow.