[
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032825#comment-16032825
]
Sergey Beryozkin commented on BEAM-2328:
----------------------------------------
I've added some TikaReader and TikaSource tests. The Tika version was updated
to 1.15 (released by [[email protected]]) and commons-compress to 1.14 (see
TIKA-2099 for example).
In general I'd like to keep the initial contribution isolated, and then follow
up later with some optimizations that would affect other Beam modules.
Specifically, the two most immediate follow-up PRs would be:
1. updating the managed Beam commons-compress dependency to 1.14 and removing
   the version from tika/pom.xml;
2. refactoring the FileBasedSource composite reader a bit so that its code can
   be reused by TikaSource.
The last thing I'd like to investigate for a start is what may need to be done
around non-UTF-8 charsets. I don't expect TikaReader to produce anything other
than Strings, though.
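To illustrate the charset concern, here is a minimal JDK-only sketch. It is not
Tika or Beam code; the class name CharsetSketch and the helper decode are
hypothetical and exist only for this example. It shows that once bytes are
decoded with the right charset the resulting String is the same whatever the
source encoding was, and that decoding with the wrong charset changes the text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetSketch {

    // Hypothetical helper: decode raw bytes with the charset they were written in.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", escaped to avoid source-encoding issues

        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // The byte representations differ between the two encodings...
        System.out.println(latin1.length != utf8.length);

        // ...but decoding each with its own charset yields the same String.
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1)
                .equals(decode(utf8, StandardCharsets.UTF_8)));

        // Decoding with the wrong charset does not round-trip.
        System.out.println(decode(latin1, StandardCharsets.UTF_8).equals(text));
    }
}
```

So if the reader only ever emits Strings, the open question is limited to
whether the correct source charset is detected or supplied before decoding.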
I'm away next week and will start preparing the initial PR shortly afterwards.
> Introduce Apache Tika Input component
> -------------------------------------
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
> Issue Type: New Feature
> Components: sdk-ideas, sdk-java-extensions
> Reporter: Sergey Beryozkin
> Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers extensive support for parsing
> a variety of file formats. It is used in many projects including Lucene and
> Elasticsearch.
> Supporting a Tika Input (Read) at the Beam level would be of major interest
> to many users.
> PR is to follow
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)