[
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051834#comment-16051834
]
Sergey Beryozkin commented on BEAM-2328:
----------------------------------------
Hi All,
The initial cleanup of the 'tikaio' branch is now complete (with thanks to JB)
and the commits are squashed, so I'm now proceeding to create the first PR. I'd
like to ask JB to review it; feedback from the rest of the team is also welcome.
[[email protected]] Hi Tim, I hope that if the team accepts this PR we can
improve TikaReader further :-). I'm not sure whether more work will be needed
to better report the embedded attachments inside a given PDF, etc., or whether
further ParseContext customizations may be needed; the input metadata and
TikaConfig are already covered, though. Concatenating multiple SAX content bits
into fragments of a minimum length can also be supported optionally later on,
if needed.
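
To illustrate the fragment concatenation idea only: this is a minimal sketch
and not the code on the 'tikaio' branch. The class name
MinLengthFragmentHandler and its methods are hypothetical, and it assumes the
text arrives through the standard SAX characters() callback. It buffers the
text events and emits a fragment once at least minLength characters have
accumulated:

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of TikaIO: merges small SAX text events
// into fragments that are at least minLength characters long.
class MinLengthFragmentHandler extends DefaultHandler {
    private final int minLength;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> fragments = new ArrayList<>();

    MinLengthFragmentHandler(int minLength) {
        this.minLength = minLength;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Accumulate text until the minimum fragment length is reached.
        buffer.append(ch, start, length);
        if (buffer.length() >= minLength) {
            flush();
        }
    }

    @Override
    public void endDocument() {
        // Emit whatever is left, even if it is shorter than minLength.
        flush();
    }

    private void flush() {
        if (buffer.length() > 0) {
            fragments.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    List<String> getFragments() {
        return fragments;
    }
}
```

Only the last fragment can be shorter than minLength, since it holds whatever
remains when the document ends.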
thanks
> Introduce Apache Tika Input component
> -------------------------------------
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
> Issue Type: New Feature
> Components: sdk-ideas, sdk-java-extensions
> Reporter: Sergey Beryozkin
> Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers extensive support for parsing
> a wide variety of file formats. It is used in many projects, including Lucene
> and Elasticsearch.
> Supporting a Tika Input (Read) at the Beam level would be of major interest
> to many users.
> PR is to follow
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)