[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051839#comment-16051839 ]
ASF GitHub Bot commented on BEAM-2328:
--------------------------------------
GitHub user sberyozkin opened a pull request:
https://github.com/apache/beam/pull/3378
[BEAM-2328] Add TikaIO component
R: @jbonofre
Adding TikaSource and TikaReader tests
Updating TikaReader to use TikaInputStream as suggested by Tim Allison
Supporting the customization of TikaConfig
Cleanup:
Moving 'tika' above 'xml' in io/pom.xml to keep the correct order
Renaming TikaInput to TikaIO, adding Read.withOptions, throwing
NoSuchElementException if the current is null
Removing redundant test annotations
Fixing TikaIO JavaDoc typo
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sberyozkin/beam tikaio
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/beam/pull/3378.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3378
----
commit 8c63d91c0a088e2d90d5572051f736f24ea338b5
Author: Sergey Beryozkin <[email protected]>
Date: 2017-05-25T15:47:59Z
Adding TikaIO component
Enforcing that start is called before advance
Adding TikaSource and TikaReader tests
Updating TikaReader to use TikaInputStream as suggested by Tim Allison
Supporting the customization of TikaConfig
Moving 'tika' above 'xml' in io/pom.xml to keep the correct order
Renaming TikaInput to TikaIO, adding Read.withOptions, throwing
NoSuchElementException if the current is null
Removing redundant test annotations
Fixing TikaIO JavaDoc typo
----
> Introduce Apache Tika Input component
> -------------------------------------
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
> Issue Type: New Feature
> Components: sdk-ideas, sdk-java-extensions
> Reporter: Sergey Beryozkin
> Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers extensive support for parsing
> a wide variety of file formats. It is used in many projects, including Lucene
> and Elasticsearch.
> Supporting a Tika Input (Read) at the Beam level would be of major interest
> to many users.
> A PR is to follow.