Amazing work, thank you Sergey!!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development
Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
On 9/11/17, 7:33 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
What great news! Thank you, Sergey!!!
-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Monday, September 11, 2017 9:18 AM
To: Allison, Timothy B. <talli...@mitre.org>; dev@tika.apache.org
Subject: Re: Integrating Tika with Apache Beam
Hi Tim, All
It took some time, but the Beam TikaIO component is finally in the
2.2.0-SNAPSHOT master:
https://github.com/apache/beam/tree/master/sdks/java/io/tika
I've created a basic project which can help with running it quickly:
https://github.com/sberyozkin/beamTikaExample
Just build it and run it as suggested in Readme.md: have some files
at hand (PDFs, for example) and point it to one or all of them.
By default, Beam will output the data to /tmp/tika.
main() can be updated to support more options. They can be
collected either from the command line with TikaOptions:
https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/main/java/org/apache/beam/sdk/io/tika/TikaOptions.java
(all options except "--input" are optional)
or directly from the code, some variations are shown in the tests:
https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java
By default TikaReader uses an internal queue to make the SAX
events available to the Beam pipeline, which is why you can see
options like "queuePollTime", etc. If it is known that a given parser
only ever reads the whole text in a single operation, the process
can be optimized with 'parseSynchronously'.
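To illustrate the queue idea only, here is a minimal sketch that uses nothing but the JDK. It is not the actual TikaReader code: the class name QueuedSaxSketch, the method readText, the EOF marker and the queuePollTimeMs parameter are all hypothetical, and a plain JDK SAX parser stands in for a Tika parser. A background thread pushes the text from SAX character events onto a queue, and the caller polls the queue with a timeout.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class QueuedSaxSketch {
    // Hypothetical end-of-document marker, compared by identity below.
    private static final String EOF = new String("EOF");

    /**
     * Parses the XML on a background thread and returns the non-blank text
     * chunks in the order they were polled from the queue.
     */
    public static List<String> readText(String xml, long queuePollTimeMs) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicReference<Exception> failure = new AtomicReference<>();

        // Producer: the SAX handler enqueues text as the parser reports it.
        Thread parser = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void characters(char[] ch, int start, int length) {
                            String text = new String(ch, start, length).trim();
                            if (!text.isEmpty()) {
                                queue.add(text);
                            }
                        }
                    });
            } catch (Exception e) {
                failure.set(e);
            } finally {
                queue.add(EOF); // always tell the consumer that parsing has ended
            }
        });
        parser.start();

        // Consumer: poll with a timeout so the reading side never blocks forever.
        List<String> chunks = new ArrayList<>();
        while (true) {
            String next = queue.poll(queuePollTimeMs, TimeUnit.MILLISECONDS);
            if (next == null) {
                continue; // nothing available yet, poll again
            }
            if (next == EOF) {
                break;
            }
            chunks.add(next);
        }
        parser.join();
        if (failure.get() != null) {
            throw failure.get();
        }
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readText("<doc><p>Hello</p><p>Beam</p></doc>", 50));
    }
}
```

In this sketch the synchronous alternative would simply run the parse on the calling thread and collect the text into a list, with no queue and no second thread. That is only reasonable when the parser is known to deliver the content in one go.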
One can also try to update main() in the example to do more
interesting things than just print the data :-).
Please give it a try if you get a chance, and help make TikaIO a
major part of Beam :-) with PRs, etc.
Thanks, Sergey
On 25/05/17 17:47, Sergey Beryozkin wrote:
> Hi Guys
>
> The link to the initial code is available in JIRA, at this stage the
> focus is on preparing a solid initial PR, and then we can all improve
> Tika related code :-)
>
> Cheers, Sergey
> On 24/05/17 11:41, Sergey Beryozkin wrote:
>> Hi Tim, All,
>>
>> I thought I'd start a dedicated thread.
>>
>> I added some initial comments to [1], I'm quite close now to creating
>> the initial PR.
>>
>> Thanks, Sergey
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>> On 23/05/17 17:42, Allison, Timothy B. wrote:
>>> Another idea...if you have any interest, it would be great to get
>>> Apache Beam set up on our Rackspace VM (with Spark?) and use it for
>>> our regression tests?
>>>
>>> -----Original Message-----
>>> From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
>>> Sent: Friday, May 19, 2017 4:21 PM
>>> To: u...@tika.apache.org
>>> Subject: Re: Extracting Text from embedded images in PDF docs
>>>
>>> Hi Tim
>>>
>>> Sure, once I get an initial PR ready I'll send an update and I'll
>>> explain what I did for a start and we will discuss it further
>>>