Re: Integrating Tika with Apache Beam

Sergey Beryozkin Thu, 21 Sep 2017 02:52:58 -0700

Hi Guys

TikaIO is getting some serious attention now on the Beam dev list, and unfortunately it is not all about it being a great addition to Beam.

The team is wondering what one can do with TikaIO that someone could not do with just a custom Beam function.

TikaIO, like any other bounded text reader, will produce the data in an ordered way, but the Beam runtime can deliver it to the pipeline completely unordered.

I gave one example where we used the Tika output to save it all to Lucene (with the file name associated) and then search for the files which contain a certain word.
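
To make that example concrete, here is a minimal sketch in plain Java. It is not Lucene and not TikaIO: FileWordIndex is a hypothetical stand-in for the index, and the file names and text are made up. It only illustrates that when every chunk of text carries its file name, the search result does not depend on the order in which the chunks arrive.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for a Lucene index: maps each word to the files containing it.
final class FileWordIndex {
    private final Map<String, Set<String>> filesByWord = new TreeMap<>();

    // Adds one chunk of extracted text, tagged with the file it came from.
    void add(String fileName, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                filesByWord.computeIfAbsent(word, w -> new TreeSet<>()).add(fileName);
            }
        }
    }

    // Returns the names of the files that contain the given word.
    Set<String> filesContaining(String word) {
        return filesByWord.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        // Chunks arrive in document order.
        FileWordIndex inOrder = new FileWordIndex();
        inOrder.add("a.pdf", "Apache Tika extracts text");
        inOrder.add("a.pdf", "from many formats");
        inOrder.add("b.pdf", "Apache Beam runs pipelines");

        // The same chunks arrive in a different order.
        FileWordIndex shuffled = new FileWordIndex();
        shuffled.add("b.pdf", "Apache Beam runs pipelines");
        shuffled.add("a.pdf", "from many formats");
        shuffled.add("a.pdf", "Apache Tika extracts text");

        System.out.println(inOrder.filesContaining("apache"));  // [a.pdf, b.pdf]
        System.out.println(shuffled.filesContaining("apache")); // [a.pdf, b.pdf]
    }
}
```

Both indexes answer the query identically, which is the property that makes this kind of use case insensitive to ordering.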

Tim, Chris, others, if you have some interesting examples to share where it did not matter in which order Tika-produced data were eventually made available, then please let me know, or reply directly to the Beam dev thread titled "TikaIO concerns".

Note, if the Beam devs decide they don't want it then one option can be to create a tika-integrations/beam module and experiment there - I'm not saying it will need to be done but it's something that may be worth considering.


Sergey
On 15/09/17 12:02, Sergey Beryozkin wrote:

Hi Chris

thanks,
at the moment TikaIO (originally named TikaReader as it can only read, but we renamed it to follow the convention) is a bounded reader, so you can, say, ask it to read
/files/*.pdf

and it will read all the N files there, and will end the run.
I'm not sure yet what the best strategy is for making it an unbounded reader which can continuously poll or be notified of new files becoming available... There are some ideas about scheduling bounded Beam pipelines, I haven't looked yet...
In the short term, the simplest solution would be to create a new instance of the TikaIO pipeline and point it to the new temp folder where a new batch of files has been dropped.
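
As a rough illustration of the bounded behaviour, here is a small sketch in plain Java. It does not use Beam or TikaIO: BoundedGlobMatch and matchOnce are hypothetical names, the paths are made up, and java.nio glob matching on a Unix-like file system is assumed, which may differ in detail from how Beam resolves a file pattern. The pattern is resolved once against the paths known at that moment, and a later batch is picked up only by running again against the new folder.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Beam or Tika: resolves a glob once against a fixed
// list of candidate paths, the way a bounded read fixes its inputs up front.
final class BoundedGlobMatch {

    static List<Path> matchOnce(String glob, List<Path> candidates) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<Path> matched = new ArrayList<>();
        for (Path candidate : candidates) {
            if (matcher.matches(candidate)) {
                matched.add(candidate);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Made-up paths; nothing is read from disk.
        List<Path> known = Arrays.asList(
            Paths.get("/files/a.pdf"),
            Paths.get("/files/notes.txt"),
            Paths.get("/files/archive/old.pdf"),
            Paths.get("/tmp/batch2/b.pdf"));

        // First run: the original folder.
        System.out.println(matchOnce("/files/*.pdf", known));
        // Later run: a separate invocation pointed at the folder holding the new batch.
        System.out.println(matchOnce("/tmp/batch2/*.pdf", known));
    }
}
```

With java.nio globs a single `*` does not cross directory boundaries, so the nested archive file is not matched by the first pattern.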
Thanks, Sergey
On 11/09/17 22:41, Mattmann, Chris A (3010) wrote:
Amazing work, thank you Sergey!!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
On 9/11/17, 7:33 AM, "Allison, Timothy B." <[email protected]> wrote:

     What great news!  Thank you, Sergey!!!
     -----Original Message-----
     From: Sergey Beryozkin [mailto:[email protected]]
     Sent: Monday, September 11, 2017 9:18 AM
     To: Allison, Timothy B. <[email protected]>; [email protected]
     Subject: Re: Integrating Tika with Apache Beam
     Hi Tim, All
     It took some time, but the Beam TikaIO component is finally in its 2.2.0-SNAPSHOT master,
     https://github.com/apache/beam/tree/master/sdks/java/io/tika
     I've created a basic project which can help with running it quickly:
     https://github.com/sberyozkin/beamTikaExample
     One can just build it and run as suggested in Readme.md, simply have some PDF files for example, and point to one or all of them.
     By default, Beam will output the data to /tmp/tika.
     main() can be updated to support more options, they can be collected from the command line either with TikaOptions:
     https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/main/java/org/apache/beam/sdk/io/tika/TikaOptions.java
     (all options but the "--input" are optional)
     or directly from the code, some variations are shown in the tests:
     https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java
     By default TikaReader will use an internal queue to make the SAX events available to the Beam pipeline, this is why you can see options like "queuePollTime", etc. If it's known that a given parser can really only read the whole text in a single op, then the process can be optimized with 'parseSynchronously'...
     One can also try to update main() in the example to do more interesting things than just print the data :-).
     Give it a try please if you get a chance, help make TikaIO a major part of Beam :-) with PRs, etc
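
The internal queue mentioned above can be pictured roughly as follows. This is only a sketch of the general hand-off pattern in plain Java, not the TikaIO source: QueueHandOffSketch, readAll and the end-of-document marker are hypothetical, and only the idea of a poll time comes from the "queuePollTime" option named in the message. A parser thread pushes text chunks onto a queue while the reading side polls for them with a timeout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the TikaIO source: a "parser" thread pushes text chunks
// onto a queue and the reading side polls for them with a timeout.
final class QueueHandOffSketch {

    static List<String> readAll(List<String> chunks, long queuePollTimeMillis) {
        // Optional.empty() is used as the end-of-document marker.
        BlockingQueue<Optional<String>> queue = new LinkedBlockingQueue<>();

        // Stands in for a parser emitting text as it walks the document.
        Thread parser = new Thread(() -> {
            for (String chunk : chunks) {
                queue.add(Optional.of(chunk));
            }
            queue.add(Optional.empty());
        });
        parser.start();

        List<String> received = new ArrayList<>();
        try {
            while (true) {
                Optional<String> next = queue.poll(queuePollTimeMillis, TimeUnit.MILLISECONDS);
                if (next == null) {
                    continue; // nothing arrived within the poll time; try again
                }
                if (!next.isPresent()) {
                    break; // end-of-document marker
                }
                received.add(next.get());
            }
            parser.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(readAll(Arrays.asList("Hello", "from", "Tika"), 50)); // [Hello, from, Tika]
    }
}
```

When a parser only ever hands over the whole text in one go, the queue and the second thread add nothing, which is presumably the case 'parseSynchronously' is meant to cover.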
     Thanks, Sergey
     On 25/05/17 17:47, Sergey Beryozkin wrote:
     > Hi Guys
     >
     > The link to the initial code is available in JIRA, at this stage the
     > focus is on preparing a solid initial PR, and then we can all improve
     > Tika related code :-)
     >
     > Cheers, Sergey
     > On 24/05/17 11:41, Sergey Beryozkin wrote:
     >> Hi Tim, All,
     >>
     >> I thought I'd start a dedicated thread.
     >>
     >> I added some initial comments to [1], I'm quite close now to creating
     >> the initial PR.
     >>
     >> Thanks, Sergey
     >>
     >> [1] https://issues.apache.org/jira/browse/BEAM-2328
     >> On 23/05/17 17:42, Allison, Timothy B. wrote:
     >>> Another idea...if you have any interest, it would be great to get
     >>> Apache Beam set up on our Rackspace VM (with Spark?) and use it for
     >>> our regression tests?
     >>>
     >>> -----Original Message-----
     >>> From: Sergey Beryozkin [mailto:[email protected]]
     >>> Sent: Friday, May 19, 2017 4:21 PM
     >>> To: [email protected]
     >>> Subject: Re: Extracting Text from embedded images in PDF docs
     >>>
     >>> Hi Tim
     >>>
     >>> Sure, once I get an initial PR ready I'll send an update and I'll
     >>> explain what I did for a start and we will discuss it further
     >>>
     >
     >



--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/
