Re: Integrating Tika with Apache Beam

Sergey Beryozkin Thu, 21 Sep 2017 05:57:56 -0700

Hi Tim
Thanks, will link you to the thread shortly

In general, I'd say TikaIO has probably generated more interest than some of the other Beam IOs, which is a good sign :-)


The questions at the moment:
1) What interesting things can be done with the unordered Tika-produced data?

2) Would TikaIO really help, given that users can write the custom functions themselves? (I'd say utility code always helps in some cases.)

I also believe it would be possible to somehow make all the Tika-produced data ordered in the end, but that would be the next phase...


At the moment it's those 2 issues which are the main ones...

Thanks, Sergey

P.S. I'd not like this TikaIO idea to cause any 'battles' :-). I think it would be cool if Tika were one of the native Beam IOs (it would also be big for the tooling side of things); if not, then Tika users can indeed easily do something themselves on top of Beam.

On 21/09/17 13:28, Allison, Timothy B. wrote:

Hi Sergey,

I just subscribed to Beam's dev list.  Can you forward me your latest email so 
that I can respond to the thread?  Or can you ping me via their list?  Thank 
you!

-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Thursday, September 21, 2017 5:53 AM
To: [email protected]
Subject: Re: Integrating Tika with Apache Beam

Hi Guys

TikaIO is getting some serious attention now on the Beam dev list, and
unfortunately it is not all about it being a great addition to Beam.

The team is wondering what one can do with TikaIO vs someone just doing some 
custom Beam function.

TikaIO, like any other bounded text reader, produces the data in order, but the
Beam runtime can deliver it to the rest of the pipeline in a completely
different order.

I gave one example where we used the Tika output to save it all to Lucene (with 
the file name associated) and then search for the files which contain a certain 
word.
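
To make the "order does not matter" point concrete, here is a minimal sketch in plain Java. It is not Lucene and not TikaIO: an in-memory map stands in for the index, and the Fragment class, the index() method and the sample file names are all hypothetical. It only shows that a word-to-files index built from text fragments comes out the same whichever order the fragments arrive in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class UnorderedIndexSketch {
    /** Hypothetical stand-in for one Tika-produced element: a file name plus a piece of its text. */
    static final class Fragment {
        final String fileName;
        final String text;

        Fragment(String fileName, String text) {
            this.fileName = fileName;
            this.text = text;
        }
    }

    /** Builds a word -> file names index; the result does not depend on fragment order. */
    static Map<String, Set<String>> index(List<Fragment> fragments) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Fragment f : fragments) {
            for (String word : f.text.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, k -> new TreeSet<>()).add(f.fileName);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<Fragment> fragments = new ArrayList<>(Arrays.asList(
                new Fragment("a.pdf", "Apache Tika extracts text"),
                new Fragment("b.pdf", "Apache Beam runs pipelines"),
                new Fragment("a.pdf", "and metadata")));

        Map<String, Set<String>> inOrder = index(fragments);
        Collections.shuffle(fragments); // simulate the runtime reordering the elements
        Map<String, Set<String>> shuffled = index(fragments);

        System.out.println(inOrder.equals(shuffled)); // true
        System.out.println(inOrder.get("apache"));    // [a.pdf, b.pdf]
    }
}
```

Any use case that reduces to this kind of per-file aggregation (search, word counts, metadata collection) should be unaffected by reordering, as long as each element carries its file name.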

Tim, Chris, others, if you have some interesting examples to share where it did not 
matter in which order Tika-produced data were made eventually available, then please let 
me know, or reply directly to a Beam dev thread titled "TikaIO concerns".

Note, if Beam devs decide they don't want it, then one option can be to create a
tika-integrations/beam module and experiment there. I'm not saying it will
need to be done, but it's something that may be worth considering.

Sergey
On 15/09/17 12:02, Sergey Beryozkin wrote:

Hi Chris

thanks,

At the moment TikaIO (originally named TikaReader, as it can only
read, but renamed to follow the convention) is a bounded reader,
so you can, say, ask it to read

/files/*.pdf

and it will read all the N files there, and will end the run.
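
As an illustration of what "bounded" means here, the following is a small sketch using only java.nio.file, not the TikaIO or Beam API. The snapshot() helper is hypothetical: it expands a glob such as *.pdf into the finite list of files present at the moment of the call, so files dropped into the folder afterwards are not seen by that run.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BoundedGlobSketch {
    /** Returns the regular files in dir that match the glob right now, sorted by path. */
    static List<Path> snapshot(Path dir, String glob) throws IOException {
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    matches.add(p);
                }
            }
        }
        Collections.sort(matches);
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory (an assumption for this demo).
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : snapshot(dir, "*.pdf")) {
            System.out.println(p);
        }
    }
}
```

Calling snapshot() again later, or on a different folder, gives a new finite list, which is the plain-Java analogue of starting another bounded run.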

I'm not sure yet what the best strategy is for making it an unbounded
reader that can continuously poll or be notified of new files
becoming available... There are some ideas about scheduling bounded
Beam pipelines, but I haven't looked yet...

In the short term, the simplest solution would be to create a
new instance of the TikaIO pipeline and point it to the new temp folder
where a new batch of files has been dropped.

Thanks, Sergey
On 11/09/17 22:41, Mattmann, Chris A (3010) wrote:

Amazing work, thank you Sergey!!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development
Offices (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Director, Information Retrieval and Data Science Group (IRDS) Adjunct
Associate Professor, Computer Science Department University of
Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


On 9/11/17, 7:33 AM, "Allison, Timothy B." <[email protected]> wrote:

      What great news!  Thank you, Sergey!!!
      -----Original Message-----
      From: Sergey Beryozkin [mailto:[email protected]]
      Sent: Monday, September 11, 2017 9:18 AM
      To: Allison, Timothy B. <[email protected]>;
[email protected]
      Subject: Re: Integrating Tika with Apache Beam
      Hi Tim, All
      It took some time, but the Beam TikaIO component is finally in
      its 2.2.0-SNAPSHOT master,
      https://github.com/apache/beam/tree/master/sdks/java/io/tika
      I've created a basic project which can help with running it quickly:
      https://github.com/sberyozkin/beamTikaExample
      One can just build it and run it as suggested in Readme.md; simply
      have some files (PDFs, for example) and point to one or all of them.
      By default, Beam will output the data to /tmp/tika.
      main() can be updated to support more options; they can be
      collected either from the command line with TikaOptions:

https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/main/java/org/apache/beam/sdk/io/tika/TikaOptions.java

      (all options but the "--input" are optional)
      or directly from the code, some variations are shown in the tests:

https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java

      By default, TikaReader uses an internal queue to make the SAX
      events available to the Beam pipeline, which is why you see
      options like "queuePollTime". If it's known that a given parser
      can only deliver the whole text in a single operation, then the
      process can be optimized with 'parseSynchronously'...
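
The queue hand-off described above can be sketched in plain Java as follows. This is not the TikaIO implementation: the class, the method names and the sentinel are hypothetical, a list of strings stands in for the SAX character events, and pollMillis only plays a role analogous to "queuePollTime". The second method shows the synchronous shortcut, where no thread or queue is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class QueueHandoffSketch {
    // Sentinel marking the end of the document; compared by identity, never by value.
    private static final String END = new String("<end-of-document>");

    /** Runs a stand-in "parser" on its own thread and drains its chunks through a queue. */
    static List<String> readAll(List<String> chunks, long pollMillis) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread parser = new Thread(() -> {
            for (String chunk : chunks) {
                queue.add(chunk); // plays the role of a SAX characters() callback
            }
            queue.add(END);
        });
        parser.start();

        List<String> out = new ArrayList<>();
        while (true) {
            String next = queue.poll(pollMillis, TimeUnit.MILLISECONDS);
            if (next == null) {
                continue; // nothing available yet, poll again
            }
            if (next == END) {
                break;
            }
            out.add(next);
        }
        parser.join();
        return out;
    }

    /** Synchronous variant: the whole text is available at once, so no queue is involved. */
    static List<String> readAllSynchronously(List<String> chunks) {
        return new ArrayList<>(chunks);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> chunks = Arrays.asList("first paragraph", "second paragraph");
        System.out.println(readAll(chunks, 50));
        System.out.println(readAllSynchronously(chunks));
    }
}
```

Within a single document the queue preserves the order in which the parser emitted the chunks; the reordering discussed elsewhere in this thread happens later, in the pipeline.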
      One can also try to update main() in the example to do more
      interesting things than just print the data :-).
      Give it a try please if you get a chance, and help make TikaIO a
      major part of Beam :-) with PRs, etc.
      Thanks, Sergey
      On 25/05/17 17:47, Sergey Beryozkin wrote:
      > Hi Guys
      >
      > The link to the initial code is available in JIRA, at this stage the
      > focus is on preparing a solid initial PR, and then we can all improve
      > Tika related code :-)
      >
      > Cheers, Sergey
      > On 24/05/17 11:41, Sergey Beryozkin wrote:
      >> Hi Tim, All,
      >>
      >> I thought I'd start a dedicated thread.
      >>
      >> I added some initial comments to [1], I'm quite close now to creating
      >> the initial PR.
      >>
      >> Thanks, Sergey
      >>
      >> [1] https://issues.apache.org/jira/browse/BEAM-2328
      >> On 23/05/17 17:42, Allison, Timothy B. wrote:
      >>> Another idea...if you have any interest, it would be great to get
      >>> Apache Beam set up on our Rackspace VM (with Spark?) and use it for
      >>> our regression tests?
      >>>
      >>> -----Original Message-----
      >>> From: Sergey Beryozkin [mailto:[email protected]]
      >>> Sent: Friday, May 19, 2017 4:21 PM
      >>> To: [email protected]
      >>> Subject: Re: Extracting Text from embedded images in PDF docs
      >>>
      >>> Hi Tim
      >>>
      >>> Sure, once I get an initial PR ready I'll send an update and I'll
      >>> explain what I did for a start and we will discuss it further
      >>>
      >
      >



--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/
