Re: TikaIO concerns

Robert Bradshaw Wed, 20 Sep 2017 14:48:07 -0700

On Wed, Sep 20, 2017 at 2:17 PM, Sergey Beryozkin <[email protected]> wrote:
> Hi,
>
> thanks for the explanations,
>
> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>
>> Hi!
>>
>> TextIO returns an unordered soup of lines contained in all files you ask
>> it
>> to read. People usually use TextIO for reading files where 1 line
>> corresponds to 1 independent data element, e.g. a log entry, or a row of a
>> CSV file - so discarding order is ok.
>
> Just a side note, I'd probably want that be ordered, though I guess it
> depends...
>>
>> However, there is a number of cases where TextIO is a poor fit:
>> - Cases where discarding order is not ok - e.g. if you're doing natural
>> language processing and the text files contain actual prose, where you
>> need
>> to process a file as a whole. TextIO can't do that.
>> - Cases where you need to remember which file each element came from, e.g.
>> if you're creating a search index for the files: TextIO can't do this
>> either.
>>
>> Both of these issues have been raised in the past against TextIO; however
>> it seems that the overwhelming majority of users of TextIO use it for logs
>> or CSV files or alike, so solving these issues has not been a priority.
>> Currently they are solved in a general form via FileIO.read() which gives
>> you access to reading a full file yourself - people who want more
>> flexibility will be able to use standard Java text-parsing utilities on a
>> ReadableFile, without involving TextIO.
>>
>> Same applies for XmlIO: it is specifically designed for the narrow use
>> case
>> where the files contain independent data entries, so returning an
>> unordered
>> soup of them, with no association to the original file, is the user's
>> intention. XmlIO will not work for processing more complex XML files that
>> are not simply a sequence of entries with the same tag, and it also does
>> not remember the original filename.
>>
>
> OK...
>
>> However, if my understanding of Tika use cases is correct, it is mainly
>> used for extracting content from complex file formats - for example,
>> extracting text and images from PDF files or Word documents. I believe
>> this
>> is the main difference between it and TextIO - people usually use Tika for
>> complex use cases where the "unordered soup of stuff" abstraction is not
>> useful.
>>
>> My suspicion about this is confirmed by the fact that the crux of the Tika
>> API is ContentHandler
>>
>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true
>> whose
>> documentation says "The order of events in this interface is very
>> important, and mirrors the order of information in the document itself."
>
> All that says is that a (Tika) ContentHandler will be a true SAX
> ContentHandler...
>>
>>
>> Let me give a few examples of what I think is possible with the raw Tika
>> API, but I think is not currently possible with TikaIO - please correct me
>> where I'm wrong, because I'm not particularly familiar with Tika and am
>> judging just based on what I read about it.
>> - User has 100,000 Word documents and wants to convert each of them to
>> text
>> files for future natural language processing.
>> - User has 100,000 PDF files with financial statements, each containing a
>> bunch of unrelated text and - the main content - a list of transactions in
>> PDF tables. User wants to extract each transaction as a PCollection
>> element, discarding the unrelated text.
>> - User has 100,000 PDF files with scientific papers, and wants to extract
>> text from them, somehow parse author and affiliation from the text, and
>> compute statistics of topics and terminology usage by author name and
>> affiliation.
>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>> observing a location over time: they want to extract metadata from each
>> image using Tika, analyze the images themselves using some other library,
>> and detect anomalies in the overall appearance of the location over time
>> as
>> seen from multiple cameras.
>> I believe all of these cases can not be solved with TikaIO because the
>> resulting PCollection<String> contains no information about which String
>> comes from which document and about the order in which they appear in the
>> document.
>
> These are good use cases, thanks... I thought what you were talking about
> the unordered soup of data produced by TikaIO (and its friends TextIO and
> alike :-)).
> Putting the ordered vs unordered question aside for a sec, why exactly a
> Tika Reader can not make the name of the file it's currently reading from
> available to the pipeline, as some Beam pipeline metadata piece ?
> Surely it can be possible with Beam ? If not then I would be surprised...
>
>>
>> I am, honestly, struggling to think of a case where I would want to use
>> Tika, but where I *would* be ok with getting an unordered soup of strings.
>> So some examples would be very helpful.
>>
> Yes. I'll ask Tika developers to help with some examples, but I'll give one
> example where it did not matter to us in what order Tika-produced data were
> available to the downstream layer.
>
> It's a demo the Apache CXF colleague of mine showed at one of Apache Con
> NAs, and we had a happy audience:
>
> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search
>
> PDF or ODT files uploaded, Tika parses them, and all of that is put into
> Lucene. We associate a file name with the indexed content and then let users
> find a list of PDF files which contain a given word or few words, details
> are here
> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131
>
> I'd say even more involved search engines would not mind supporting a case
> like that :-)
>
> Now there we process one file at a time, and I understand now that with
> TikaIO and N files it's all over the place really as far as the ordering is
> concerned, which file it's coming from. etc. That's why TikaReader must be
> able to associate the file name with a given piece of text it's making
> available to the pipeline.
>
> I'd be happy to support the ParDo way of linking Tika with Beam.
> If it makes things simpler then it would be good, I've just no idea at the
> moment how to start the pipeline without using a Source/Reader,
> but I'll learn :-).


This would be the (as yet unreleased) FileIO.readMatches and friends:

https://github.com/apache/beam/blob/6d4a78517708db3bd89cfeff5a7e62fb6b948e1d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L88

> Re the sync issue I mentioned earlier - how can one
> avoid it with ParDo when implementing a 'min len chunk' feature, where the
> ParDo would have to concatenate several SAX data pieces first before making
> a single composite piece to the pipeline ?
>
>
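
As far as I understand it, there need not be a sync issue here: SAX
callbacks run on whichever thread calls parse(), so a handler can buffer
small text events and emit a chunk once it is long enough, with no queue
or second thread involved. A rough sketch using only the JDK's SAX classes
(MinLenChunkHandler and chunks() are names made up for illustration; in a
DoFn the Consumer would presumably wrap the output call, and Tika rather
than the JDK parser would drive the handler):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: buffers SAX text events until at least minLen chars are available. */
class MinLenChunkHandler extends DefaultHandler {
    private final int minLen;
    private final Consumer<String> out; // assumption: in a DoFn this would wrap the output call
    private final StringBuilder buf = new StringBuilder();

    MinLenChunkHandler(int minLen, Consumer<String> out) {
        this.minLen = minLen;
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Called synchronously by the parser, on the thread that invoked parse().
        buf.append(ch, start, length);
        if (buf.length() >= minLen) {
            flush();
        }
    }

    @Override
    public void endDocument() {
        flush(); // emit the (possibly shorter) tail of the document
    }

    private void flush() {
        if (buf.length() > 0) {
            out.accept(buf.toString());
            buf.setLength(0);
        }
    }

    /** Parses an XML string with the JDK SAX parser; returns the chunks in document order. */
    static List<String> chunks(String xml, int minLen) {
        List<String> result = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new MinLenChunkHandler(minLen, result::add));
        } catch (Exception e) {
            // Parse failures surface on the calling thread instead of being hidden.
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(chunks("<d><p>ab</p><p>cd</p><p>efghi</p></d>", 4));
    }
}
```

Every chunk except possibly the last is at least minLen characters long,
and the chunks concatenate back to the original text in document order.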
>> Another way to state it: currently, if I wanted to solve all of the use
>> cases above, I'd just use FileIO.readMatches() and use the Tika API myself
>> on the resulting ReadableFile. How can we make TikaIO provide a usability
>> improvement over such usage?

+1, this was exactly the same question I had.

> If you are actually asking, does it really make sense for Beam to ship
> Tika related code, given that users can just do it themselves, I'm not sure.
>
> IMHO it always works better if users have to provide just few config options
> to an integral part of the framework and see things happening.
> It will bring more users.
>
> Whether the current Tika code (refactored or not) stays with Beam or not -
> I'll let you and the team decide; believe it or not I was seriously
> contemplating at the last moment to make it all part of the Tika project
> itself and have a bit more flexibility over there with tweaking things, but
> now that it is in the Beam snapshot - I don't know - it's not my decision...

It is always an interesting question, when one has two libraries X and
Y plus some utility code that makes X work well with Y, where this
utility code should live. If this can be expressed primarily as X
calling a function that uses Y (in this particular example, Tika being
invoked in the body of a DoFn), there might not even be much such
library code (short of examples and documentation, which can go a long
way here). On the other hand, in some cases there are advantages to
having a hybrid XY component that interleaves or otherwise joins
together the libraries in common or non-trivial ways; it's worth
exploring whether that's the case here.

>> I am confused by your other comment - "Does the ordering matter ?  Perhaps
>> for some cases it does, and for some it does not. May be it makes sense to
>> support running TikaIO as both the bounded reader/source and ParDo, with
>> getting the common code reused." - because using BoundedReader or ParDo is
>> not related to the ordering issue, only to the issue of asynchronous
>> reading and complexity of implementation. The resulting PCollection will
>> be
>> unordered either way - this needs to be solved separately by providing a
>> different API.
>
> Right I see now, so ParDo is not about making Tika reported data available
> to the downstream pipeline components ordered, only about the simpler
> implementation.
> Association with the file should be possible I hope, but I understand it
> would be possible to optionally make the data coming out in the ordered way
> as well...
>
> Assuming TikaIO stays, and before trying to re-implement as ParDo, let me
> double check: should we still give some thought to the possible performance
> benefit of the current approach ? As I said, I can easily get rid of all
> that polling code, use a simple Blocking queue.

It's also a model and API question. For example, as mentioned above,
if it makes sense to invoke Tika entirely within the body of a DoFn
(where the input is a filename, and the output is interesting
data/chunks/whatever) to achieve the desired results, then one
doesn't need to worry about plumbing all the (likely evolving)
configuration and other options from some Beam API through to
whatever interacts with the Tika objects. This helps with tooling,
documentation, user support, etc., as well as simply being more modular
and leaving less code to write and maintain.
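
To illustrate the shape I have in mind, here is a plain-Java sketch (no
Beam or Tika classes; ParsedDocument, PerFileParsing and the parser
Function are all hypothetical stand-ins, the Function playing the role of
"call Tika on this file"). Each output element carries its file name and
its chunks in document order, so nothing is lost if the collection of
elements itself is unordered:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical output element: one per input file, chunks kept in document order. */
final class ParsedDocument {
    final String fileName;
    final List<String> chunks;

    ParsedDocument(String fileName, List<String> chunks) {
        this.fileName = fileName;
        this.chunks = Collections.unmodifiableList(new ArrayList<>(chunks));
    }
}

final class PerFileParsing {
    /**
     * Applies the parser to each file name independently, the way a DoFn body
     * would be applied to each input element. The parser is an assumption: any
     * function from a file name to that file's ordered text chunks.
     */
    static List<ParsedDocument> parseAll(
            List<String> fileNames, Function<String, List<String>> parser) {
        List<ParsedDocument> result = new ArrayList<>();
        for (String name : fileNames) {
            result.add(new ParsedDocument(name, parser.apply(name)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Fake parser standing in for a real per-file parse.
        Function<String, List<String>> fake =
            name -> Arrays.asList(name + "#0", name + "#1");
        for (ParsedDocument d : parseAll(Arrays.asList("a.pdf", "b.doc"), fake)) {
            System.out.println(d.fileName + " -> " + d.chunks);
        }
    }
}
```

All parser configuration stays inside the function passed in, which is
the "less plumbing" point above.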

>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin <[email protected]>
>> wrote:
>>
>>> Hi
>>>
>>> Glad TikaIO getting some serious attention :-), I believe one thing we
>>> both agree upon is that Tika can help Beam in its own unique way.
>>>
>>> Before trying to reply online, I'd like to state that my main assumption
>>> is that TikaIO (as far as the read side is concerned) is no different to
>>> Text, XML or similar bounded reader components.
>>>
>>> I have to admit I don't understand your questions about TikaIO usecases.
>>>
>>> What are the Text Input or XML input use-cases ? These use cases are
>>> TikaInput cases as well, the only difference is Tika can not split the
>>> individual file into a sequence of sources/etc,
>>>
>>> TextIO can read from the plain text files (possibly zipped), XML -
>>> optimized around reading from the XML files, and I thought I made it
>>> clear (and it is a known fact anyway) Tika was about reading basically
>>> from any file format.
>>>
>>> Where is the difference (apart from what I've already mentioned) ?
>>>
>>> Sergey
>>>
>>>
>>>
>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>
>>>> Hi,
>>>>
>>>> Replies inline.
>>>>
>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi All
>>>>>
>>>>> This is my first post to the dev list, I work for Talend, I'm a Beam
>>>>> novice, Apache Tika fan, and thought it would be really great to try
>>>>> and
>>>>> link both projects together, which led me to opening [1] where I typed
>>>>> some early thoughts, followed by PR [2].
>>>>>
>>>>> I noticed yesterday I had the robust :-) (but useful and helpful) newer
>>>>> review comments from Eugene pending, so I'd like to summarize a bit why
>>>>> I did TikaIO (reader) the way I did, and then decide, based on the
>>>>> feedback from the experts, what to do next.
>>>>>
>>>>> Apache Tika Parsers report the text content in chunks, via SaxParser
>>>>> events. It's not possible with Tika to take a file and read it bit by
>>>>> bit at the 'initiative' of the Beam Reader, line by line, the only way
>>>>> is to handle the SAXParser callbacks which report the data chunks. Some
>>>>> parsers may report the complete lines, some individual words, with some
>>>>> being able to report the data only after they completely parse the
>>>>> document.
>>>>> All depends on the data format.
>>>>>
>>>>> At the moment TikaIO's TikaReader does not use the Beam threads to
>>>>> parse
>>>>> the files, Beam threads will only collect the data from the internal
>>>>> queue where the internal TikaReader's thread will put the data into
>>>>> (note the data chunks are ordered even though the tests might suggest
>>>>> otherwise).
>>>>>
>>>> I agree that your implementation of reader returns records in order -
>>>> but Beam PCollections are not ordered. Nothing in Beam cares about the
>>>> order in which records are produced by a BoundedReader - the order
>>>> produced by your reader is ignored, and when applying any transforms to
>>>> the PCollection produced by TikaIO, it is impossible to recover the order
>>>> in which your reader returned the records.
>>>>
>>>> With that in mind, is PCollection<String>, containing individual
>>>> Tika-detected items, still the right API for representing the result of
>>>> parsing a large number of documents with Tika?
>>>>
>>>>
>>>>>
>>>>> The reason I did it was because I thought
>>>>>
>>>>> 1) it would make the individual data chunks available faster to the
>>>>> pipeline - the parser will continue working via the binary/video etc
>>>>> file while the data will already start flowing - I agree there should
>>>>> be
>>>>> some tests data available confirming it - but I'm positive at the
>>>>> moment
>>>>> this approach might yield some performance gains with the large sets.
>>>>> If
>>>>> the file is large, if it has the embedded attachments/videos to deal
>>>>> with, then it may be more effective not to get the Beam thread deal
>>>>> with
>>>>> it...
>>>>>
>>>> As I said on the PR, this description contains unfounded and potentially
>>>> incorrect assumptions about how Beam runners execute (or may execute in
>>>> the future) a ParDo or a BoundedReader. For example, if I understand
>>>> correctly, you might be assuming that:
>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to complete
>>>> before processing its outputs with downstream transforms
>>>> - Beam runners can not run a @ProcessElement call of a ParDo *concurrently*
>>>> with downstream processing of its results
>>>> - Passing an element from one thread to another using a BlockingQueue is
>>>> free in terms of performance
>>>> All of these are false at least in some runners, and I'm almost certain
>>>> that in reality, performance of this approach is worse than a ParDo in
>>>> most production runners.
>>>>
>>>> There are other disadvantages to this approach:
>>>> - Doing the bulk of the processing in a separate thread makes it
>>>> invisible to Beam's instrumentation. If a Beam runner provided
>>>> per-transform profiling capabilities, or the ability to get the current
>>>> stack trace for stuck elements, this approach would make the real
>>>> processing invisible to all of these capabilities, and a user would only
>>>> see that the bulk of the time is spent waiting for the next element, but
>>>> not *why* the next element is taking long to compute.
>>>> - Likewise, offloading all the CPU and IO to a separate thread, invisible
>>>> to Beam, will make it harder for runners to do autoscaling, binpacking
>>>> and other resource management magic (how much of this runners actually do
>>>> is a separate issue), because the runner will have no way of knowing how
>>>> much CPU/IO this particular transform is actually using - all the
>>>> processing happens in a thread about which the runner is unaware.
>>>> - As far as I can tell, the code also hides exceptions that happen in the
>>>> Tika thread
>>>> - Adding the thread management makes the code much more complex, easier to
>>>> introduce bugs, and harder for others to contribute
>>>>
>>>>
>>>>> 2) As I commented at the end of [2], having an option to concatenate
>>>>> the
>>>>> data chunks first before making them available to the pipeline is
>>>>> useful, and I guess doing the same in ParDo would introduce some
>>>>> synchronization issues (though not exactly sure yet)
>>>>>
>>>> What are these issues?
>>>>
>>>>
>>>>>
>>>>> One of valid concerns there is that the reader is polling the internal
>>>>> queue so, in theory at least, and perhaps in some rare cases too, we
>>>>> may
>>>>> have a case where the max polling time has been reached, the parser is
>>>>> still busy, and TikaIO fails to report all the file data. I think that
>>>>> it can be solved by either 2a) configuring the max polling time to a
>>>>> very large number which will never be reached for a practical case, or
>>>>> 2b) simply use a blocking queue without the time limits - in the worst
>>>>> case, if TikaParser spins and fails to report the end of the document,
>>>>> then, Beam can heal itself if the pipeline blocks.
>>>>> I propose to follow 2b).
>>>>>
>>>> I agree that there should be no way to unintentionally configure the
>>>> transform in a way that will produce silent data loss. Another reason for
>>>> not having these tuning knobs is that it goes against Beam's "no knobs"
>>>> philosophy, and that in most cases users have no way of figuring out a
>>>> good value for tuning knobs except for manual experimentation, which is
>>>> extremely brittle and typically gets immediately obsoleted by running on
>>>> a new dataset or updating a version of some of the involved dependencies
>>>> etc.
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> Please let me know what you think.
>>>>> My plan so far is:
>>>>> 1) start addressing most of Eugene's comments which would require some
>>>>> minor TikaIO updates
>>>>> 2) work on removing the TikaSource internal code dealing with File
>>>>> patterns which I copied from TextIO at the next stage
>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam users
>>>>> some
>>>>> time to try it with some real complex files and also decide if TikaIO
>>>>> can continue implemented as a BoundedSource/Reader or not
>>>>>
>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>
>>>> Yes, but I think we should start by discussing the anticipated use cases
>>>> of TikaIO and designing an API for it based on those use cases; and then
>>>> see what's the best implementation for that particular API and set of
>>>> anticipated use cases.
>>>>
>>>>
>>>>>
>>>>> Thanks, Sergey
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>
>>>>
>>>
>>
>
