Re: TikaIO concerns

Sergey Beryozkin Thu, 21 Sep 2017 03:16:14 -0700

Thanks for the comments,

On 20/09/17 22:46, Robert Bradshaw wrote:

On Wed, Sep 20, 2017 at 2:17 PM, Sergey Beryozkin <[email protected]> wrote:

Hi,


thanks for the explanations,

On 20/09/17 16:41, Eugene Kirpichov wrote:


Hi!

TextIO returns an unordered soup of lines contained in all files you ask it
to read. People usually use TextIO for reading files where 1 line
corresponds to 1 independent data element, e.g. a log entry, or a row of a
CSV file - so discarding order is ok.


Just a side note, I'd probably want that to be ordered, though I guess it
depends...


However, there are a number of cases where TextIO is a poor fit:
- Cases where discarding order is not ok - e.g. if you're doing natural
language processing and the text files contain actual prose, where you need
to process a file as a whole. TextIO can't do that.
- Cases where you need to remember which file each element came from, e.g.
if you're creating a search index for the files: TextIO can't do this
either.

Both of these issues have been raised in the past against TextIO; however
it seems that the overwhelming majority of users of TextIO use it for logs
or CSV files or alike, so solving these issues has not been a priority.
Currently they are solved in a general form via FileIO.read() which gives
you access to reading a full file yourself - people who want more
flexibility will be able to use standard Java text-parsing utilities on a
ReadableFile, without involving TextIO.

Same applies for XmlIO: it is specifically designed for the narrow use case
where the files contain independent data entries, so returning an unordered
soup of them, with no association to the original file, is the user's
intention. XmlIO will not work for processing more complex XML files that
are not simply a sequence of entries with the same tag, and it also does
not remember the original filename.


OK...

However, if my understanding of Tika use cases is correct, it is mainly
used for extracting content from complex file formats - for example,
extracting text and images from PDF files or Word documents. I believe this
is the main difference between it and TextIO - people usually use Tika for
complex use cases where the "unordered soup of stuff" abstraction is not
useful.

My suspicion about this is confirmed by the fact that the crux of the Tika
API is ContentHandler
http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true
whose documentation says "The order of events in this interface is very
important, and mirrors the order of information in the document itself."


All that says is that a (Tika) ContentHandler will be a true SAX
ContentHandler...



Let me give a few examples of what I think is possible with the raw Tika
API, but I think is not currently possible with TikaIO - please correct me
where I'm wrong, because I'm not particularly familiar with Tika and am
judging just based on what I read about it.
- User has 100,000 Word documents and wants to convert each of them to text
files for future natural language processing.
- User has 100,000 PDF files with financial statements, each containing a
bunch of unrelated text and - the main content - a list of transactions in
PDF tables. User wants to extract each transaction as a PCollection
element, discarding the unrelated text.
- User has 100,000 PDF files with scientific papers, and wants to extract
text from them, somehow parse author and affiliation from the text, and
compute statistics of topics and terminology usage by author name and
affiliation.
- User has 100,000 photos in JPEG made by a set of automatic cameras
observing a location over time: they want to extract metadata from each
image using Tika, analyze the images themselves using some other library,
and detect anomalies in the overall appearance of the location over time as
seen from multiple cameras.
I believe all of these cases cannot be solved with TikaIO because the
resulting PCollection<String> contains no information about which String
comes from which document and about the order in which they appear in the
document.


These are good use cases, thanks... I thought you were talking about
the unordered soup of data produced by TikaIO (and its friends TextIO and
alike :-)).
Putting the ordered vs unordered question aside for a sec, why exactly can
a Tika Reader not make the name of the file it's currently reading from
available to the pipeline, as some piece of Beam pipeline metadata?
Surely that is possible with Beam? If not then I would be surprised...


I am, honestly, struggling to think of a case where I would want to use
Tika, but where I *would* be ok with getting an unordered soup of strings.
So some examples would be very helpful.

Yes. I'll ask Tika developers to help with some examples, but I'll give one
example where it did not matter to us in what order Tika-produced data were
available to the downstream layer.

It's a demo an Apache CXF colleague of mine showed at one of the ApacheCon
NAs, and we had a happy audience:

https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search

PDF or ODT files are uploaded, Tika parses them, and all of that is put into
Lucene. We associate a file name with the indexed content and then let users
find a list of PDF files which contain a given word or a few words; details
are here
https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131

I'd say even more involved search engines would not mind supporting a case
like that :-)

Now there we process one file at a time, and I understand now that with
TikaIO and N files it's all over the place really, as far as the ordering
and the file each piece is coming from are concerned. That's why TikaReader
must be able to associate the file name with a given piece of text it's
making available to the pipeline.
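
To illustrate the point about associating the file name with each piece of
text, here is a minimal sketch using only JDK classes. It is not TikaIO or
Beam code: TaggedChunk, its index field and reassemble are hypothetical names
introduced for illustration. It assumes each piece carries its file name and
its position within the document, so a consumer can regroup and reorder the
pieces even when they arrive in arbitrary order:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical element type: a text chunk tagged with its source file and position. */
final class TaggedChunk {
    final String fileName;
    final int index;   // position of the chunk within its document
    final String text;

    TaggedChunk(String fileName, int index, String text) {
        this.fileName = fileName;
        this.index = index;
        this.text = text;
    }
}

final class TaggedChunkDemo {
    /** Groups chunks by file name and restores per-document order from the index. */
    static Map<String, String> reassemble(List<TaggedChunk> chunks) {
        Map<String, List<TaggedChunk>> byFile = new TreeMap<>();
        for (TaggedChunk c : chunks) {
            byFile.computeIfAbsent(c.fileName, k -> new ArrayList<>()).add(c);
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<TaggedChunk>> e : byFile.entrySet()) {
            List<TaggedChunk> perFile = e.getValue();
            perFile.sort(Comparator.comparingInt(c -> c.index));
            StringBuilder sb = new StringBuilder();
            for (TaggedChunk c : perFile) {
                sb.append(c.text);
            }
            result.put(e.getKey(), sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<TaggedChunk> chunks = new ArrayList<>();
        chunks.add(new TaggedChunk("a.pdf", 0, "Hello, "));
        chunks.add(new TaggedChunk("a.pdf", 1, "world"));
        chunks.add(new TaggedChunk("b.odt", 0, "Other "));
        chunks.add(new TaggedChunk("b.odt", 1, "document"));
        Collections.shuffle(chunks); // simulate an unordered collection
        System.out.println(reassemble(chunks));
    }
}
```

The design choice here is that the order and the file association travel
inside each element rather than relying on the order in which a reader
happens to emit them.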

I'd be happy to support the ParDo way of linking Tika with Beam.
If it makes things simpler then it would be good, I've just no idea at the
moment how to start the pipeline without using a Source/Reader,
but I'll learn :-).


This would be the (as yet unreleased) FileIO.readMatches and friends:

https://github.com/apache/beam/blob/6d4a78517708db3bd89cfeff5a7e62fb6b948e1d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L88


OK, thanks;

Re the sync issue I mentioned earlier - how can one
avoid it with ParDo when implementing a 'min len chunk' feature, where the
ParDo would have to concatenate several SAX data pieces first before making
a single composite piece available to the pipeline?
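
One possible shape for such a 'min len chunk' feature, as a sketch only: if
the parse runs synchronously on the calling thread (an assumption; the JDK SAX
parser used below behaves this way), the handler can buffer the SAX character
events in a plain StringBuilder without any locking. MinLengthChunkHandler and
chunk are hypothetical names, and the JDK SAX parser stands in for a Tika
parser here:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/** Buffers SAX character events and emits them once minLength characters have accumulated. */
final class MinLengthChunkHandler extends DefaultHandler {
    private final int minLength;
    private final Consumer<String> output;
    private final StringBuilder buffer = new StringBuilder();

    MinLengthChunkHandler(int minLength, Consumer<String> output) {
        this.minLength = minLength;
        this.output = output;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        if (buffer.length() >= minLength) {
            flush();
        }
    }

    @Override
    public void endDocument() {
        if (buffer.length() > 0) {
            flush(); // emit whatever is left, even if shorter than minLength
        }
    }

    private void flush() {
        output.accept(buffer.toString());
        buffer.setLength(0);
    }

    /** Parses the XML on the calling thread and returns the composite chunks in document order. */
    static List<String> chunk(String xml, int minLength) {
        List<String> chunks = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new MinLengthChunkHandler(minLength, chunks::add));
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("<doc><p>ab</p><p>cd</p><p>efgh</p><p>i</p></doc>", 4));
    }
}
```

In a DoFn, the Consumer would presumably forward each composite piece to the
output of the current element; that wiring is left out because it depends on
the API under discussion.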

Another way to state it: currently, if I wanted to solve all of the use
cases above, I'd just use FileIO.readMatches() and use the Tika API myself
on the resulting ReadableFile. How can we make TikaIO provide a usability
improvement over such usage?


+1, this was exactly the same question I had.

TikaIO PR was more than 3 months old by the time it got merged. I'm
pretty sure in one of my comments in JIRA I mentioned I'd welcome
feedback from all of the team.

I realize that one can just start a pipeline with a soon to be released
FileIO and do something very specific with some files in the functions.
Jumping a bit ahead, but IMHO it's still useful to have utility
support for working with Tika. In my own work I see users adopting a
certain feature much much faster if there's utility support, even
though in our project we have all the support for people writing their
own custom features...

If you are actually asking, does it really make sense for Beam to ship
Tika related code, given that users can just do it themselves, I'm not sure.

IMHO it always works better if users have to provide just a few config
options to an integral part of the framework and see things happening.
It will bring more users.

Whether the current Tika code (refactored or not) stays with Beam or not -
I'll let you and the team decide; believe it or not I was seriously
contemplating at the last moment making it all part of the Tika project
itself and having a bit more flexibility over there with tweaking things,
but now that it is in the Beam snapshot - I don't know - it's not my
decision...


It is always an interesting question when one has two libraries X and
Y, plus some utility code that makes X work well with Y, where this
utility code should live. If this can be expressed primarily as X
which calls function using Y (in this particular example, Tika being
invoked in the body of a DoFn) there might not even be much such
library code (short of examples and documentation which can go a long
way here). On the other hand, in some cases there are advantages to
having a hybrid XY component that interleaves or otherwise joins
together the libraries in common or non-trivial ways--worth exploring
if that's the case here.

+1

I am confused by your other comment - "Does the ordering matter ?  Perhaps
for some cases it does, and for some it does not. May be it makes sense to
support running TikaIO as both the bounded reader/source and ParDo, with
getting the common code reused." - because using BoundedReader or ParDo is
not related to the ordering issue, only to the issue of asynchronous
reading and complexity of implementation. The resulting PCollection will be
unordered either way - this needs to be solved separately by providing a
different API.


Right I see now, so ParDo is not about making Tika reported data available
to the downstream pipeline components ordered, only about the simpler
implementation.
Association with the file should be possible I hope, but I understand it
would be possible to optionally make the data coming out in the ordered way
as well...

Assuming TikaIO stays, and before trying to re-implement as ParDo, let me
double check: should we still give some thought to the possible performance
benefit of the current approach? As I said, I can easily get rid of all
that polling code and use a simple blocking queue.


It's also a model and API question. For example, as mentioned above,
if it makes sense to invoke Tika entirely within the body of a DoFn
(where the input is a filename, and the output is interesting
data/chunks/whatever) to achieve the desired results this means one
doesn't need to worry about plumbing all the (likely evolving)
configuration and other options through from some Beam API through to
whatever interacts with the Tika objects. This helps with tooling,
documentation, user support, etc. as well as simply being more modular
and there being less code to write and maintain.

Well, as far as Tika is concerned, the way it can be configured is not
going to change, I can't think of the reason why.
Speaking about the tooling: IMHO it will be easier for the teams
considering wiring Tika with Beam to have a Beam TikaIO component.

The custom approach won't really make it into the tooling...

Thanks, Sergey

On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin <[email protected]>
wrote:

Hi

Glad TikaIO is getting some serious attention :-), I believe one thing we
both agree upon is that Tika can help Beam in its own unique way.

Before trying to reply inline, I'd like to state that my main assumption
is that TikaIO (as far as the read side is concerned) is no different to
Text, XML or similar bounded reader components.

I have to admit I don't understand your questions about TikaIO use cases.

What are the Text Input or XML input use cases? These use cases are
TikaInput cases as well; the only difference is that Tika cannot split the
individual file into a sequence of sources etc.

TextIO can read from plain text files (possibly zipped), XmlIO is
optimized around reading from XML files, and I thought I made it
clear (and it is a known fact anyway) that Tika is about reading from
basically any file format.

Where is the difference (apart from what I've already mentioned) ?

Sergey



On 19/09/17 23:29, Eugene Kirpichov wrote:


Hi,

Replies inline.

On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin <[email protected]>
wrote:

Hi All

This is my first post to the dev list, I work for Talend, I'm a Beam
novice, Apache Tika fan, and thought it would be really great to try and
link both projects together, which led me to opening [1] where I typed
some early thoughts, followed by PR [2].

I noticed yesterday I had the robust :-) (but useful and helpful) newer
review comments from Eugene pending, so I'd like to summarize a bit why
I did TikaIO (reader) the way I did, and then decide, based on the
feedback from the experts, what to do next.

Apache Tika Parsers report the text content in chunks, via SAX parser
events. It's not possible with Tika to take a file and read it bit by
bit at the 'initiative' of the Beam Reader, line by line; the only way
is to handle the SAX parser callbacks which report the data chunks. Some
parsers may report complete lines, some individual words, with some
being able to report the data only after they completely parse the
document. It all depends on the data format.

At the moment TikaIO's TikaReader does not use the Beam threads to parse
the files; Beam threads will only collect the data from the internal
queue into which the internal TikaReader's thread puts the data
(note the data chunks are ordered even though the tests might suggest
otherwise).

I agree that your implementation of reader returns records in order - but
Beam PCollections are not ordered. Nothing in Beam cares about the order
in which records are produced by a BoundedReader - the order produced by
your reader is ignored, and when applying any transforms to the PCollection
produced by TikaIO, it is impossible to recover the order in which your
reader returned the records.

With that in mind, is PCollection<String>, containing individual
Tika-detected items, still the right API for representing the result of
parsing a large number of documents with Tika?


The reason I did it was because I thought

1) it would make the individual data chunks available faster to the
pipeline - the parser will continue working through the binary/video etc
file while the data will already start flowing - I agree there should be
some test data available confirming it - but I'm positive at the moment
this approach might yield some performance gains with the large sets. If
the file is large, if it has the embedded attachments/videos to deal
with, then it may be more effective not to have the Beam thread deal with
it...

As I said on the PR, this description contains unfounded and potentially
incorrect assumptions about how Beam runners execute (or may execute in the
future) a ParDo or a BoundedReader. For example, if I understand correctly,
you might be assuming that:
- Beam runners wait for a full @ProcessElement call of a ParDo to complete
before processing its outputs with downstream transforms
- Beam runners can not run a @ProcessElement call of a ParDo *concurrently*
with downstream processing of its results
- Passing an element from one thread to another using a BlockingQueue is
free in terms of performance
All of these are false at least in some runners, and I'm almost certain
that in reality, performance of this approach is worse than a ParDo in most
production runners.

There are other disadvantages to this approach:
- Doing the bulk of the processing in a separate thread makes it invisible
to Beam's instrumentation. If a Beam runner provided per-transform
profiling capabilities, or the ability to get the current stack trace for
stuck elements, this approach would make the real processing invisible to
all of these capabilities, and a user would only see that the bulk of the
time is spent waiting for the next element, but not *why* the next element
is taking long to compute.
- Likewise, offloading all the CPU and IO to a separate thread, invisible
to Beam, will make it harder for runners to do autoscaling, binpacking and
other resource management magic (how much of this runners actually do is a
separate issue), because the runner will have no way of knowing how much
CPU/IO this particular transform is actually using - all the processing
happens in a thread about which the runner is unaware.
- As far as I can tell, the code also hides exceptions that happen in the
Tika thread
- Adding the thread management makes the code much more complex, more prone
to bugs, and harder for others to contribute to

2) As I commented at the end of [2], having an option to concatenate the
data chunks first before making them available to the pipeline is
useful, and I guess doing the same in ParDo would introduce some
synchronization issues (though I'm not exactly sure yet)

What are these issues?


One of the valid concerns there is that the reader is polling the internal
queue so, in theory at least, and perhaps in some rare cases too, we may
have a case where the max polling time has been reached, the parser is
still busy, and TikaIO fails to report all the file data. I think that
it can be solved by either 2a) configuring the max polling time to a
very large number which will never be reached for a practical case, or
2b) simply using a blocking queue without the time limits - in the worst
case, if TikaParser spins and fails to report the end of the document,
then Beam can heal itself if the pipeline blocks.
I propose to follow 2b).
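
Option 2b) could look roughly like the sketch below, which uses only JDK
classes and is not the actual TikaReader code. BlockingChunkReader, its END
marker and readAll are hypothetical names. The assumptions are that the
parser thread always enqueues a terminal marker when it finishes, and that
any failure in that thread is handed to the consumer instead of being
swallowed:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class BlockingChunkReader {
    /** Unique marker object signalling that the producer has finished. */
    private static final Object END = new Object();

    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();

    /** Starts a producer thread that enqueues each chunk, then the end marker. */
    void start(Iterable<String> chunks) {
        Thread producer = new Thread(() -> {
            try {
                for (String chunk : chunks) {
                    queue.put(chunk);
                }
                queue.put(END);
            } catch (Throwable t) {
                queue.offer(t); // hand the failure to the consumer instead of hiding it
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    /** Blocks with no time limit until a chunk arrives; returns null at end of document. */
    String next() throws InterruptedException {
        Object item = queue.take();
        if (item == END) {
            return null;
        }
        if (item instanceof Throwable) {
            throw new IllegalStateException("producer failed", (Throwable) item);
        }
        return (String) item;
    }

    /** Drains all chunks in the order the producer emitted them. */
    static List<String> readAll(Iterable<String> chunks) {
        BlockingChunkReader reader = new BlockingChunkReader();
        reader.start(chunks);
        List<String> result = new ArrayList<>();
        try {
            for (String c = reader.next(); c != null; c = reader.next()) {
                result.add(c);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(readAll(Arrays.asList("chunk 1", "chunk 2", "chunk 3")));
    }
}
```

With take() there is no polling timeout to tune, so the reader cannot stop
early while the producer is still busy. The trade-off is the one described
above: if the producer never enqueues the end marker, the consumer blocks.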

I agree that there should be no way to unintentionally configure the
transform in a way that will produce silent data loss. Another reason for
not having these tuning knobs is that it goes against Beam's "no knobs"
philosophy, and that in most cases users have no way of figuring out a good
value for tuning knobs except for manual experimentation, which is
extremely brittle and typically gets immediately obsoleted by running on a
new dataset or updating a version of some of the involved dependencies
etc.



Please let me know what you think.
My plan so far is:
1) start addressing most of Eugene's comments which would require some
minor TikaIO updates
2) work on removing the TikaSource internal code dealing with File
patterns which I copied from TextIO at the next stage
3) If needed - mark TikaIO Experimental to give Tika and Beam users some
time to try it with some real complex files and also decide if TikaIO
can continue to be implemented as a BoundedSource/Reader or not

Eugene, all, will it work if I start with 1) ?

Yes, but I think we should start by discussing the anticipated use cases of
TikaIO and designing an API for it based on those use cases; and then see
what's the best implementation for that particular API and set of
anticipated use cases.


Thanks, Sergey

[1] https://issues.apache.org/jira/browse/BEAM-2328
[2] https://github.com/apache/beam/pull/3378
