RE: TikaIO concerns

Allison, Timothy B. Thu, 21 Sep 2017 14:23:35 -0700

Like Sergey, it’ll take me some time to understand your recommendations.  Thank 
you!


On one small point:
>return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a 
>class with properties { String content, Metadata metadata }

For this option, I’d strongly encourage using the JSON output from the 
RecursiveParserWrapper that contains metadata and content, and captures 
metadata even from embedded documents.

> However, since TikaIO can be applied to very large files, this could produce 
> very large elements, which is a bad idea
Large documents are a problem, no doubt about it…

From: Eugene Kirpichov [mailto:kirpic...@google.com]
Sent: Thursday, September 21, 2017 4:41 PM
To: Allison, Timothy B. <talli...@mitre.org>; dev@beam.apache.org
Cc: d...@tika.apache.org
Subject: Re: TikaIO concerns

Thanks all for the discussion. It seems we have consensus that both 
within-document order and association with the original filename are necessary, 
but currently absent from TikaIO.

Association with original file:
Sergey - Beam does not automatically provide a way to associate an element with 
the file it originated from: automatically tracking data provenance is a known, 
very hard research problem on which many papers have been written, and obvious 
solutions are very easy to break. See the related discussion at 
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E

If you want the elements of your PCollection to contain additional information, 
you need the elements themselves to contain this information: the elements are 
self-contained and have no metadata associated with them (beyond the timestamp 
and windows, universal to the whole Beam model).
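
To make this concrete, here is a minimal, Beam-free sketch in plain Java. The 
names TaggedText and ProvenanceSketch are hypothetical and exist only for this 
illustration, and a shuffled List stands in for an unordered PCollection. 
Because each element carries its own file name, the association survives the 
shuffle:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical element type: the file name travels inside the element itself.
final class TaggedText {
    final String fileName;
    final String text;

    TaggedText(String fileName, String text) {
        this.fileName = fileName;
        this.text = text;
    }
}

final class ProvenanceSketch {
    // Tag every text fragment of one file with that file's name.
    static List<TaggedText> tag(String fileName, List<String> fragments) {
        List<TaggedText> out = new ArrayList<>();
        for (String fragment : fragments) {
            out.add(new TaggedText(fileName, fragment));
        }
        return out;
    }

    // Count fragments per file; the result does not depend on element order.
    static Map<String, Integer> countByFile(List<TaggedText> elements) {
        Map<String, Integer> counts = new TreeMap<>();
        for (TaggedText e : elements) {
            counts.merge(e.fileName, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<TaggedText> all = new ArrayList<>();
        all.addAll(tag("a.pdf", Arrays.asList("first", "second")));
        all.addAll(tag("b.pdf", Arrays.asList("only")));
        // Shuffling stands in for the lack of ordering in a PCollection.
        Collections.shuffle(all, new Random(42));
        System.out.println(countByFile(all));
    }
}
```

In a real pipeline the same idea would typically be expressed as a KV whose key 
is the file name; the sketch only shows why the information has to live inside 
the element.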

Order within a file:
The only way to have any kind of order within a PCollection is to have the 
elements of the PCollection contain something ordered, e.g. have a 
PCollection<List<Something>>, where each List is for one file [I'm assuming 
Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be 
applied to very large files, this could produce very large elements, which is a 
bad idea. Because of this, I don't think the result of applying Tika to a 
single file can be encoded as a PCollection element.
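
As a rough illustration of the "something ordered inside the element" idea, 
here is a plain-Java sketch without Beam. PerFileOrderSketch and its methods 
are hypothetical names, and a shuffled List again stands in for a PCollection. 
The order of the files is lost, but each file's fragments stay in document 
order because they live inside a single element:

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class PerFileOrderSketch {
    // One element per file: the key is the file name, the value holds the
    // fragments in document order.
    static Map.Entry<String, List<String>> element(String fileName, String... fragments) {
        List<String> ordered =
            Collections.unmodifiableList(new ArrayList<>(Arrays.asList(fragments)));
        return new SimpleImmutableEntry<>(fileName, ordered);
    }

    // Rebuild the text of one file; relies only on the order inside the element.
    static String joined(Map.Entry<String, List<String>> element) {
        return String.join(" ", element.getValue());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, List<String>>> files = new ArrayList<>();
        files.add(element("a.txt", "once", "upon", "a", "time"));
        files.add(element("b.txt", "the", "end"));
        // The order of the files may change arbitrarily...
        Collections.shuffle(files, new Random(1));
        // ...but each file's fragments are still in document order.
        for (Map.Entry<String, List<String>> f : files) {
            System.out.println(f.getKey() + ": " + joined(f));
        }
    }
}
```

Note that the whole file ends up in one in-memory element here, which is 
exactly the size concern for very large files.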

Given both of these, I think that it's not possible to create a general-purpose 
TikaIO transform that will be better than manual invocation of Tika as a DoFn 
on the result of FileIO.readMatches().

However, looking at the examples at https://tika.apache.org/1.16/examples.html 
- almost all of the examples involve extracting a single String from each 
document. This use case, with the assumption that individual documents are 
small enough, can certainly be simplified and TikaIO could be a facade for 
doing just this.

E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a 
class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable, so it can be specified 
at pipeline construction time) and a ContentHandler whose toString() will go 
into "content". ContentHandler does not implement Serializable, so you cannot 
specify it at construction time - however, you can let the user specify either 
its class (if it's a simple handler like a BodyContentHandler) or a lambda for 
creating the handler (SerializableFunction<Void, ContentHandler>). Potentially 
you can also have a simpler facade for Tika.parseToString() - e.g. call it 
TikaIO.parseAllAsStrings().
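
The handler-factory idea can be sketched with the JDK alone. This is only an 
illustration under stated assumptions: the JDK's SAX parser stands in for a 
Tika Parser, and HandlerFactory, TextCollectingHandler and 
HandlerFactorySketch are hypothetical names (in Beam the factory would be the 
SerializableFunction<Void, ContentHandler> mentioned above). The factory is 
serializable even though the handler is not, and a fresh handler is created 
for every document:

```java
import java.io.Serializable;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a serializable handler factory.
interface HandlerFactory extends Serializable {
    ContentHandler create();
}

// Collects character events in document order; toString() returns the text.
final class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

final class HandlerFactorySketch {
    // Parse one document with a fresh handler and return handler.toString().
    static String parseToString(String xml, HandlerFactory factory) {
        ContentHandler handler = factory.create();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        // The constructor reference is the serializable "recipe" for a handler.
        HandlerFactory factory = TextCollectingHandler::new;
        System.out.println(
            parseToString("<doc><p>Hello, </p><p>Tika</p></doc>", factory));
    }
}
```

Creating the handler inside the per-document call also avoids sharing handler 
state between documents.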

Example usage would look like:

  PCollection<KV<String, ParseResult>> parseResults =
      p.apply(FileIO.match().filepattern(...))
          .apply(FileIO.readMatches())
          .apply(TikaIO.parseAllAsStrings());

or:

    .apply(TikaIO.parseAll()
        .withParser(new AutoDetectParser())
        .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO directly 
in simple cases, for example:
    p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and you'll 
be able to share the code between parseAll and regular parse.

On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin 
<sberyoz...@gmail.com> wrote:
Hi Tim
On 21/09/17 14:33, Allison, Timothy B. wrote:
> Thank you, Sergey.
>
> My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's 
> talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't 
> had a chance to work with it yet.
>
>  From my perspective, if I understand this thread (and I may not!), getting 
> unordered text from _a given file_ is a non-starter for most applications.  
> The implementation needs to guarantee order per file, and the user has to be 
> able to link the "extract" back to a unique identifier for the document.  If 
> the current implementation doesn't do those things, we need to change it, 
> IMHO.
>
Right now the Tika-related reader does not associate a given text fragment
with the file name, so a function looking at some text and trying to
find where it came from won't be able to do so.

So I asked how to do it in Beam, how to attach some context to the given
piece of data. I hope it can be done and if not - then perhaps some
improvement can be applied.

Re the unordered text - yes - this is what we currently have with Beam +
TikaIO :-).

The use-case I referred to earlier in this thread (upload PDFs, save
the possibly unordered text to Lucene with the file name 'attached', and let
users search for the files containing some words or phrases; this works
OK given that I can see the PDF parser, for example, reporting the lines) can be
supported OK with the current TikaIO (provided we find a way to 'attach'
a file name to the flow).

I see though supporting the total ordering can be a big deal in other
cases. Eugene, can you please explain how it can be done, is it
achievable in principle, without the users having to do some custom
coding ?

> To the question of -- why is this in Beam at all; why don't we let users call 
> it if they want it?...
>
> No matter how much we do to Tika, it will behave badly sometimes -- permanent 
> hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam 
> -- folks likely with large batches of unruly/noisy documents -- are more 
> likely to run into these problems than your average 
> couple-of-thousand-docs-from-our-own-company user. So, if there are things we 
> can do in Beam to prevent developers around the world from having to reinvent 
> the wheel for defenses against these problems, then I'd be enormously 
> grateful if we could put Tika into Beam.  That means:
>
> 1) a process-level timeout (because you can't actually kill a thread in Java)
> 2) a process-level restart on OOM
> 3) avoid trying to reprocess a badly behaving document
>
> If Beam automatically handles those problems, then I'd say, y, let users 
> write their own code.  If there is so much as a single configuration knob 
> (and it sounds like Beam is against complex configuration...yay!) to get that 
> working in Beam, then I'd say, please integrate Tika into Beam.  From a 
> safety perspective, it is critical to keep the extraction process entirely 
> separate (jvm, vm, m, rack, data center!) from the transformation+loading 
> steps.  IMHO, very few devs realize this because Tika works well lots of the 
> time...which is why it is critical for us to make it easy for people to get 
> it right all of the time.
>
> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode 
> first in one jvm, and then I kick off another process to do transform/loading 
> into Lucene/Solr from the .json files that Tika generates for each input 
> file.  If I were to scale up, I'd want to maintain this complete separation 
> of steps.
>
> Apologies if I've derailed the conversation or misunderstood this thread.
>
Major thanks for your input :-)

Cheers, Sergey

> Cheers,
>
>                 Tim
>
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
> Sent: Thursday, September 21, 2017 9:07 AM
> To: dev@beam.apache.org
> Cc: Allison, Timothy B. <talli...@mitre.org>
> Subject: Re: TikaIO concerns
>
> Hi All
>
> Please welcome Tim, one of Apache Tika leads and practitioners.
>
> Tim, thanks for joining in :-). If you have some great Apache Tika stories to 
> share (preferably involving the cases where it did not really matter the 
> ordering in which Tika-produced data were dealt with by the
> consumers) then please do so :-).
>
> At the moment, even though Tika ContentHandler will emit the ordered data, 
> the Beam runtime will have no guarantees that the downstream pipeline 
> components will see the data coming in the right order.
>
> (FYI, I understand from the earlier comments that the total ordering is also 
> achievable but would require the extra API support)
>
> Other comments would be welcome too
>
> Thanks, Sergey
>
> On 21/09/17 10:55, Sergey Beryozkin wrote:
>> I noticed that the PDF and ODT parsers actually split by lines, not
>> individual words and nearly 100% sure I saw Tika reporting individual
>> lines when it was parsing the text files. The 'min text length'
>> feature can help with reporting several lines at a time, etc...
>>
>> I'm working with this PDF all the time:
>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>
>> try it too if you get a chance.
>>
>> (and I can imagine not all PDFs/etc representing the 'story' but can
>> be for ex a log-like content too)
>>
>> That said, I don't know how a parser for the format N will behave, it
>> depends on the individual parsers.
>>
>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>
>> I'd like to know though how to make a file name available to the
>> pipeline which is working with the current text fragment ?
>>
>> Going to try and do some measurements and compare the sync vs async
>> parsing modes...
>>
>> Asked the Tika team to support with some more examples...
>>
>> Cheers, Sergey
>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>> Hi,
>>>
>>> thanks for the explanations,
>>>
>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>> Hi!
>>>>
>>>> TextIO returns an unordered soup of lines contained in all files you
>>>> ask it to read. People usually use TextIO for reading files where 1
>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>> a row of a CSV file - so discarding order is ok.
>>> Just a side note, I'd probably want that be ordered, though I guess
>>> it depends...
>>>> However, there is a number of cases where TextIO is a poor fit:
>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>> natural language processing and the text files contain actual prose,
>>>> where you need to process a file as a whole. TextIO can't do that.
>>>> - Cases where you need to remember which file each element came
>>>> from, e.g. if you're creating a search index for the files: TextIO
>>>> can't do this either.
>>>>
>>>> Both of these issues have been raised in the past against TextIO;
>>>> however it seems that the overwhelming majority of users of TextIO
>>>> use it for logs or CSV files or alike, so solving these issues has
>>>> not been a priority.
>>>> Currently they are solved in a general form via FileIO.read() which
>>>> gives you access to reading a full file yourself - people who want
>>>> more flexibility will be able to use standard Java text-parsing
>>>> utilities on a ReadableFile, without involving TextIO.
>>>>
>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>> use case where the files contain independent data entries, so
>>>> returning an unordered soup of them, with no association to the
>>>> original file, is the user's intention. XmlIO will not work for
>>>> processing more complex XML files that are not simply a sequence of
>>>> entries with the same tag, and it also does not remember the
>>>> original filename.
>>>>
>>>
>>> OK...
>>>
>>>> However, if my understanding of Tika use cases is correct, it is
>>>> mainly used for extracting content from complex file formats - for
>>>> example, extracting text and images from PDF files or Word
>>>> documents. I believe this is the main difference between it and
>>>> TextIO - people usually use Tika for complex use cases where the
>>>> "unordered soup of stuff" abstraction is not useful.
>>>>
>>>> My suspicion about this is confirmed by the fact that the crux of
>>>> the Tika API is ContentHandler
>>>> (http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true),
>>>> whose documentation says "The order of events in this interface is very
>>>> important, and mirrors the order of information in the document itself."
>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>> ContentHandler...
>>>>
>>>> Let me give a few examples of what I think is possible with the raw
>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>> with Tika and am judging just based on what I read about it.
>>>> - User has 100,000 Word documents and wants to convert each of them
>>>> to text files for future natural language processing.
>>>> - User has 100,000 PDF files with financial statements, each
>>>> containing a bunch of unrelated text and - the main content - a list
>>>> of transactions in PDF tables. User wants to extract each
>>>> transaction as a PCollection element, discarding the unrelated text.
>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>> extract text from them, somehow parse author and affiliation from
>>>> the text, and compute statistics of topics and terminology usage by
>>>> author name and affiliation.
>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>> observing a location over time: they want to extract metadata from
>>>> each image using Tika, analyze the images themselves using some
>>>> other library, and detect anomalies in the overall appearance of the
>>>> location over time as seen from multiple cameras.
>>>> I believe all of these cases can not be solved with TikaIO because
>>>> the resulting PCollection<String> contains no information about
>>>> which String comes from which document and about the order in which
>>>> they appear in the document.
>>> These are good use cases, thanks... I thought what you were talking
>>> about the unordered soup of data produced by TikaIO (and its friends
>>> TextIO and alike :-)).
>>> Putting the ordered vs unordered question aside for a sec, why
>>> exactly a Tika Reader can not make the name of the file it's
>>> currently reading from available to the pipeline, as some Beam pipeline 
>>> metadata piece ?
>>> Surely it can be possible with Beam ? If not then I would be surprised...
>>>
>>>>
>>>> I am, honestly, struggling to think of a case where I would want to
>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>> of strings.
>>>> So some examples would be very helpful.
>>>>
>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>> give one example where it did not matter to us in what order
>>> Tika-produced data were available to the downstream layer.
>>>
>>> It's a demo an Apache CXF colleague of mine showed at one of the
>>> ApacheCon NA events, and we had a happy audience:
>>>
>>> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search
>>>
>>>
>>> PDF or ODT files are uploaded, Tika parses them, and all of that is put
>>> into Lucene. We associate a file name with the indexed content and
>>> then let users find a list of PDF files which contain a given word or
>>> few words, details are here
>>> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131
>>>
>>>
>>> I'd say even more involved search engines would not mind supporting a
>>> case like that :-)
>>>
>>> Now there we process one file at a time, and I understand now that
>>> with TikaIO and N files it's all over the place really as far as the
>>> ordering is concerned, which file it's coming from. etc. That's why
>>> TikaReader must be able to associate the file name with a given piece
>>> of text it's making available to the pipeline.
>>>
>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>> If it makes things simpler then it would be good, I've just no idea
>>> at the moment how to start the pipeline without using a
>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>> len chunk' feature, where the ParDo would have to concatenate several
>>> SAX data pieces first before making a single composite piece to the 
>>> pipeline ?
>>>
>>>
>>>> Another way to state it: currently, if I wanted to solve all of the
>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>> provide a usability improvement over such usage?
>>>>
>>>
>>>
>>> If you are actually asking, does it really make sense for Beam to
>>> ship Tika related code, given that users can just do it themselves,
>>> I'm not sure.
>>>
>>> IMHO it always works better if users have to provide just few config
>>> options to an integral part of the framework and see things happening.
>>> It will bring more users.
>>>
>>> Whether the current Tika code (refactored or not) stays with Beam or
>>> not - I'll let you and the team decide; believe it or not I was
>>> seriously contemplating at the last moment to make it all part of the
>>> Tika project itself and have a bit more flexibility over there with
>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>> know - it's not my decision...
>>>
>>>> I am confused by your other comment - "Does the ordering matter ?
>>>> Perhaps for some cases it does, and for some it does not. May be it
>>>> makes sense to support running TikaIO as both the bounded
>>>> reader/source and ParDo, with getting the common code reused." -
>>>> because using BoundedReader or ParDo is not related to the ordering
>>>> issue, only to the issue of asynchronous reading and complexity of
>>>> implementation. The resulting PCollection will be unordered either
>>>> way - this needs to be solved separately by providing a different API.
>>> Right I see now, so ParDo is not about making Tika reported data
>>> available to the downstream pipeline components ordered, only about
>>> the simpler implementation.
>>> Association with the file should be possible I hope, but I understand
>>> it would be possible to optionally make the data coming out in the
>>> ordered way as well...
>>>
>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>> let me double check: should we still give some thought to the
>>> possible performance benefit of the current approach ? As I said, I
>>> can easily get rid of all that polling code, use a simple Blocking queue.
>>>
>>> Cheers, Sergey
>>>>
>>>> Thanks.
>>>>
>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>> <sberyoz...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Glad TikaIO is getting some serious attention :-), I believe one thing
>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>
>>>>> Before trying to reply online, I'd like to state that my main
>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>> no different to Text, XML or similar bounded reader components.
>>>>>
>>>>> I have to admit I don't understand your questions about TikaIO
>>>>> usecases.
>>>>>
>>>>> What are the Text Input or XML input use-cases ? These use cases
>>>>> are TikaInput cases as well, the only difference is Tika can not
>>>>> split the individual file into a sequence of sources/etc,
>>>>>
>>>>> TextIO can read from the plain text files (possibly zipped), XML -
>>>>> optimized around reading from the XML files, and I thought I made
>>>>> it clear (and it is a known fact anyway) Tika was about reading
>>>>> basically from any file format.
>>>>>
>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>
>>>>> Sergey
>>>>>
>>>>>
>>>>>
>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Replies inline.
>>>>>>
>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>> <sberyoz...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All
>>>>>>>
>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>> great to try and link both projects together, which led me to
>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>> [2].
>>>>>>>
>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>
>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>> SAXParser events. It's not possible with Tika to take a file and
>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>> by line; the only way is to handle the SAXParser callbacks which
>>>>>>> report the data chunks. Some parsers may report the complete
>>>>>>> lines, some individual words, with some being able to report the
>>>>>>> data only after they completely parse the document.
>>>>>>> All depends on the data format.
>>>>>>>
>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>> the tests might suggest otherwise).
>>>>>>>
>>>>>> I agree that your implementation of reader returns records in
>>>>>> order - but Beam PCollections are not ordered. Nothing in Beam cares
>>>>>> about the order in which records are produced by a BoundedReader -
>>>>>> the order produced by your reader is ignored, and when applying any
>>>>>> transforms to the PCollection produced by TikaIO, it is impossible
>>>>>> to recover the order in which your reader returned the records.
>>>>>>
>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>> Tika-detected items, still the right API for representing the
>>>>>> result of parsing a large number of documents with Tika?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> The reason I did it was because I thought
>>>>>>>
>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>> the pipeline - the parser will continue working via the
>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>> I agree there should be some tests data available confirming it -
>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>> be more effective not to get the Beam thread deal with it...
>>>>>>>
>>>>>> As I said on the PR, this description contains unfounded and
>>>>>> potentially incorrect assumptions about how Beam runners execute (or
>>>>>> may execute in the future) a ParDo or a BoundedReader. For example, if
>>>>>> I understand correctly, you might be assuming that:
>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>>> complete before processing its outputs with downstream transforms
>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>>> *concurrently* with downstream processing of its results
>>>>>> - Passing an element from one thread to another using a
>>>>>> BlockingQueue is free in terms of performance
>>>>>> All of these are false at least in some runners, and I'm almost
>>>>>> certain that in reality, performance of this approach is worse than a
>>>>>> ParDo in most production runners.
>>>>>>
>>>>>> There are other disadvantages to this approach:
>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>>> invisible to Beam's instrumentation. If a Beam runner provided
>>>>>> per-transform profiling capabilities, or the ability to get the
>>>>>> current stack trace for stuck elements, this approach would make the
>>>>>> real processing invisible to all of these capabilities, and a user
>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>> next element, but not *why* the next element is taking long to
>>>>>> compute.
>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>> invisible to Beam, will make it harder for runners to do autoscaling,
>>>>>> binpacking and other resource management magic (how much of this
>>>>>> runners actually do is a separate issue), because the runner will
>>>>>> have no way of knowing how much CPU/IO this particular transform is
>>>>>> actually using - all the processing happens in a thread about which
>>>>>> the runner is unaware.
>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>> in the Tika thread
>>>>>> - Adding the thread management makes the code much more complex,
>>>>>> easier to introduce bugs, and harder for others to contribute
>>>>>>
>>>>>>
>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>> concatenate the data chunks first before making them available to
>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>> yet)
>>>>>>>
>>>>>> What are these issues?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> One of the valid concerns there is that the reader is polling the
>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>> configuring the max polling time to a very large number which
>>>>>>> will never be reached for a practical case, or
>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>>> I propose to follow 2b).
>>>>>>>
>>>>>> I agree that there should be no way to unintentionally configure
>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>> Beam's "no knobs" philosophy, and that in most cases users have no
>>>>>> way of figuring out a good value for tuning knobs except for manual
>>>>>> experimentation, which is extremely brittle and typically gets
>>>>>> immediately obsoleted by running on a new dataset or updating a
>>>>>> version of some of the involved dependencies etc.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Please let me know what you think.
>>>>>>> My plan so far is:
>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>> some minor TikaIO updates
>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>> users some time to try it with some real complex files and also
>>>>>>> decide if TikaIO can continue to be implemented as a
>>>>>>> BoundedSource/Reader or not
>>>>>>>
>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>
>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>> cases of TikaIO and designing an API for it based on those use
>>>>>> cases; and then see what's the best implementation for that
>>>>>> particular API and set of anticipated use cases.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks, Sergey
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
