Thanks all for the discussion. It seems we have consensus that both
within-document order and association with the original filename are
necessary, but currently absent from TikaIO.

*Association with original file:*
Sergey - Beam does not *automatically* provide a way to associate an
element with the file it originated from: automatically tracking data
provenance is a known, very hard research problem on which many papers
have been written, and obvious solutions are easy to break. See the
related discussion at
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E .

If you want the elements of your PCollection to contain additional
information, you need the elements themselves to contain this information:
the elements are self-contained and have no metadata associated with them
(beyond the timestamp and windows, universal to the whole Beam model).

*Order within a file:*
The only way to have any kind of order within a PCollection is to have the
elements of the PCollection contain something ordered, e.g. have a
PCollection<List<Something>>, where each List is for one file [I'm assuming
Tika, at a low level, works on a per-file basis?]. However, since TikaIO
can be applied to very large files, this could produce very large elements,
which is a bad idea. Because of this, I don't think the result of applying
Tika to a single file can be encoded as a PCollection element.

Given both of these, I think that it's not possible to create a
*general-purpose* TikaIO transform that will be better than manual
invocation of Tika as a DoFn on the result of FileIO.readMatches().
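For reference, such a manual invocation could look roughly like the sketch
below. It assumes Beam's FileIO and Tika's AutoDetectParser; the anonymous
DoFn and its output shape (filename -> extracted text) are illustrative
assumptions rather than an existing API, and error handling is elided:

```java
// Sketch: invoking Tika by hand in a DoFn over FileIO.readMatches() output.
// Each output element pairs a file's name with its extracted text.
PCollection<KV<String, String>> extracted = p
    .apply(FileIO.match().filepattern("/path/to/docs/*"))
    .apply(FileIO.readMatches())
    .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
      @ProcessElement
      public void process(ProcessContext c) throws Exception {
        FileIO.ReadableFile file = c.element();
        String filename = file.getMetadata().resourceId().toString();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        try (InputStream is = Channels.newInputStream(file.open())) {
          new AutoDetectParser().parse(is, handler, new Metadata());
        }
        // The element itself carries the association with the original file.
        c.output(KV.of(filename, handler.toString()));
      }
    }));
```

This keeps both the filename association and, within each element, the
document order - at the cost of every user writing the same boilerplate.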

However, looking at the examples at
https://tika.apache.org/1.16/examples.html - almost all of the examples
involve extracting a single String from each document. This use case, with
the assumption that individual documents are small enough, can certainly be
simplified and TikaIO could be a facade for doing just this.

E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
is a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable, so it can be
specified at pipeline construction time) and a ContentHandler whose
toString() will go into "content". ContentHandler does not implement
Serializable, so you cannot specify it at construction time - however,
you can let the user specify either its class (if it's a simple handler
like a BodyContentHandler) or a lambda for creating the handler
(SerializableFunction<Void, ContentHandler>). You could also have a
simpler facade for Tika.parseAsString() - e.g. call it
TikaIO.parseAllAsStrings().
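To illustrate, here is a minimal sketch of what such a ParseResult value
class could look like - purely an assumption for discussion; in the real
thing the metadata field would presumably be Tika's
org.apache.tika.metadata.Metadata rather than the plain Map used here to
keep the sketch self-contained:

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.Map;

// Hypothetical ParseResult sketch: content holds the ContentHandler's
// toString() output; a plain Map stands in for Tika's Metadata object.
class ParseResult implements Serializable {
  private final String content;
  private final Map<String, String> metadata;

  ParseResult(String content, Map<String, String> metadata) {
    this.content = content;
    this.metadata = Collections.unmodifiableMap(metadata);
  }

  String getContent() { return content; }
  Map<String, String> getMetadata() { return metadata; }
}
```

Making it Serializable (or giving it a coder) matters, since it would be an
element type of a PCollection.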

Example usage would look like:

  PCollection<KV<String, ParseResult>> parseResults = p
      .apply(FileIO.match().filepattern(...))
      .apply(FileIO.readMatches())
      .apply(TikaIO.parseAllAsStrings());

or:

    .apply(TikaIO.parseAll()
        .withParser(new AutoDetectParser())
        .withContentHandler(
            () -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO
directly in simple cases, for example:
    p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and
you'll be able to share the code between parseAll and regular parse.
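For instance, parseAllAsStrings() could be a thin wrapper over the
configurable parseAll(), so both share one code path - a hypothetical
sketch of the delegation (the ParseAll name and builder methods are
assumptions mirroring the configuration above):

```java
// Hypothetical: parseAllAsStrings() as sugar over the configurable parseAll().
// BodyContentHandler(-1) disables Tika's default write limit.
public static ParseAll parseAllAsStrings() {
  return parseAll()
      .withParser(new AutoDetectParser())
      .withContentHandler(() -> new BodyContentHandler(-1));
}
```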

On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:

> Hi Tim
> On 21/09/17 14:33, Allison, Timothy B. wrote:
> > Thank you, Sergey.
> >
> > My knowledge of Apache Beam is limited -- I saw Davor and
> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
> impressed, but I haven't had a chance to work with it yet.
> >
> >  From my perspective, if I understand this thread (and I may not!),
> getting unordered text from _a given file_ is a non-starter for most
> applications.  The implementation needs to guarantee order per file, and
> the user has to be able to link the "extract" back to a unique identifier
> for the document.  If the current implementation doesn't do those things,
> we need to change it, IMHO.
> >
> Right now Tika-related reader does not associate a given text fragment
> with the file name, so a function looking at some text and trying to
> find where it came from won't be able to do so.
>
> So I asked how to do it in Beam, how to attach some context to the given
> piece of data. I hope it can be done and if not - then perhaps some
> improvement can be applied.
>
> Re the unordered text - yes - this is what we currently have with Beam +
> TikaIO :-).
>
> The use-case I referred to earlier in this thread (upload PDFs - save
> the possibly unordered text to Lucene with the file name 'attached', let
> users search for the files containing some words - phrases, this works
> OK given that I can see PDF parser for ex reporting the lines) can be
> supported OK with the current TikaIO (provided we find a way to 'attach'
> a file name to the flow).
>
> I see though supporting the total ordering can be a big deal in other
> cases. Eugene, can you please explain how it can be done, is it
> achievable in principle, without the users having to do some custom
> coding ?
>
> > To the question of -- why is this in Beam at all; why don't we let users
> call it if they want it?...
> >
> > No matter how much we do to Tika, it will behave badly sometimes --
> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
> using Beam -- folks likely with large batches of unruly/noisy documents --
> are more likely to run into these problems than your average
> couple-of-thousand-docs-from-our-own-company user. So, if there are things
> we can do in Beam to prevent developers around the world from having to
> reinvent the wheel for defenses against these problems, then I'd be
> enormously grateful if we could put Tika into Beam.  That means:
> >
> > 1) a process-level timeout (because you can't actually kill a thread in
> Java)
> > 2) a process-level restart on OOM
> > 3) avoid trying to reprocess a badly behaving document
> >
> > If Beam automatically handles those problems, then I'd say, y, let users
> write their own code.  If there is so much as a single configuration knob
> (and it sounds like Beam is against complex configuration...yay!) to get
> that working in Beam, then I'd say, please integrate Tika into Beam.  From
> a safety perspective, it is critical to keep the extraction process
> entirely separate (jvm, vm, m, rack, data center!) from the
> transformation+loading steps.  IMHO, very few devs realize this because
> Tika works well lots of the time...which is why it is critical for us to
> make it easy for people to get it right all of the time.
> >
> > Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
> mode first in one jvm, and then I kick off another process to do
> transform/loading into Lucene/Solr from the .json files that Tika generates
> for each input file.  If I were to scale up, I'd want to maintain this
> complete separation of steps.
> >
> > Apologies if I've derailed the conversation or misunderstood this thread.
> >
> Major thanks for your input :-)
>
> Cheers, Sergey
>
> > Cheers,
> >
> >                 Tim
> >
> > -----Original Message-----
> > From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
> > Sent: Thursday, September 21, 2017 9:07 AM
> > To: dev@beam.apache.org
> > Cc: Allison, Timothy B. <talli...@mitre.org>
> > Subject: Re: TikaIO concerns
> >
> > Hi All
> >
> > Please welcome Tim, one of Apache Tika leads and practitioners.
> >
> > Tim, thanks for joining in :-). If you have some great Apache Tika
> stories to share (preferably involving the cases where it did not really
> matter the ordering in which Tika-produced data were dealt with by the
> > consumers) then please do so :-).
> >
> > At the moment, even though Tika ContentHandler will emit the ordered
> data, the Beam runtime will have no guarantees that the downstream pipeline
> components will see the data coming in the right order.
> >
> > (FYI, I understand from the earlier comments that the total ordering is
> also achievable but would require the extra API support)
> >
> > Other comments would be welcome too
> >
> > Thanks, Sergey
> >
> > On 21/09/17 10:55, Sergey Beryozkin wrote:
> >> I noticed that the PDF and ODT parsers actually split by lines, not
> >> individual words, and I'm nearly 100% sure I saw Tika reporting individual
> >> lines when it was parsing the text files. The 'min text length'
> >> feature can help with reporting several lines at a time, etc...
> >>
> >> I'm working with this PDF all the time:
> >> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
> >>
> >> try it too if you get a chance.
> >>
> >> (and I can imagine not all PDFs/etc representing the 'story' but can
> >> be for ex a log-like content too)
> >>
> >> That said, I don't know how a parser for the format N will behave, it
> >> depends on the individual parsers.
> >>
> >> IMHO it's an equal candidate alongside Text-based bounded IOs...
> >>
> >> I'd like to know though how to make a file name available to the
> >> pipeline which is working with the current text fragment ?
> >>
> >> Going to try and do some measurements and compare the sync vs async
> >> parsing modes...
> >>
> >> Asked the Tika team to support with some more examples...
> >>
> >> Cheers, Sergey
> >> On 20/09/17 22:17, Sergey Beryozkin wrote:
> >>> Hi,
> >>>
> >>> thanks for the explanations,
> >>>
> >>> On 20/09/17 16:41, Eugene Kirpichov wrote:
> >>>> Hi!
> >>>>
> >>>> TextIO returns an unordered soup of lines contained in all files you
> >>>> ask it to read. People usually use TextIO for reading files where 1
> >>>> line corresponds to 1 independent data element, e.g. a log entry, or
> >>>> a row of a CSV file - so discarding order is ok.
> >>> Just a side note, I'd probably want that to be ordered, though I guess
> >>> it depends...
> >>>> However, there is a number of cases where TextIO is a poor fit:
> >>>> - Cases where discarding order is not ok - e.g. if you're doing
> >>>> natural language processing and the text files contain actual prose,
> >>>> where you need to process a file as a whole. TextIO can't do that.
> >>>> - Cases where you need to remember which file each element came
> >>>> from, e.g.
> >>>> if you're creating a search index for the files: TextIO can't do
> >>>> this either.
> >>>>
> >>>> Both of these issues have been raised in the past against TextIO;
> >>>> however it seems that the overwhelming majority of users of TextIO
> >>>> use it for logs or CSV files or alike, so solving these issues has
> >>>> not been a priority.
> >>>> Currently they are solved in a general form via FileIO.read() which
> >>>> gives you access to reading a full file yourself - people who want
> >>>> more flexibility will be able to use standard Java text-parsing
> >>>> utilities on a ReadableFile, without involving TextIO.
> >>>>
> >>>> Same applies for XmlIO: it is specifically designed for the narrow
> >>>> use case where the files contain independent data entries, so
> >>>> returning an unordered soup of them, with no association to the
> >>>> original file, is the user's intention. XmlIO will not work for
> >>>> processing more complex XML files that are not simply a sequence of
> >>>> entries with the same tag, and it also does not remember the
> >>>> original filename.
> >>>>
> >>>
> >>> OK...
> >>>
> >>>> However, if my understanding of Tika use cases is correct, it is
> >>>> mainly used for extracting content from complex file formats - for
> >>>> example, extracting text and images from PDF files or Word
> >>>> documents. I believe this is the main difference between it and
> >>>> TextIO - people usually use Tika for complex use cases where the
> >>>> "unordered soup of stuff" abstraction is not useful.
> >>>>
> >>>> My suspicion about this is confirmed by the fact that the crux of
> >>>> the Tika API is ContentHandler
> >>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true
> >>>>
> >>>> whose
> >>>> documentation says "The order of events in this interface is very
> >>>> important, and mirrors the order of information in the document
> itself."
> >>> All that says is that a (Tika) ContentHandler will be a true SAX
> >>> ContentHandler...
> >>>>
> >>>> Let me give a few examples of what I think is possible with the raw
> >>>> Tika API, but I think is not currently possible with TikaIO - please
> >>>> correct me where I'm wrong, because I'm not particularly familiar
> >>>> with Tika and am judging just based on what I read about it.
> >>>> - User has 100,000 Word documents and wants to convert each of them
> >>>> to text files for future natural language processing.
> >>>> - User has 100,000 PDF files with financial statements, each
> >>>> containing a bunch of unrelated text and - the main content - a list
> >>>> of transactions in PDF tables. User wants to extract each
> >>>> transaction as a PCollection element, discarding the unrelated text.
> >>>> - User has 100,000 PDF files with scientific papers, and wants to
> >>>> extract text from them, somehow parse author and affiliation from
> >>>> the text, and compute statistics of topics and terminology usage by
> >>>> author name and affiliation.
> >>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
> >>>> observing a location over time: they want to extract metadata from
> >>>> each image using Tika, analyze the images themselves using some
> >>>> other library, and detect anomalies in the overall appearance of the
> >>>> location over time as seen from multiple cameras.
> >>>> I believe all of these cases can not be solved with TikaIO because
> >>>> the resulting PCollection<String> contains no information about
> >>>> which String comes from which document and about the order in which
> >>>> they appear in the document.
> >>> These are good use cases, thanks... I thought what you were talking
> >>> about the unordered soup of data produced by TikaIO (and its friends
> >>> TextIO and alike :-)).
> >>> Putting the ordered vs unordered question aside for a sec, why
> >>> exactly a Tika Reader can not make the name of the file it's
> >>> currently reading from available to the pipeline, as some Beam
> pipeline metadata piece ?
> >>> Surely it can be possible with Beam ? If not then I would be
> surprised...
> >>>
> >>>>
> >>>> I am, honestly, struggling to think of a case where I would want to
> >>>> use Tika, but where I *would* be ok with getting an unordered soup
> >>>> of strings.
> >>>> So some examples would be very helpful.
> >>>>
> >>> Yes. I'll ask Tika developers to help with some examples, but I'll
> >>> give one example where it did not matter to us in what order
> >>> Tika-produced data were available to the downstream layer.
> >>>
> >>> It's a demo the Apache CXF colleague of mine showed at one of Apache
> >>> Con NAs, and we had a happy audience:
> >>>
> >>> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search
> >>>
> >>>
> >>> PDF or ODT files uploaded, Tika parses them, and all of that is put
> >>> into Lucene. We associate a file name with the indexed content and
> >>> then let users find a list of PDF files which contain a given word or
> >>> few words, details are here
> >>> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131
> >>>
> >>>
> >>> I'd say even more involved search engines would not mind supporting a
> >>> case like that :-)
> >>>
> >>> Now there we process one file at a time, and I understand now that
> >>> with TikaIO and N files it's all over the place really as far as the
> >>> ordering is concerned, which file it's coming from, etc. That's why
> >>> TikaReader must be able to associate the file name with a given piece
> >>> of text it's making available to the pipeline.
> >>>
> >>> I'd be happy to support the ParDo way of linking Tika with Beam.
> >>> If it makes things simpler then it would be good, I've just no idea
> >>> at the moment how to start the pipeline without using a
> >>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
> >>> earlier - how can one avoid it with ParDo when implementing a 'min
> >>> len chunk' feature, where the ParDo would have to concatenate several
> >>> SAX data pieces first before making a single composite piece to the
> pipeline ?
> >>>
> >>>
> >>>> Another way to state it: currently, if I wanted to solve all of the
> >>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
> >>>> API myself on the resulting ReadableFile. How can we make TikaIO
> >>>> provide a usability improvement over such usage?
> >>>>
> >>>
> >>>
> >>> If you are actually asking, does it really make sense for Beam to
> >>> ship Tika related code, given that users can just do it themselves,
> >>> I'm not sure.
> >>>
> >>> IMHO it always works better if users have to provide just a few config
> >>> options to an integral part of the framework and see things happening.
> >>> It will bring more users.
> >>>
> >>> Whether the current Tika code (refactored or not) stays with Beam or
> >>> not - I'll let you and the team decide; believe it or not I was
> >>> seriously contemplating at the last moment to make it all part of the
> >>> Tika project itself and have a bit more flexibility over there with
> >>> tweaking things, but now that it is in the Beam snapshot - I don't
> >>> know - it's not my decision...
> >>>
> >>>> I am confused by your other comment - "Does the ordering matter ?
> >>>> Perhaps
> >>>> for some cases it does, and for some it does not. May be it makes
> >>>> sense to support running TikaIO as both the bounded reader/source
> >>>> and ParDo, with getting the common code reused." - because using
> >>>> BoundedReader or ParDo is not related to the ordering issue, only to
> >>>> the issue of asynchronous reading and complexity of implementation.
> >>>> The resulting PCollection will be unordered either way - this needs
> >>>> to be solved separately by providing a different API.
> >>> Right I see now, so ParDo is not about making Tika reported data
> >>> available to the downstream pipeline components ordered, only about
> >>> the simpler implementation.
> >>> Association with the file should be possible I hope, but I understand
> >>> it would be possible to optionally make the data coming out in the
> >>> ordered way as well...
> >>>
> >>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
> >>> let me double check: should we still give some thought to the
> >>> possible performance benefit of the current approach ? As I said, I
> >>> can easily get rid of all that polling code and use a simple
> >>> BlockingQueue.
> >>>
> >>> Cheers, Sergey
> >>>>
> >>>> Thanks.
> >>>>
> >>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
> >>>> <sberyoz...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi
> >>>>>
> >>>>> Glad TikaIO getting some serious attention :-), I believe one thing
> >>>>> we both agree upon is that Tika can help Beam in its own unique way.
> >>>>>
> >>>>> Before trying to reply online, I'd like to state that my main
> >>>>> assumption is that TikaIO (as far as the read side is concerned) is
> >>>>> no different to Text, XML or similar bounded reader components.
> >>>>>
> >>>>> I have to admit I don't understand your questions about TikaIO
> >>>>> usecases.
> >>>>>
> >>>>> What are the Text Input or XML input use-cases ? These use cases
> >>>>> are Tika input cases as well; the only difference is that Tika cannot
> >>>>> split the individual file into a sequence of sources/etc,
> >>>>>
> >>>>> TextIO can read from the plain text files (possibly zipped), XML -
> >>>>> optimized around reading from the XML files, and I thought I made
> >>>>> it clear (and it is a known fact anyway) Tika was about reading
> >>>>> basically from any file format.
> >>>>>
> >>>>> Where is the difference (apart from what I've already mentioned) ?
> >>>>>
> >>>>> Sergey
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Replies inline.
> >>>>>>
> >>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
> >>>>>> <sberyoz...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi All
> >>>>>>>
> >>>>>>> This is my first post to the dev list, I work for Talend, I'm a
> >>>>>>> Beam novice, Apache Tika fan, and thought it would be really
> >>>>>>> great to try and link both projects together, which led me to
> >>>>>>> opening [1] where I typed some early thoughts, followed by PR
> >>>>>>> [2].
> >>>>>>>
> >>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
> >>>>>>> newer review comments from Eugene pending, so I'd like to
> >>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
> >>>>>>> decide, based on the feedback from the experts, what to do next.
> >>>>>>>
> >>>>>>> Apache Tika Parsers report the text content in chunks, via
> >>>>>>> SaxParser events. It's not possible with Tika to take a file and
> >>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
> >>>>>>> by line, the only way is to handle the SAXParser callbacks which
> >>>>>>> report the data chunks.
> >>>>>>> Some
> >>>>>>> parsers may report complete lines, some individual words, and some
> >>>>>>> can report the data only after they have completely parsed the
> >>>>>>> document.
> >>>>>>> All depends on the data format.
> >>>>>>>
> >>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
> >>>>>>> to parse the files, Beam threads will only collect the data from
> >>>>>>> the internal queue where the internal TikaReader's thread will
> >>>>>>> put the data into (note the data chunks are ordered even though
> >>>>>>> the tests might suggest otherwise).
> >>>>>>>
> >>>>>> I agree that your implementation of reader returns records in
> >>>>>> order
> >>>>>> - but
> >>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
> >>>>>> the order in which records are produced by a BoundedReader - the
> >>>>>> order produced by your reader is ignored, and when applying any
> >>>>>> transforms to the
> >>>>> PCollection
> >>>>>> produced by TikaIO, it is impossible to recover the order in which
> >>>>>> your reader returned the records.
> >>>>>>
> >>>>>> With that in mind, is PCollection<String>, containing individual
> >>>>>> Tika-detected items, still the right API for representing the
> >>>>>> result of parsing a large number of documents with Tika?
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> The reason I did it was because I thought
> >>>>>>>
> >>>>>>> 1) it would make the individual data chunks available faster to
> >>>>>>> the pipeline - the parser will continue working via the
> >>>>>>> binary/video etc file while the data will already start flowing -
> >>>>>>> I agree there should be some tests data available confirming it -
> >>>>>>> but I'm positive at the moment this approach might yield some
> >>>>>>> performance gains with the large sets. If the file is large, if
> >>>>>>> it has the embedded attachments/videos to deal with, then it may
> >>>>>>> be more effective not to get the Beam thread deal with it...
> >>>>>>>
> >>>>>>> As I said on the PR, this description contains unfounded and
> >>>>>>> potentially
> >>>>>> incorrect assumptions about how Beam runners execute (or may
> >>>>>> execute in
> >>>>> the
> >>>>>> future) a ParDo or a BoundedReader. For example, if I understand
> >>>>> correctly,
> >>>>>> you might be assuming that:
> >>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
> >>>>> complete
> >>>>>> before processing its outputs with downstream transforms
> >>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
> >>>>> *concurrently*
> >>>>>> with downstream processing of its results
> >>>>>> - Passing an element from one thread to another using a
> >>>>>> BlockingQueue is free in terms of performance. All of these are
> >>>>>> false at least in some runners, and I'm almost certain that in
> >>>>>> reality, performance of this approach is worse than a ParDo in
> >>>>> most
> >>>>>> production runners.
> >>>>>>
> >>>>>> There are other disadvantages to this approach:
> >>>>>> - Doing the bulk of the processing in a separate thread makes it
> >>>>> invisible
> >>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
> >>>>>> profiling capabilities, or the ability to get the current stack
> >>>>>> trace for stuck elements, this approach would make the real
> >>>>>> processing invisible to all of these capabilities, and a user
> >>>>>> would only see that the bulk of the time is spent waiting for the
> >>>>>> next element, but not *why* the next
> >>>>> element
> >>>>>> is taking long to compute.
> >>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
> >>>>>> invisible to Beam, will make it harder for runners to do
> >>>>>> autoscaling, binpacking
> >>>>> and
> >>>>>> other resource management magic (how much of this runners actually
> >>>>>> do is
> >>>>> a
> >>>>>> separate issue), because the runner will have no way of knowing
> >>>>>> how much CPU/IO this particular transform is actually using - all
> >>>>>> the processing happens in a thread about which the runner is
> >>>>>> unaware.
> >>>>>> - As far as I can tell, the code also hides exceptions that happen
> >>>>>> in the Tika thread
> >>>>>> - Adding the thread management makes the code much more complex,
> >>>>>> easier
> >>>>> to
> >>>>>> introduce bugs, and harder for others to contribute
> >>>>>>
> >>>>>>
> >>>>>>> 2) As I commented at the end of [2], having an option to
> >>>>>>> concatenate the data chunks first before making them available to
> >>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
> >>>>>>> introduce some synchronization issues (though not exactly sure
> >>>>>>> yet)
> >>>>>>>
> >>>>>> What are these issues?
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> One of valid concerns there is that the reader is polling the
> >>>>>>> internal queue so, in theory at least, and perhaps in some rare
> >>>>>>> cases too, we may have a case where the max polling time has been
> >>>>>>> reached, the parser is still busy, and TikaIO fails to report all
> >>>>>>> the file data. I think that it can be solved by either 2a)
> >>>>>>> configuring the max polling time to a very large number which
> >>>>>>> will never be reached for a practical case, or
> >>>>>>> 2b) simply use a blocking queue without the time limits - in the
> >>>>>>> worst case, if TikaParser spins and fails to report the end of
> >>>>>>> the document, then Beam can heal itself if the pipeline blocks.
> >>>>>>> I propose to follow 2b).
> >>>>>>>
> >>>>>> I agree that there should be no way to unintentionally configure
> >>>>>> the transform in a way that will produce silent data loss. Another
> >>>>>> reason for not having these tuning knobs is that it goes against
> >>>>>> Beam's "no knobs"
> >>>>>> philosophy, and that in most cases users have no way of figuring
> >>>>>> out a
> >>>>> good
> >>>>>> value for tuning knobs except for manual experimentation, which is
> >>>>>> extremely brittle and typically gets immediately obsoleted by
> >>>>>> running on
> >>>>> a
> >>>>>> new dataset or updating a version of some of the involved
> >>>>>> dependencies
> >>>>> etc.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Please let me know what you think.
> >>>>>>> My plan so far is:
> >>>>>>> 1) start addressing most of Eugene's comments which would require
> >>>>>>> some minor TikaIO updates
> >>>>>>> 2) work on removing the TikaSource internal code dealing with
> >>>>>>> File patterns which I copied from TextIO at the next stage
> >>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
> >>>>>>> users some time to try it with some real complex files and also
> >>>>>>> decide if TikaIO can continue implemented as a
> >>>>>>> BoundedSource/Reader or not
> >>>>>>>
> >>>>>>> Eugene, all, will it work if I start with 1) ?
> >>>>>>>
> >>>>>> Yes, but I think we should start by discussing the anticipated use
> >>>>>> cases
> >>>>> of
> >>>>>> TikaIO and designing an API for it based on those use cases; and
> >>>>>> then see what's the best implementation for that particular API
> >>>>>> and set of anticipated use cases.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Thanks, Sergey
> >>>>>>>
> >>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
> >>>>>>> [2] https://github.com/apache/beam/pull/3378
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
>
