Hi all,

One other thing is that Tika extracts metadata and language information, and
the order in which those arrive doesn't matter (keys can be out of order).

Would this be useful?

Cheers,
Chris




On 9/21/17, 2:10 PM, "Sergey Beryozkin" <sberyoz...@gmail.com> wrote:

    Hi Eugene
    
    Thank you, very helpful; let me read it a few times before I get what
    exactly I need to clarify :-). Two questions so far:
    
    On 21/09/17 21:40, Eugene Kirpichov wrote:
    > Thanks all for the discussion. It seems we have consensus that both
    > within-document order and association with the original filename are
    > necessary, but currently absent from TikaIO.
    > 
    > *Association with original file:*
    > Sergey - Beam does not *automatically* provide a way to associate an
    > element with the file it originated from: automatically tracking data
    > provenance is a known very hard research problem on which many papers have
    > been written, and obvious solutions are very easy to break. See related
    > discussion at
    > 
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
    >   .
    > 
    > If you want the elements of your PCollection to contain additional
    > information, you need the elements themselves to contain this information:
    > the elements are self-contained and have no metadata associated with them
    > (beyond the timestamp and windows, universal to the whole Beam model).
    > 
    > *Order within a file:*
    > The only way to have any kind of order within a PCollection is to have the
    > elements of the PCollection contain something ordered, e.g. have a
    > PCollection<List<Something>>, where each List is for one file [I'm assuming
    > Tika, at a low level, works on a per-file basis?]. However, since TikaIO
    > can be applied to very large files, this could produce very large elements,
    > which is a bad idea. Because of this, I don't think the result of applying
    > Tika to a single file can be encoded as a PCollection element.
    > 
    > Given both of these, I think that it's not possible to create a
    > *general-purpose* TikaIO transform that will be better than manual
    > invocation of Tika as a DoFn on the result of FileIO.readMatches().
    > 
    > However, looking at the examples at
    > https://tika.apache.org/1.16/examples.html - almost all of the examples
    > involve extracting a single String from each document. This use case, with
    > the assumption that individual documents are small enough, can certainly be
    > simplified and TikaIO could be a facade for doing just this.
    > 
    > E.g. TikaIO could:
    > - take as input a PCollection<ReadableFile>
    > - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
    > is a class with properties { String content, Metadata metadata }
    
    and what is the 'String' in KV<String,...> given that TikaIO.ParseResult 
    represents the content + (Tika) Metadata of the file such as the author 
    name, etc ? Is it the file name ?
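For illustration, the proposed ParseResult could look roughly like the sketch below. This is only a guess at the shape being discussed; Tika's real Metadata type is simplified here to a plain Map, and the class name is taken from the proposal above.

```java
import java.util.Map;

// Hypothetical sketch of the proposed TikaIO.ParseResult value class.
// Tika's org.apache.tika.metadata.Metadata is stood in for by a Map here.
public class ParseResult {
    private final String content;
    private final Map<String, String> metadata;

    public ParseResult(String content, Map<String, String> metadata) {
        this.content = content;
        this.metadata = metadata;
    }

    public String getContent() { return content; }
    public Map<String, String> getMetadata() { return metadata; }
}
```

The String key of the KV would then carry the per-file identifier the thread is asking about, keeping the element itself self-contained.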
    > - be configured by: a Parser (it implements Serializable so can be
    > specified at pipeline construction time) and a ContentHandler whose
    > toString() will go into "content". ContentHandler does not implement
    > Serializable, so you can not specify it at construction time - however, you
    > can let the user specify either its class (if it's a simple handler like a
    > BodyContentHandler) or specify a lambda for creating the handler
    > (SerializableFunction<Void, ContentHandler>), and potentially you can have
    > a simpler facade for Tika.parseAsString() - e.g. call it
    > TikaIO.parseAllAsStrings().
    > 
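The "specify a factory rather than the handler" idea above can be sketched as follows. All names here are hypothetical stand-ins, not the actual Beam or Tika API: the point is only that a non-serializable handler is created per document from a serializable supplier.

```java
import java.io.Serializable;
import java.util.function.Supplier;

// Sketch: the handler is not Serializable, so the pipeline would capture a
// serializable factory and build a fresh handler for each document.
public class HandlerFactoryDemo {
    interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

    // Non-serializable handler stand-in (think BodyContentHandler).
    static class BodyHandler {
        private final StringBuilder text = new StringBuilder();
        void characters(String chunk) { text.append(chunk); }
        @Override public String toString() { return text.toString(); }
    }

    static String parse(SerializableSupplier<BodyHandler> factory, String... chunks) {
        BodyHandler handler = factory.get();   // fresh handler for this document
        for (String c : chunks) handler.characters(c);
        return handler.toString();             // toString() becomes "content"
    }
}
```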
    > Example usage would look like:
    > 
    >    PCollection<KV<String, ParseResult>> parseResults =
    > p.apply(FileIO.match().filepattern(...))
    >      .apply(FileIO.readMatches())
    >      .apply(TikaIO.parseAllAsStrings())
    > 
    > or:
    > 
    >      .apply(TikaIO.parseAll()
    >          .withParser(new AutoDetectParser())
    >          .withContentHandler(() -> new BodyContentHandler(new
    > ToXMLContentHandler())))
    > 
    > You could also have shorthands for letting the user avoid using FileIO
    > directly in simple cases, for example:
    >      p.apply(TikaIO.parseAsStrings().from(filepattern))
    > 
    > This would of course be implemented as a ParDo or even MapElements, and
    > you'll be able to share the code between parseAll and regular parse.
    > 
    OK. What about the current source on master - should it be marked 
    Experimental till I manage to write something new with the above ideas 
    in mind ? Or is there enough time till 2.2.0 gets released ?
    
    Thanks, Sergey
    > On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sberyoz...@gmail.com>
    > wrote:
    > 
    >> Hi Tim
    >> On 21/09/17 14:33, Allison, Timothy B. wrote:
    >>> Thank you, Sergey.
    >>>
    >>> My knowledge of Apache Beam is limited -- I saw Davor and
    >> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
    >> impressed, but I haven't had a chance to work with it yet.
    >>>
    >>> From my perspective, if I understand this thread (and I may not!),
    >> getting unordered text from _a given file_ is a non-starter for most
    >> applications.  The implementation needs to guarantee order per file, and
    >> the user has to be able to link the "extract" back to a unique identifier
    >> for the document.  If the current implementation doesn't do those things,
    >> we need to change it, IMHO.
    >>>
    >> Right now the Tika-related reader does not associate a given text
    >> fragment with the file name, so a function looking at some text and
    >> trying to find where it came from won't be able to do so.
    >>
    >> So I asked how to do it in Beam, how to attach some context to the given
    >> piece of data. I hope it can be done and if not - then perhaps some
    >> improvement can be applied.
    >>
    >> Re the unordered text - yes - this is what we currently have with Beam +
    >> TikaIO :-).
    >>
    >> The use-case I referred to earlier in this thread (upload PDFs, save
    >> the possibly unordered text to Lucene with the file name 'attached',
    >> and let users search for the files containing some words or phrases;
    >> this works OK given that I can see, for example, the PDF parser
    >> reporting whole lines) can be supported with the current TikaIO
    >> (provided we find a way to 'attach' a file name to the flow).
    >>
    >> I see, though, that supporting total ordering can be a big deal in
    >> other cases. Eugene, can you please explain how it can be done - is it
    >> achievable in principle, without the users having to do some custom
    >> coding ?
    >>
    >>> To the question of -- why is this in Beam at all; why don't we let users
    >> call it if they want it?...
    >>>
    >>> No matter how much we do to Tika, it will behave badly sometimes --
    >> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
    >> using Beam -- folks likely with large batches of unruly/noisy documents --
    >> are more likely to run into these problems than your average
    >> couple-of-thousand-docs-from-our-own-company user. So, if there are things
    >> we can do in Beam to prevent developers around the world from having to
    >> reinvent the wheel for defenses against these problems, then I'd be
    >> enormously grateful if we could put Tika into Beam.  That means:
    >>>
    >>> 1) a process-level timeout (because you can't actually kill a thread in
    >> Java)
    >>> 2) a process-level restart on OOM
    >>> 3) avoid trying to reprocess a badly behaving document
    >>>
    >>> If Beam automatically handles those problems, then I'd say, y, let users
    >> write their own code.  If there is so much as a single configuration knob
    >> (and it sounds like Beam is against complex configuration...yay!) to get
    >> that working in Beam, then I'd say, please integrate Tika into Beam.  From
    >> a safety perspective, it is critical to keep the extraction process
    >> entirely separate (jvm, vm, m, rack, data center!) from the
    >> transformation+loading steps.  IMHO, very few devs realize this because
    >> Tika works well lots of the time...which is why it is critical for us to
    >> make it easy for people to get it right all of the time.
    >>>
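Point 1 in the list above (a process-level timeout, because a stuck thread cannot be killed in Java) can be sketched like this. The helper and command are hypothetical; real code would launch a Tika batch/server JVM rather than an arbitrary command.

```java
import java.util.concurrent.TimeUnit;

// Sketch: run the extraction in a child process and kill the whole process
// on timeout - the process-level equivalent of kill -9 for a hung parse.
public class ProcessTimeoutDemo {
    static boolean runWithTimeout(long timeoutMillis, String... command) throws Exception {
        Process child = new ProcessBuilder(command).start();
        if (!child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            child.destroyForcibly();   // kill the stuck extraction process
            return false;              // treat as a timed-out document
        }
        return true;
    }
}
```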
    >>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
    >> mode first in one jvm, and then I kick off another process to do
    >> transform/loading into Lucene/Solr from the .json files that Tika generates
    >> for each input file.  If I were to scale up, I'd want to maintain this
    >> complete separation of steps.
    >>>
    >>> Apologies if I've derailed the conversation or misunderstood this thread.
    >>>
    >> Major thanks for your input :-)
    >>
    >> Cheers, Sergey
    >>
    >>> Cheers,
    >>>
    >>>                  Tim
    >>>
    >>> -----Original Message-----
    >>> From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
    >>> Sent: Thursday, September 21, 2017 9:07 AM
    >>> To: dev@beam.apache.org
    >>> Cc: Allison, Timothy B. <talli...@mitre.org>
    >>> Subject: Re: TikaIO concerns
    >>>
    >>> Hi All
    >>>
    >>> Please welcome Tim, one of Apache Tika leads and practitioners.
    >>>
    >>> Tim, thanks for joining in :-). If you have some great Apache Tika
    >> stories to share (preferably involving cases where the order in which
    >> Tika-produced data were dealt with by the consumers did not really
    >>> matter) then please do so :-).
    >>>
    >>> At the moment, even though Tika ContentHandler will emit the ordered
    >> data, the Beam runtime will have no guarantees that the downstream
    >> pipeline components will see the data coming in the right order.
    >>>
    >>> (FYI, I understand from the earlier comments that the total ordering is
    >> also achievable but would require the extra API support)
    >>>
    >>> Other comments would be welcome too
    >>>
    >>> Thanks, Sergey
    >>>
    >>> On 21/09/17 10:55, Sergey Beryozkin wrote:
    >>>> I noticed that the PDF and ODT parsers actually split by lines, not
    >>>> individual words, and I'm nearly 100% sure I saw Tika reporting
    >>>> individual lines when it was parsing text files. The 'min text length'
    >>>> feature can help with reporting several lines at a time, etc...
    >>>>
    >>>> I'm working with this PDF all the time:
    >>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
    >>>>
    >>>> try it too if you get a chance.
    >>>>
    >>>> (and I can imagine that not all PDFs/etc represent a 'story' - some
    >>>> can contain log-like content too, for example)
    >>>>
    >>>> That said, I don't know how a parser for format N will behave; it
    >>>> depends on the individual parsers.
    >>>>
    >>>> IMHO it's an equal candidate alongside Text-based bounded IOs...
    >>>>
    >>>> I'd like to know though how to make a file name available to the
    >>>> pipeline which is working with the current text fragment ?
    >>>>
    >>>> Going to try and do some measurements and compare the sync vs async
    >>>> parsing modes...
    >>>>
    >>>> Asked the Tika team to support with some more examples...
    >>>>
    >>>> Cheers, Sergey
    >>>> On 20/09/17 22:17, Sergey Beryozkin wrote:
    >>>>> Hi,
    >>>>>
    >>>>> thanks for the explanations,
    >>>>>
    >>>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
    >>>>>> Hi!
    >>>>>>
    >>>>>> TextIO returns an unordered soup of lines contained in all files you
    >>>>>> ask it to read. People usually use TextIO for reading files where 1
    >>>>>> line corresponds to 1 independent data element, e.g. a log entry, or
    >>>>>> a row of a CSV file - so discarding order is ok.
    >>>>> Just a side note, I'd probably want that be ordered, though I guess
    >>>>> it depends...
    >>>>>> However, there is a number of cases where TextIO is a poor fit:
    >>>>>> - Cases where discarding order is not ok - e.g. if you're doing
    >>>>>> natural language processing and the text files contain actual prose,
    >>>>>> where you need to process a file as a whole. TextIO can't do that.
    >>>>>> - Cases where you need to remember which file each element came
    >>>>>> from, e.g.
    >>>>>> if you're creating a search index for the files: TextIO can't do
    >>>>>> this either.
    >>>>>>
    >>>>>> Both of these issues have been raised in the past against TextIO;
    >>>>>> however it seems that the overwhelming majority of users of TextIO
    >>>>>> use it for logs or CSV files or alike, so solving these issues has
    >>>>>> not been a priority.
    >>>>>> Currently they are solved in a general form via FileIO.read() which
    >>>>>> gives you access to reading a full file yourself - people who want
    >>>>>> more flexibility will be able to use standard Java text-parsing
    >>>>>> utilities on a ReadableFile, without involving TextIO.
    >>>>>>
    >>>>>> Same applies for XmlIO: it is specifically designed for the narrow
    >>>>>> use case where the files contain independent data entries, so
    >>>>>> returning an unordered soup of them, with no association to the
    >>>>>> original file, is the user's intention. XmlIO will not work for
    >>>>>> processing more complex XML files that are not simply a sequence of
    >>>>>> entries with the same tag, and it also does not remember the
    >>>>>> original filename.
    >>>>>>
    >>>>>
    >>>>> OK...
    >>>>>
    >>>>>> However, if my understanding of Tika use cases is correct, it is
    >>>>>> mainly used for extracting content from complex file formats - for
    >>>>>> example, extracting text and images from PDF files or Word
    >>>>>> documents. I believe this is the main difference between it and
    >>>>>> TextIO - people usually use Tika for complex use cases where the
    >>>>>> "unordered soup of stuff" abstraction is not useful.
    >>>>>>
    >>>>>> My suspicion about this is confirmed by the fact that the crux of
    >>>>>> the Tika API is ContentHandler
    >>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
    >>>>>> html?is-external=true
    >>>>>>
    >>>>>> whose
    >>>>>> documentation says "The order of events in this interface is very
    >>>>>> important, and mirrors the order of information in the document
    >> itself."
    >>>>> All that says is that a (Tika) ContentHandler will be a true SAX
    >>>>> ContentHandler...
    >>>>>>
    >>>>>> Let me give a few examples of what I think is possible with the raw
    >>>>>> Tika API, but I think is not currently possible with TikaIO - please
    >>>>>> correct me where I'm wrong, because I'm not particularly familiar
    >>>>>> with Tika and am judging just based on what I read about it.
    >>>>>> - User has 100,000 Word documents and wants to convert each of them
    >>>>>> to text files for future natural language processing.
    >>>>>> - User has 100,000 PDF files with financial statements, each
    >>>>>> containing a bunch of unrelated text and - the main content - a list
    >>>>>> of transactions in PDF tables. User wants to extract each
    >>>>>> transaction as a PCollection element, discarding the unrelated text.
    >>>>>> - User has 100,000 PDF files with scientific papers, and wants to
    >>>>>> extract text from them, somehow parse author and affiliation from
    >>>>>> the text, and compute statistics of topics and terminology usage by
    >>>>>> author name and affiliation.
    >>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
    >>>>>> observing a location over time: they want to extract metadata from
    >>>>>> each image using Tika, analyze the images themselves using some
    >>>>>> other library, and detect anomalies in the overall appearance of the
    >>>>>> location over time as seen from multiple cameras.
    >>>>>> I believe all of these cases can not be solved with TikaIO because
    >>>>>> the resulting PCollection<String> contains no information about
    >>>>>> which String comes from which document and about the order in which
    >>>>>> they appear in the document.
    >>>>> These are good use cases, thanks... I thought you were talking
    >>>>> about the unordered soup of data produced by TikaIO (and its friends
    >>>>> TextIO and alike :-)).
    >>>>> Putting the ordered vs unordered question aside for a sec, why
    >>>>> exactly can a Tika Reader not make the name of the file it's
    >>>>> currently reading from available to the pipeline, as some Beam
    >> pipeline metadata piece ?
    >>>>> Surely it must be possible with Beam ? If not then I would be
    >> surprised...
    >>>>>
    >>>>>>
    >>>>>> I am, honestly, struggling to think of a case where I would want to
    >>>>>> use Tika, but where I *would* be ok with getting an unordered soup
    >>>>>> of strings.
    >>>>>> So some examples would be very helpful.
    >>>>>>
    >>>>> Yes. I'll ask Tika developers to help with some examples, but I'll
    >>>>> give one example where it did not matter to us in what order
    >>>>> Tika-produced data were available to the downstream layer.
    >>>>>
    >>>>> It's a demo an Apache CXF colleague of mine showed at one of the
    >>>>> ApacheCon NAs, and we had a happy audience:
    >>>>>
    >>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search
    >>>>>
    >>>>>
    >>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
    >>>>> into Lucene. We associate a file name with the indexed content and
    >>>>> then let users find a list of PDF files which contain a given word or
    >>>>> few words, details are here
    >>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131
    >>>>>
    >>>>>
    >>>>> I'd say even more involved search engines would not mind supporting a
    >>>>> case like that :-)
    >>>>>
    >>>>> Now there we process one file at a time, and I understand now that
    >>>>> with TikaIO and N files it's all over the place really as far as the
    >>>>> ordering is concerned, which file it's coming from, etc. That's why
    >>>>> the TikaReader must be able to associate the file name with a given
    >>>>> piece of text it's making available to the pipeline.
    >>>>>
    >>>>> I'd be happy to support the ParDo way of linking Tika with Beam.
    >>>>> If it makes things simpler then it would be good, I've just no idea
    >>>>> at the moment how to start the pipeline without using a
    >>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
    >>>>> earlier - how can one avoid it with ParDo when implementing a 'min
    >>>>> len chunk' feature, where the ParDo would have to concatenate several
    >>>>> SAX data pieces first before making a single composite piece
    >> available to the pipeline ?
    >>>>>
    >>>>>
    >>>>>> Another way to state it: currently, if I wanted to solve all of the
    >>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
    >>>>>> API myself on the resulting ReadableFile. How can we make TikaIO
    >>>>>> provide a usability improvement over such usage?
    >>>>>>
    >>>>>
    >>>>>
    >>>>> If you are actually asking, does it really make sense for Beam to
    >>>>> ship Tika related code, given that users can just do it themselves,
    >>>>> I'm not sure.
    >>>>>
    >>>>> IMHO it always works better if users have to provide just a few config
    >>>>> options to an integral part of the framework and see things happening.
    >>>>> It will bring more users.
    >>>>>
    >>>>> Whether the current Tika code (refactored or not) stays with Beam or
    >>>>> not - I'll let you and the team decide; believe it or not I was
    >>>>> seriously contemplating, at the last moment, making it all part of the
    >>>>> Tika project itself and having a bit more flexibility over there with
    >>>>> tweaking things, but now that it is in the Beam snapshot - I don't
    >>>>> know - it's not my decision...
    >>>>>
    >>>>>> I am confused by your other comment - "Does the ordering matter ?
    >>>>>> Perhaps
    >>>>>> for some cases it does, and for some it does not. May be it makes
    >>>>>> sense to support running TikaIO as both the bounded reader/source
    >>>>>> and ParDo, with getting the common code reused." - because using
    >>>>>> BoundedReader or ParDo is not related to the ordering issue, only to
    >>>>>> the issue of asynchronous reading and complexity of implementation.
    >>>>>> The resulting PCollection will be unordered either way - this needs
    >>>>>> to be solved separately by providing a different API.
    >>>>> Right, I see now - so ParDo is not about making Tika-reported data
    >>>>> available to the downstream pipeline components in order, only about
    >>>>> the simpler implementation.
    >>>>> Association with the file should be possible, I hope, and I understand
    >>>>> it would also be possible to optionally make the data come out
    >>>>> ordered...
    >>>>>
    >>>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
    >>>>> let me double check: should we still give some thought to the
    >>>>> possible performance benefit of the current approach ? As I said, I
    >>>>> can easily get rid of all that polling code and use a simple
    >> BlockingQueue.
    >>>>>
    >>>>> Cheers, Sergey
    >>>>>>
    >>>>>> Thanks.
    >>>>>>
    >>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
    >>>>>> <sberyoz...@gmail.com>
    >>>>>> wrote:
    >>>>>>
    >>>>>>> Hi
    >>>>>>>
    >>>>>>> Glad TikaIO is getting some serious attention :-). I believe one
    >>>>>>> thing we both agree upon is that Tika can help Beam in its own
    >>>>>>> unique way.
    >>>>>>>
    >>>>>>> Before trying to reply online, I'd like to state that my main
    >>>>>>> assumption is that TikaIO (as far as the read side is concerned) is
    >>>>>>> no different to Text, XML or similar bounded reader components.
    >>>>>>>
    >>>>>>> I have to admit I don't understand your questions about TikaIO
    >>>>>>> usecases.
    >>>>>>>
    >>>>>>> What are the Text input or XML input use-cases ? These use cases
    >>>>>>> are Tika input cases as well; the only difference is that Tika can
    >>>>>>> not split an individual file into a sequence of sources/etc.
    >>>>>>>
    >>>>>>> TextIO can read from plain text files (possibly zipped), XmlIO is
    >>>>>>> optimized around reading from XML files, and I thought I made it
    >>>>>>> clear (and it is a known fact anyway) that Tika is about reading
    >>>>>>> from basically any file format.
    >>>>>>>
    >>>>>>> Where is the difference (apart from what I've already mentioned) ?
    >>>>>>>
    >>>>>>> Sergey
    >>>>>>>
    >>>>>>>
    >>>>>>>
    >>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
    >>>>>>>> Hi,
    >>>>>>>>
    >>>>>>>> Replies inline.
    >>>>>>>>
    >>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
    >>>>>>>> <sberyoz...@gmail.com>
    >>>>>>>> wrote:
    >>>>>>>>
    >>>>>>>>> Hi All
    >>>>>>>>>
    >>>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
    >>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
    >>>>>>>>> great to try and link both projects together, which led me to
    >>>>>>>>> opening [1] where I typed some early thoughts, followed by PR
    >>>>>>>>> [2].
    >>>>>>>>>
    >>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
    >>>>>>>>> newer review comments from Eugene pending, so I'd like to
    >>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
    >>>>>>>>> decide, based on the feedback from the experts, what to do next.
    >>>>>>>>>
    >>>>>>>>> Apache Tika parsers report the text content in chunks, via
    >>>>>>>>> SAXParser events. It's not possible with Tika to take a file and
    >>>>>>>>> read it bit by bit, line by line, at the 'initiative' of the Beam
    >>>>>>>>> Reader; the only way is to handle the SAXParser callbacks which
    >>>>>>>>> report the data chunks.
    >>>>>>>>> Some parsers may report complete lines, some individual words,
    >>>>>>>>> and some are able to report the data only after they completely
    >>>>>>>>> parse the document.
    >>>>>>>>> It all depends on the data format.
    >>>>>>>>>
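The push-style SAX model described above can be illustrated with the JDK's own SAX parser on a tiny XML document; Tika's ContentHandler follows the same org.xml.sax contract. The point is that the parser drives the callbacks and decides how the text is chunked, not the caller.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustration: collect the text chunks the SAX parser chooses to report.
public class SaxChunksDemo {
    static List<String> textChunks(String xml) throws Exception {
        List<String> chunks = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may call this any number of times per text node.
                chunks.add(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return chunks;
    }
}
```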
    >>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
    >>>>>>>>> to parse the files; the Beam threads only collect the data from
    >>>>>>>>> the internal queue into which the TikaReader's own thread puts
    >>>>>>>>> the data (note the data chunks are ordered even though
    >>>>>>>>> the tests might suggest otherwise).
    >>>>>>>>>
    >>>>>>>> I agree that your implementation of reader returns records in
    >>>>>>>> order
    >>>>>>>> - but
    >>>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
    >>>>>>>> the order in which records are produced by a BoundedReader - the
    >>>>>>>> order produced by your reader is ignored, and when applying any
    >>>>>>>> transforms to the
    >>>>>>> PCollection
    >>>>>>>> produced by TikaIO, it is impossible to recover the order in which
    >>>>>>>> your reader returned the records.
    >>>>>>>>
    >>>>>>>> With that in mind, is PCollection<String>, containing individual
    >>>>>>>> Tika-detected items, still the right API for representing the
    >>>>>>>> result of parsing a large number of documents with Tika?
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>> The reason I did it was because I thought
    >>>>>>>>>
    >>>>>>>>> 1) it would make the individual data chunks available faster to
    >>>>>>>>> the pipeline - the parser will continue working through the
    >>>>>>>>> binary/video etc file while the data will already start flowing -
    >>>>>>>>> I agree there should be some test data available confirming it -
    >>>>>>>>> but I'm positive at the moment this approach might yield some
    >>>>>>>>> performance gains with large sets. If the file is large, or if
    >>>>>>>>> it has embedded attachments/videos to deal with, then it may
    >>>>>>>>> be more effective not to have the Beam thread deal with it...
    >>>>>>>>>
    >>>>>>>>> As I said on the PR, this description contains unfounded and
    >>>>>>>>> potentially
    >>>>>>>> incorrect assumptions about how Beam runners execute (or may
    >>>>>>>> execute in
    >>>>>>> the
    >>>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
    >>>>>>> correctly,
    >>>>>>>> you might be assuming that:
    >>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
    >>>>>>> complete
    >>>>>>>> before processing its outputs with downstream transforms
    >>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
    >>>>>>> *concurrently*
    >>>>>>>> with downstream processing of its results
    >>>>>>>> - Passing an element from one thread to another using a
    >>>>>>>> BlockingQueue is free in terms of performance.
    >>>>>>>> All of these are false at least in some runners, and I'm almost
    >>>>>>>> certain that in reality, performance of this approach is worse
    >>>>>>>> than a ParDo in most production runners.
    >>>>>>>>
    >>>>>>>> There are other disadvantages to this approach:
    >>>>>>>> - Doing the bulk of the processing in a separate thread makes it
    >>>>>>> invisible
    >>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
    >>>>>>>> profiling capabilities, or the ability to get the current stack
    >>>>>>>> trace for stuck elements, this approach would make the real
    >>>>>>>> processing invisible to all of these capabilities, and a user
    >>>>>>>> would only see that the bulk of the time is spent waiting for the
    >>>>>>>> next element, but not *why* the next
    >>>>>>> element
    >>>>>>>> is taking long to compute.
    >>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
    >>>>>>>> invisible to Beam, will make it harder for runners to do
    >>>>>>>> autoscaling, binpacking
    >>>>>>> and
    >>>>>>>> other resource management magic (how much of this runners actually
    >>>>>>>> do is
    >>>>>>> a
    >>>>>>>> separate issue), because the runner will have no way of knowing
    >>>>>>>> how much CPU/IO this particular transform is actually using - all
    >>>>>>>> the processing happens in a thread about which the runner is
    >>>>>>>> unaware.
    >>>>>>>> - As far as I can tell, the code also hides exceptions that happen
    >>>>>>>> in the Tika thread
    >>>>>>>> - Adding the thread management makes the code much more complex,
    >>>>>>>> easier
    >>>>>>> to
    >>>>>>>> introduce bugs, and harder for others to contribute
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>> 2) As I commented at the end of [2], having an option to
    >>>>>>>>> concatenate the data chunks first before making them available to
    >>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
    >>>>>>>>> introduce some synchronization issues (though not exactly sure
    >>>>>>>>> yet)
    >>>>>>>>>
    >>>>>>>> What are these issues?
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>> One of valid concerns there is that the reader is polling the
    >>>>>>>>> internal queue so, in theory at least, and perhaps in some rare
    >>>>>>>>> cases too, we may have a case where the max polling time has been
    >>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
    >>>>>>>>> the file data. I think that it can be solved by either 2a)
    >>>>>>>>> configuring the max polling time to a very large number which
    >>>>>>>>> will never be reached for a practical case, or
    >>>>>>>>> 2b) simply use a blocking queue without the time limits - in the
    >>>>>>>>> worst case, if TikaParser spins and fails to report the end of
    >>>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
    >>>>>>>>> I propose to follow 2b).
    >>>>>>>>>
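Option 2b above, a plain BlockingQueue with no polling timeout, can be sketched as follows. The end-of-document marker and helper names are hypothetical; the point is that take() blocks indefinitely, so no chunk can be silently dropped the way a timed poll could drop one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Sketch: consumer side of a producer/consumer handoff with no time limits.
public class BlockingHandoffDemo {
    static final String END_OF_DOCUMENT = "__EOD__";  // hypothetical poison pill

    static List<String> drain(BlockingQueue<String> queue) throws InterruptedException {
        List<String> out = new ArrayList<>();
        // take() blocks until the producer supplies the next chunk or the marker.
        for (String s = queue.take(); !END_OF_DOCUMENT.equals(s); s = queue.take()) {
            out.add(s);
        }
        return out;
    }
}
```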
    >>>>>>>> I agree that there should be no way to unintentionally configure
    >>>>>>>> the transform in a way that will produce silent data loss. Another
    >>>>>>>> reason for not having these tuning knobs is that it goes against
    >>>>>>>> Beam's "no knobs"
    >>>>>>>> philosophy, and that in most cases users have no way of figuring
    >>>>>>>> out a
    >>>>>>> good
    >>>>>>>> value for tuning knobs except for manual experimentation, which is
    >>>>>>>> extremely brittle and typically gets immediately obsoleted by
    >>>>>>>> running on
    >>>>>>> a
    >>>>>>>> new dataset or updating a version of some of the involved
    >>>>>>>> dependencies
    >>>>>>> etc.
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>>
    >>>>>>>>> Please let me know what you think.
    >>>>>>>>> My plan so far is:
    >>>>>>>>> 1) start addressing most of Eugene's comments which would require
    >>>>>>>>> some minor TikaIO updates
    >>>>>>>>> 2) work on removing the TikaSource internal code dealing with
    >>>>>>>>> File patterns which I copied from TextIO at the next stage
    >>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
    >>>>>>>>> users some time to try it with some real complex files and also
    >>>>>>>>> decide if TikaIO can continue implemented as a
    >>>>>>>>> BoundedSource/Reader or not
    >>>>>>>>>
    >>>>>>>>> Eugene, all, will it work if I start with 1) ?
    >>>>>>>>>
    >>>>>>>> Yes, but I think we should start by discussing the anticipated use
    >>>>>>>> cases
    >>>>>>> of
    >>>>>>>> TikaIO and designing an API for it based on those use cases; and
    >>>>>>>> then see what's the best implementation for that particular API
    >>>>>>>> and set of anticipated use cases.
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>> Thanks, Sergey
    >>>>>>>>>
    >>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
    >>>>>>>>> [2] https://github.com/apache/beam/pull/3378
    >>>>>>>>>
    >>>>>>>>
    >>>>>>>
    >>>>>>
    >>>>>
    >>
    > 
    
    
    -- 
    Sergey Beryozkin
    
    Talend Community Coders
    http://coders.talend.com/
    

