Hi all,

One other thing is that Tika extracts metadata and language information, and
the order in which those arrive doesn't matter (keys can be out of order).

Would this be useful?

Cheers,
Chris




On 9/21/17, 2:10 PM, "Sergey Beryozkin" <sberyoz...@gmail.com> wrote:

    Hi Eugene
    
    Thank you, very helpful; let me read it a few times before I get what
    exactly I need to clarify :-). Two questions so far:
    
    On 21/09/17 21:40, Eugene Kirpichov wrote:
    > Thanks all for the discussion. It seems we have consensus that both
    > within-document order and association with the original filename are
    > necessary, but currently absent from TikaIO.
    > 
    > *Association with original file:*
    > Sergey - Beam does not *automatically* provide a way to associate an
    > element with the file it originated from: automatically tracking data
    > provenance is a known very hard research problem on which many papers have
    > been written, and obvious solutions are very easy to break. See related
    > discussion at
    > 
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
    >   .
    > 
    > If you want the elements of your PCollection to contain additional
    > information, you need the elements themselves to contain this information:
    > the elements are self-contained and have no metadata associated with them
    > (beyond the timestamp and windows, universal to the whole Beam model).
    > 
    > *Order within a file:*
    > The only way to have any kind of order within a PCollection is to have the
    > elements of the PCollection contain something ordered, e.g. have a
    > PCollection<List<Something>>, where each List is for one file [I'm assuming
    > Tika, at a low level, works on a per-file basis?]. However, since TikaIO
    > can be applied to very large files, this could produce very large elements,
    > which is a bad idea. Because of this, I don't think the result of applying
    > Tika to a single file can be encoded as a PCollection element.
    > 
    > Given both of these, I think that it's not possible to create a
    > *general-purpose* TikaIO transform that will be better than manual
    > invocation of Tika as a DoFn on the result of FileIO.readMatches().
    > 
    > However, looking at the examples at
    > https://tika.apache.org/1.16/examples.html - almost all of the examples
    > involve extracting a single String from each document. This use case, with
    > the assumption that individual documents are small enough, can certainly be
    > simplified and TikaIO could be a facade for doing just this.
    > 
    > E.g. TikaIO could:
    > - take as input a PCollection<ReadableFile>
    > - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
    > is a class with properties { String content, Metadata metadata }
    
    and what is the 'String' in KV<String,...> given that TikaIO.ParseResult 
    represents the content + (Tika) Metadata of the file such as the author 
    name, etc ? Is it the file name ?
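For illustration, the proposed ParseResult could look roughly like the sketch below. This is only a guess at the shape being discussed; Tika's real Metadata type is simplified here to a plain Map, and the class name is taken from the proposal above.

```java
import java.util.Map;

// Hypothetical sketch of the proposed TikaIO.ParseResult value class.
// Tika's org.apache.tika.metadata.Metadata is stood in for by a Map here.
public class ParseResult {
    private final String content;
    private final Map<String, String> metadata;

    public ParseResult(String content, Map<String, String> metadata) {
        this.content = content;
        this.metadata = metadata;
    }

    public String getContent() { return content; }
    public Map<String, String> getMetadata() { return metadata; }
}
```

The String key of the KV would then carry the per-file identifier the thread is asking about, keeping the element itself self-contained.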
    > - be configured by: a Parser (it implements Serializable so can be
    > specified at pipeline construction time) and a ContentHandler whose
    > toString() will go into "content". ContentHandler does not implement
    > Serializable, so you can not specify it at construction time - however, you
    > can let the user specify either its class (if it's a simple handler like a
    > BodyContentHandler) or specify a lambda for creating the handler
    > (SerializableFunction<Void, ContentHandler>), and potentially you can have
    > a simpler facade for Tika.parseAsString() - e.g. call it
    > TikaIO.parseAllAsStrings().
    > 
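The "specify a factory rather than the handler" idea above can be sketched as follows. All names here are hypothetical stand-ins, not the actual Beam or Tika API: the point is only that a non-serializable handler is created per document from a serializable supplier.

```java
import java.io.Serializable;
import java.util.function.Supplier;

// Sketch: the handler is not Serializable, so the pipeline would capture a
// serializable factory and build a fresh handler for each document.
public class HandlerFactoryDemo {
    interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

    // Non-serializable handler stand-in (think BodyContentHandler).
    static class BodyHandler {
        private final StringBuilder text = new StringBuilder();
        void characters(String chunk) { text.append(chunk); }
        @Override public String toString() { return text.toString(); }
    }

    static String parse(SerializableSupplier<BodyHandler> factory, String... chunks) {
        BodyHandler handler = factory.get();   // fresh handler for this document
        for (String c : chunks) handler.characters(c);
        return handler.toString();             // toString() becomes "content"
    }
}
```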
    > Example usage would look like:
    > 
    >    PCollection<KV<String, ParseResult>> parseResults =
    > p.apply(FileIO.match().filepattern(...))
    >      .apply(FileIO.readMatches())
    >      .apply(TikaIO.parseAllAsStrings())
    > 
    > or:
    > 
    >      .apply(TikaIO.parseAll()
    >          .withParser(new AutoDetectParser())
    >          .withContentHandler(() -> new BodyContentHandler(new
    > ToXMLContentHandler())))
    > 
    > You could also have shorthands for letting the user avoid using FileIO
    > directly in simple cases, for example:
    >      p.apply(TikaIO.parseAsStrings().from(filepattern))
    > 
    > This would of course be implemented as a ParDo or even MapElements, and
    > you'll be able to share the code between parseAll and regular parse.
    > 
    OK. What about the current source on master - should it be marked 
    Experimental till I manage to write something new with the above ideas 
    in mind ? Or is there enough time till 2.2.0 gets released ?
    
    Thanks, Sergey
    > On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sberyoz...@gmail.com>
    > wrote:
    > 
    >> Hi Tim
    >> On 21/09/17 14:33, Allison, Timothy B. wrote:
    >>> Thank you, Sergey.
    >>>
    >>> My knowledge of Apache Beam is limited -- I saw Davor and
    >> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
    >> impressed, but I haven't had a chance to work with it yet.
    >>>
    >>> From my perspective, if I understand this thread (and I may not!),
    >> getting unordered text from _a given file_ is a non-starter for most
    >> applications.  The implementation needs to guarantee order per file, and
    >> the user has to be able to link the "extract" back to a unique identifier
    >> for the document.  If the current implementation doesn't do those things,
    >> we need to change it, IMHO.
    >>>
    >> Right now the Tika-related reader does not associate a given text
    >> fragment with the file name, so a function looking at some text and
    >> trying to find where it came from won't be able to do so.
    >>
    >> So I asked how to do it in Beam, how to attach some context to the given
    >> piece of data. I hope it can be done and if not - then perhaps some
    >> improvement can be applied.
    >>
    >> Re the unordered text - yes - this is what we currently have with Beam +
    >> TikaIO :-).
    >>
    >> The use-case I referred to earlier in this thread (upload PDFs, save
    >> the possibly unordered text to Lucene with the file name 'attached',
    >> and let users search for the files containing some words or phrases;
    >> this works OK given that I can see, for example, the PDF parser
    >> reporting whole lines) can be supported with the current TikaIO
    >> (provided we find a way to 'attach' a file name to the flow).
    >>
    >> I see, though, that supporting total ordering can be a big deal in
    >> other cases. Eugene, can you please explain how it can be done - is it
    >> achievable in principle, without the users having to do some custom
    >> coding ?
    >>
    >>> To the question of -- why is this in Beam at all; why don't we let users
    >> call it if they want it?...
    >>>
    >>> No matter how much we do to Tika, it will behave badly sometimes --
    >> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
    >> using Beam -- folks likely with large batches of unruly/noisy documents --
    >> are more likely to run into these problems than your average
    >> couple-of-thousand-docs-from-our-own-company user. So, if there are things
    >> we can do in Beam to prevent developers around the world from having to
    >> reinvent the wheel for defenses against these problems, then I'd be
    >> enormously grateful if we could put Tika into Beam.  That means:
    >>>
    >>> 1) a process-level timeout (because you can't actually kill a thread in
    >> Java)
    >>> 2) a process-level restart on OOM
    >>> 3) avoid trying to reprocess a badly behaving document
    >>>
    >>> If Beam automatically handles those problems, then I'd say, y, let users
    >> write their own code.  If there is so much as a single configuration knob
    >> (and it sounds like Beam is against complex configuration...yay!) to get
    >> that working in Beam, then I'd say, please integrate Tika into Beam.  From
    >> a safety perspective, it is critical to keep the extraction process
    >> entirely separate (jvm, vm, m, rack, data center!) from the
    >> transformation+loading steps.  IMHO, very few devs realize this because
    >> Tika works well lots of the time...which is why it is critical for us to
    >> make it easy for people to get it right all of the time.
    >>>
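Point 1 in the list above (a process-level timeout, because a stuck thread cannot be killed in Java) can be sketched like this. The helper and command are hypothetical; real code would launch a Tika batch/server JVM rather than an arbitrary command.

```java
import java.util.concurrent.TimeUnit;

// Sketch: run the extraction in a child process and kill the whole process
// on timeout - the process-level equivalent of kill -9 for a hung parse.
public class ProcessTimeoutDemo {
    static boolean runWithTimeout(long timeoutMillis, String... command) throws Exception {
        Process child = new ProcessBuilder(command).start();
        if (!child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            child.destroyForcibly();   // kill the stuck extraction process
            return false;              // treat as a timed-out document
        }
        return true;
    }
}
```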
    >>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
    >> mode first in one jvm, and then I kick off another process to do
    >> transform/loading into Lucene/Solr from the .json files that Tika generates
    >> for each input file.  If I were to scale up, I'd want to maintain this
    >> complete separation of steps.
    >>>
    >>> Apologies if I've derailed the conversation or misunderstood this thread.
    >>>
    >> Major thanks for your input :-)
    >>
    >> Cheers, Sergey
    >>
    >>> Cheers,
    >>>
    >>>                  Tim
    >>>
    >>> -----Original Message-----
    >>> From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
    >>> Sent: Thursday, September 21, 2017 9:07 AM
    >>> To: dev@beam.apache.org
    >>> Cc: Allison, Timothy B. <talli...@mitre.org>
    >>> Subject: Re: TikaIO concerns
    >>>
    >>> Hi All
    >>>
    >>> Please welcome Tim, one of Apache Tika leads and practitioners.
    >>>
    >>> Tim, thanks for joining in :-). If you have some great Apache Tika
    >> stories to share (preferably involving cases where the order in which
    >> Tika-produced data were dealt with by the consumers did not really
    >>> matter) then please do so :-).
    >>>
    >>> At the moment, even though Tika ContentHandler will emit the ordered
    >> data, the Beam runtime will have no guarantees that the downstream
    >> pipeline components will see the data coming in the right order.
    >>>
    >>> (FYI, I understand from the earlier comments that the total ordering is
    >> also achievable but would require the extra API support)
    >>>
    >>> Other comments would be welcome too
    >>>
    >>> Thanks, Sergey
    >>>
    >>> On 21/09/17 10:55, Sergey Beryozkin wrote:
    >>>> I noticed that the PDF and ODT parsers actually split by lines, not
    >>>> individual words, and I'm nearly 100% sure I saw Tika reporting
    >>>> individual lines when it was parsing text files. The 'min text length'
    >>>> feature can help with reporting several lines at a time, etc...
    >>>>
    >>>> I'm working with this PDF all the time:
    >>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
    >>>>
    >>>> try it too if you get a chance.
    >>>>
    >>>> (and I can imagine that not all PDFs/etc represent a 'story' - some
    >>>> can contain log-like content too, for example)
    >>>>
    >>>> That said, I don't know how a parser for format N will behave; it
    >>>> depends on the individual parsers.
    >>>>
    >>>> IMHO it's an equal candidate alongside Text-based bounded IOs...
    >>>>
    >>>> I'd like to know though how to make a file name available to the
    >>>> pipeline which is working with the current text fragment ?
    >>>>
    >>>> Going to try and do some measurements and compare the sync vs async
    >>>> parsing modes...
    >>>>
    >>>> Asked the Tika team to support with some more examples...
    >>>>
    >>>> Cheers, Sergey
    >>>> On 20/09/17 22:17, Sergey Beryozkin wrote:
    >>>>> Hi,
    >>>>>
    >>>>> thanks for the explanations,
    >>>>>
    >>>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
    >>>>>> Hi!
    >>>>>>
    >>>>>> TextIO returns an unordered soup of lines contained in all files you
    >>>>>> ask it to read. People usually use TextIO for reading files where 1
    >>>>>> line corresponds to 1 independent data element, e.g. a log entry, or
    >>>>>> a row of a CSV file - so discarding order is ok.
    >>>>> Just a side note, I'd probably want that be ordered, though I guess
    >>>>> it depends...
    >>>>>> However, there is a number of cases where TextIO is a poor fit:
    >>>>>> - Cases where discarding order is not ok - e.g. if you're doing
    >>>>>> natural language processing and the text files contain actual prose,
    >>>>>> where you need to process a file as a whole. TextIO can't do that.
    >>>>>> - Cases where you need to remember which file each element came
    >>>>>> from, e.g.
    >>>>>> if you're creating a search index for the files: TextIO can't do
    >>>>>> this either.
    >>>>>>
    >>>>>> Both of these issues have been raised in the past against TextIO;
    >>>>>> however it seems that the overwhelming majority of users of TextIO
    >>>>>> use it for logs or CSV files or alike, so solving these issues has
    >>>>>> not been a priority.
    >>>>>> Currently they are solved in a general form via FileIO.read() which
    >>>>>> gives you access to reading a full file yourself - people who want
    >>>>>> more flexibility will be able to use standard Java text-parsing
    >>>>>> utilities on a ReadableFile, without involving TextIO.
    >>>>>>
    >>>>>> Same applies for XmlIO: it is specifically designed for the narrow
    >>>>>> use case where the files contain independent data entries, so
    >>>>>> returning an unordered soup of them, with no association to the
    >>>>>> original file, is the user's intention. XmlIO will not work for
    >>>>>> processing more complex XML files that are not simply a sequence of
    >>>>>> entries with the same tag, and it also does not remember the
    >>>>>> original filename.
    >>>>>>
    >>>>>
    >>>>> OK...
    >>>>>
    >>>>>> However, if my understanding of Tika use cases is correct, it is
    >>>>>> mainly used for extracting content from complex file formats - for
    >>>>>> example, extracting text and images from PDF files or Word
    >>>>>> documents. I believe this is the main difference between it and
    >>>>>> TextIO - people usually use Tika for complex use cases where the
    >>>>>> "unordered soup of stuff" abstraction is not useful.
    >>>>>>
    >>>>>> My suspicion about this is confirmed by the fact that the crux of
    >>>>>> the Tika API is ContentHandler
    >>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
    >>>>>> html?is-external=true
    >>>>>>
    >>>>>> whose
    >>>>>> documentation says "The order of events in this interface is very
    >>>>>> important, and mirrors the order of information in the document
    >> itself."
    >>>>> All that says is that a (Tika) ContentHandler will be a true SAX
    >>>>> ContentHandler...
    >>>>>>
    >>>>>> Let me give a few examples of what I think is possible with the raw
    >>>>>> Tika API, but I think is not currently possible with TikaIO - please
    >>>>>> correct me where I'm wrong, because I'm not particularly familiar
    >>>>>> with Tika and am judging just based on what I read about it.
    >>>>>> - User has 100,000 Word documents and wants to convert each of them
    >>>>>> to text files for future natural language processing.
    >>>>>> - User has 100,000 PDF files with financial statements, each
    >>>>>> containing a bunch of unrelated text and - the main content - a list
    >>>>>> of transactions in PDF tables. User wants to extract each
    >>>>>> transaction as a PCollection element, discarding the unrelated text.
    >>>>>> - User has 100,000 PDF files with scientific papers, and wants to
    >>>>>> extract text from them, somehow parse author and affiliation from
    >>>>>> the text, and compute statistics of topics and terminology usage by
    >>>>>> author name and affiliation.
    >>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
    >>>>>> observing a location over time: they want to extract metadata from
    >>>>>> each image using Tika, analyze the images themselves using some
    >>>>>> other library, and detect anomalies in the overall appearance of the
    >>>>>> location over time as seen from multiple cameras.
    >>>>>> I believe all of these cases can not be solved with TikaIO because
    >>>>>> the resulting PCollection<String> contains no information about
    >>>>>> which String comes from which document and about the order in which
    >>>>>> they appear in the document.
    >>>>> These are good use cases, thanks... I thought you were talking
    >>>>> about the unordered soup of data produced by TikaIO (and its friends
    >>>>> TextIO and alike :-)).
    >>>>> Putting the ordered vs unordered question aside for a sec, why
    >>>>> exactly can a Tika Reader not make the name of the file it's
    >>>>> currently reading from available to the pipeline, as some Beam
    >> pipeline metadata piece ?
    >>>>> Surely it must be possible with Beam ? If not then I would be
    >> surprised...
    >>>>>
    >>>>>>
    >>>>>> I am, honestly, struggling to think of a case where I would want to
    >>>>>> use Tika, but where I *would* be ok with getting an unordered soup
    >>>>>> of strings.
    >>>>>> So some examples would be very helpful.
    >>>>>>
    >>>>> Yes. I'll ask Tika developers to help with some examples, but I'll
    >>>>> give one example where it did not matter to us in what order
    >>>>> Tika-produced data were available to the downstream layer.
    >>>>>
    >>>>> It's a demo an Apache CXF colleague of mine showed at one of the
    >>>>> ApacheCon NAs, and we had a happy audience:
    >>>>>
    >>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search
    >>>>>
    >>>>>
    >>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
    >>>>> into Lucene. We associate a file name with the indexed content and
    >>>>> then let users find a list of PDF files which contain a given word or
    >>>>> few words, details are here
    >>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131
    >>>>>
    >>>>>
    >>>>> I'd say even more involved search engines would not mind supporting a
    >>>>> case like that :-)
    >>>>>
    >>>>> Now there we process one file at a time, and I understand now that
    >>>>> with TikaIO and N files it's all over the place really as far as the
    >>>>> ordering is concerned, which file it's coming from, etc. That's why
    >>>>> the TikaReader must be able to associate the file name with a given
    >>>>> piece of text it's making available to the pipeline.
    >>>>>
    >>>>> I'd be happy to support the ParDo way of linking Tika with Beam.
    >>>>> If it makes things simpler then it would be good, I've just no idea
    >>>>> at the moment how to start the pipeline without using a
    >>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
    >>>>> earlier - how can one avoid it with ParDo when implementing a 'min
    >>>>> len chunk' feature, where the ParDo would have to concatenate several
    >>>>> SAX data pieces first before making a single composite piece
    >> available to the pipeline ?
    >>>>>
    >>>>>
    >>>>>> Another way to state it: currently, if I wanted to solve all of the
    >>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
    >>>>>> API myself on the resulting ReadableFile. How can we make TikaIO
    >>>>>> provide a usability improvement over such usage?
    >>>>>>
    >>>>>
    >>>>>
    >>>>> If you are actually asking, does it really make sense for Beam to
    >>>>> ship Tika related code, given that users can just do it themselves,
    >>>>> I'm not sure.
    >>>>>
    >>>>> IMHO it always works better if users have to provide just a few config
    >>>>> options to an integral part of the framework and see things happening.
    >>>>> It will bring more users.
    >>>>>
    >>>>> Whether the current Tika code (refactored or not) stays with Beam or
    >>>>> not - I'll let you and the team decide; believe it or not I was
    >>>>> seriously contemplating, at the last moment, making it all part of the
    >>>>> Tika project itself and having a bit more flexibility over there with
    >>>>> tweaking things, but now that it is in the Beam snapshot - I don't
    >>>>> know - it's not my decision...
    >>>>>
    >>>>>> I am confused by your other comment - "Does the ordering matter ?
    >>>>>> Perhaps
    >>>>>> for some cases it does, and for some it does not. May be it makes
    >>>>>> sense to support running TikaIO as both the bounded reader/source
    >>>>>> and ParDo, with getting the common code reused." - because using
    >>>>>> BoundedReader or ParDo is not related to the ordering issue, only to
    >>>>>> the issue of asynchronous reading and complexity of implementation.
    >>>>>> The resulting PCollection will be unordered either way - this needs
    >>>>>> to be solved separately by providing a different API.
    >>>>> Right, I see now - so ParDo is not about making Tika-reported data
    >>>>> available to the downstream pipeline components in order, only about
    >>>>> the simpler implementation.
    >>>>> Association with the file should be possible, I hope, and I understand
    >>>>> it would also be possible to optionally make the data come out
    >>>>> ordered...
    >>>>>
    >>>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
    >>>>> let me double check: should we still give some thought to the
    >>>>> possible performance benefit of the current approach ? As I said, I
    >>>>> can easily get rid of all that polling code and use a simple
    >> BlockingQueue.
    >>>>>
    >>>>> Cheers, Sergey
    >>>>>>
    >>>>>> Thanks.
    >>>>>>
    >>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
    >>>>>> <sberyoz...@gmail.com>
    >>>>>> wrote:
    >>>>>>
    >>>>>>> Hi
    >>>>>>>
    >>>>>>> Glad TikaIO is getting some serious attention :-). I believe one
    >>>>>>> thing we both agree upon is that Tika can help Beam in its own
    >>>>>>> unique way.
    >>>>>>>
    >>>>>>> Before trying to reply online, I'd like to state that my main
    >>>>>>> assumption is that TikaIO (as far as the read side is concerned) is
    >>>>>>> no different to Text, XML or similar bounded reader components.
    >>>>>>>
    >>>>>>> I have to admit I don't understand your questions about TikaIO
    >>>>>>> usecases.
    >>>>>>>
    >>>>>>> What are the Text input or XML input use-cases ? These use cases
    >>>>>>> are Tika input cases as well; the only difference is that Tika can
    >>>>>>> not split an individual file into a sequence of sources/etc.
    >>>>>>>
    >>>>>>> TextIO can read from plain text files (possibly zipped), XmlIO is
    >>>>>>> optimized around reading from XML files, and I thought I made it
    >>>>>>> clear (and it is a known fact anyway) that Tika is about reading
    >>>>>>> from basically any file format.
    >>>>>>>
    >>>>>>> Where is the difference (apart from what I've already mentioned) ?
    >>>>>>>
    >>>>>>> Sergey
    >>>>>>>
    >>>>>>>
    >>>>>>>
    >>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
    >>>>>>>> Hi,
    >>>>>>>>
    >>>>>>>> Replies inline.
    >>>>>>>>
    >>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
    >>>>>>>> <sberyoz...@gmail.com>
    >>>>>>>> wrote:
    >>>>>>>>
    >>>>>>>>> Hi All
    >>>>>>>>>
    >>>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
    >>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
    >>>>>>>>> great to try and link both projects together, which led me to
    >>>>>>>>> opening [1] where I typed some early thoughts, followed by PR
    >>>>>>>>> [2].
    >>>>>>>>>
    >>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
    >>>>>>>>> newer review comments from Eugene pending, so I'd like to
    >>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
    >>>>>>>>> decide, based on the feedback from the experts, what to do next.
    >>>>>>>>>
    >>>>>>>>> Apache Tika parsers report the text content in chunks, via
    >>>>>>>>> SAXParser events. It's not possible with Tika to take a file and
    >>>>>>>>> read it bit by bit, line by line, at the 'initiative' of the Beam
    >>>>>>>>> Reader; the only way is to handle the SAXParser callbacks which
    >>>>>>>>> report the data chunks.
    >>>>>>>>> Some parsers may report complete lines, some individual words,
    >>>>>>>>> and some are able to report the data only after they completely
    >>>>>>>>> parse the document.
    >>>>>>>>> It all depends on the data format.
    >>>>>>>>>
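The push-style SAX model described above can be illustrated with the JDK's own SAX parser on a tiny XML document; Tika's ContentHandler follows the same org.xml.sax contract. The point is that the parser drives the callbacks and decides how the text is chunked, not the caller.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustration: collect the text chunks the SAX parser chooses to report.
public class SaxChunksDemo {
    static List<String> textChunks(String xml) throws Exception {
        List<String> chunks = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may call this any number of times per text node.
                chunks.add(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return chunks;
    }
}
```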
    >>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
    >>>>>>>>> to parse the files; the Beam threads only collect the data from
    >>>>>>>>> the internal queue into which the TikaReader's own thread puts
    >>>>>>>>> the data (note the data chunks are ordered even though
    >>>>>>>>> the tests might suggest otherwise).
    >>>>>>>>>
    >>>>>>>> I agree that your implementation of reader returns records in
    >>>>>>>> order
    >>>>>>>> - but
    >>>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
    >>>>>>>> the order in which records are produced by a BoundedReader - the
    >>>>>>>> order produced by your reader is ignored, and when applying any
    >>>>>>>> transforms to the
    >>>>>>> PCollection
    >>>>>>>> produced by TikaIO, it is impossible to recover the order in which
    >>>>>>>> your reader returned the records.
    >>>>>>>>
    >>>>>>>> With that in mind, is PCollection<String>, containing individual
    >>>>>>>> Tika-detected items, still the right API for representing the
    >>>>>>>> result of parsing a large number of documents with Tika?
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>> The reason I did it was because I thought
    >>>>>>>>>
    >>>>>>>>> 1) it would make the individual data chunks available faster to
    >>>>>>>>> the pipeline - the parser will continue working through the
    >>>>>>>>> binary/video etc file while the data will already start flowing -
    >>>>>>>>> I agree there should be some test data available confirming it -
    >>>>>>>>> but I'm positive at the moment this approach might yield some
    >>>>>>>>> performance gains with large sets. If the file is large, or if
    >>>>>>>>> it has embedded attachments/videos to deal with, then it may
    >>>>>>>>> be more effective not to have the Beam thread deal with it...
    >>>>>>>>>
    >>>>>>>>> As I said on the PR, this description contains unfounded and
    >>>>>>>>> potentially
    >>>>>>>> incorrect assumptions about how Beam runners execute (or may
    >>>>>>>> execute in
    >>>>>>> the
    >>>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
    >>>>>>> correctly,
    >>>>>>>> you might be assuming that:
    >>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
    >>>>>>> complete
    >>>>>>>> before processing its outputs with downstream transforms
    >>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
    >>>>>>> *concurrently*
    >>>>>>>> with downstream processing of its results
    >>>>>>>> - Passing an element from one thread to another using a
    >>>>>>>> BlockingQueue is free in terms of performance.
    >>>>>>>> All of these are false at least in some runners, and I'm almost
    >>>>>>>> certain that in reality, performance of this approach is worse
    >>>>>>>> than a ParDo in most production runners.
    >>>>>>>>
    >>>>>>>> There are other disadvantages to this approach:
    >>>>>>>> - Doing the bulk of the processing in a separate thread makes it
    >>>>>>> invisible
    >>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
    >>>>>>>> profiling capabilities, or the ability to get the current stack
    >>>>>>>> trace for stuck elements, this approach would make the real
    >>>>>>>> processing invisible to all of these capabilities, and a user
    >>>>>>>> would only see that the bulk of the time is spent waiting for the
    >>>>>>>> next element, but not *why* the next
    >>>>>>> element
    >>>>>>>> is taking long to compute.
    >>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
    >>>>>>>> invisible to Beam, will make it harder for runners to do
    >>>>>>>> autoscaling, binpacking
    >>>>>>> and
    >>>>>>>> other resource management magic (how much of this runners actually
    >>>>>>>> do is
    >>>>>>> a
    >>>>>>>> separate issue), because the runner will have no way of knowing
    >>>>>>>> how much CPU/IO this particular transform is actually using - all
    >>>>>>>> the processing happens in a thread about which the runner is
    >>>>>>>> unaware.
    >>>>>>>> - As far as I can tell, the code also hides exceptions that happen
    >>>>>>>> in the Tika thread
    >>>>>>>> - Adding the thread management makes the code much more complex,
    >>>>>>>> easier
    >>>>>>> to
    >>>>>>>> introduce bugs, and harder for others to contribute
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>> 2) As I commented at the end of [2], having an option to
    >>>>>>>>> concatenate the data chunks first before making them available to
    >>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
    >>>>>>>>> introduce some synchronization issues (though not exactly sure
    >>>>>>>>> yet)
    >>>>>>>>>
    >>>>>>>> What are these issues?
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>> One of valid concerns there is that the reader is polling the
    >>>>>>>>> internal queue so, in theory at least, and perhaps in some rare
    >>>>>>>>> cases too, we may have a case where the max polling time has been
    >>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
    >>>>>>>>> the file data. I think that it can be solved by either 2a)
    >>>>>>>>> configuring the max polling time to a very large number which
    >>>>>>>>> will never be reached for a practical case, or
    >>>>>>>>> 2b) simply use a blocking queue without the time limits - in the
    >>>>>>>>> worst case, if TikaParser spins and fails to report the end of
    >>>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
    >>>>>>>>> I propose to follow 2b).
    >>>>>>>>>
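Option 2b above, a plain BlockingQueue with no polling timeout, can be sketched as follows. The end-of-document marker and helper names are hypothetical; the point is that take() blocks indefinitely, so no chunk can be silently dropped the way a timed poll could drop one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Sketch: consumer side of a producer/consumer handoff with no time limits.
public class BlockingHandoffDemo {
    static final String END_OF_DOCUMENT = "__EOD__";  // hypothetical poison pill

    static List<String> drain(BlockingQueue<String> queue) throws InterruptedException {
        List<String> out = new ArrayList<>();
        // take() blocks until the producer supplies the next chunk or the marker.
        for (String s = queue.take(); !END_OF_DOCUMENT.equals(s); s = queue.take()) {
            out.add(s);
        }
        return out;
    }
}
```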
    >>>>>>>> I agree that there should be no way to unintentionally configure
    >>>>>>>> the transform in a way that will produce silent data loss. Another
    >>>>>>>> reason for not having these tuning knobs is that it goes against
    >>>>>>>> Beam's "no knobs"
    >>>>>>>> philosophy, and that in most cases users have no way of figuring
    >>>>>>>> out a
    >>>>>>> good
    >>>>>>>> value for tuning knobs except for manual experimentation, which is
    >>>>>>>> extremely brittle and typically gets immediately obsoleted by
    >>>>>>>> running on
    >>>>>>> a
    >>>>>>>> new dataset or updating a version of some of the involved
    >>>>>>>> dependencies
    >>>>>>> etc.
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>>
    >>>>>>>>> Please let me know what you think.
    >>>>>>>>> My plan so far is:
    >>>>>>>>> 1) start addressing most of Eugene's comments which would require
    >>>>>>>>> some minor TikaIO updates
    >>>>>>>>> 2) work on removing the TikaSource internal code dealing with
    >>>>>>>>> File patterns which I copied from TextIO at the next stage
    >>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
    >>>>>>>>> users some time to try it with some real complex files and also
    >>>>>>>>> decide if TikaIO can continue implemented as a
    >>>>>>>>> BoundedSource/Reader or not
    >>>>>>>>>
    >>>>>>>>> Eugene, all, will it work if I start with 1) ?
    >>>>>>>>>
    >>>>>>>> Yes, but I think we should start by discussing the anticipated use
    >>>>>>>> cases
    >>>>>>> of
    >>>>>>>> TikaIO and designing an API for it based on those use cases; and
    >>>>>>>> then see what's the best implementation for that particular API
    >>>>>>>> and set of anticipated use cases.
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>> Thanks, Sergey
    >>>>>>>>>
    >>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
    >>>>>>>>> [2] https://github.com/apache/beam/pull/3378
    >>>>>>>>>
    >>>>>>>>
    >>>>>>>
    >>>>>>
    >>>>>
    >>
    > 
    
    
    -- 
    Sergey Beryozkin
    
    Talend Community Coders
    http://coders.talend.com/
    

