Hi all, One other thing is that Tika extracts metadata and language information, for which ordering doesn’t matter (the keys can be out of order).
Would this be useful? Cheers, Chris On 9/21/17, 2:10 PM, "Sergey Beryozkin" <sberyoz...@gmail.com> wrote: Hi Eugene Thank you, very helpful, let me read it a few times before I figure out what exactly I need to clarify :-), two questions so far: On 21/09/17 21:40, Eugene Kirpichov wrote: > Thanks all for the discussion. It seems we have consensus that both > within-document order and association with the original filename are > necessary, but currently absent from TikaIO. > > *Association with original file:* > Sergey - Beam does not *automatically* provide a way to associate an > element with the file it originated from: automatically tracking data > provenance is a known very hard research problem on which many papers have > been written, and obvious solutions are very easy to break. See related > discussion at > https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E > . > > If you want the elements of your PCollection to contain additional > information, you need the elements themselves to contain this information: > the elements are self-contained and have no metadata associated with them > (beyond the timestamp and windows, universal to the whole Beam model). > > *Order within a file:* > The only way to have any kind of order within a PCollection is to have the > elements of the PCollection contain something ordered, e.g. have a > PCollection<List<Something>>, where each List is for one file [I'm assuming > Tika, at a low level, works on a per-file basis?]. However, since TikaIO > can be applied to very large files, this could produce very large elements, > which is a bad idea. Because of this, I don't think the result of applying > Tika to a single file can be encoded as a PCollection element. > > Given both of these, I think that it's not possible to create a > *general-purpose* TikaIO transform that will be better than manual > invocation of Tika as a DoFn on the result of FileIO.readMatches().
> > However, looking at the examples at > https://tika.apache.org/1.16/examples.html - almost all of the examples > involve extracting a single String from each document. This use case, with > the assumption that individual documents are small enough, can certainly be > simplified and TikaIO could be a facade for doing just this. > > E.g. TikaIO could: > - take as input a PCollection<ReadableFile> > - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult > is a class with properties { String content, Metadata metadata } and what is the 'String' in KV<String,...> given that TikaIO.ParseResult represents the content + (Tika) Metadata of the file such as the author name, etc ? Is it the file name ? > - be configured by: a Parser (it implements Serializable so can be > specified at pipeline construction time) and a ContentHandler whose > toString() will go into "content". ContentHandler does not implement > Serializable, so you can not specify it at construction time - however, you > can let the user specify either its class (if it's a simple handler like a > BodyContentHandler) or specify a lambda for creating the handler > (SerializableFunction<Void, ContentHandler>), and potentially you can have > a simpler facade for Tika.parseAsString() - e.g. call it > TikaIO.parseAllAsStrings(). > > Example usage would look like: > > PCollection<KV<String, ParseResult>> parseResults = > p.apply(FileIO.match().filepattern(...)) > .apply(FileIO.readMatches()) > .apply(TikaIO.parseAllAsStrings()) > > or: > > .apply(TikaIO.parseAll() > .withParser(new AutoDetectParser()) > .withContentHandler(() -> new BodyContentHandler(new > ToXMLContentHandler()))) > > You could also have shorthands for letting the user avoid using FileIO > directly in simple cases, for example: > p.apply(TikaIO.parseAsStrings().from(filepattern)) > > This would of course be implemented as a ParDo or even MapElements, and > you'll be able to share the code between parseAll and regular parse. 
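To make the proposed shape concrete, here is a minimal sketch of what such a ParseResult value class might look like. This is hypothetical, not Beam or Tika API: Tika's own Metadata type is modeled as a plain Map<String, String> to keep the sketch self-contained, and the file name field reflects the natural answer to Sergey's question above (the String key of the KV would presumably be the file name), though that is not settled in this thread:

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch of the ParseResult value class proposed above.
// The real class would wrap org.apache.tika.metadata.Metadata; here a
// plain Map<String, String> stands in for it so the sketch has no
// Tika dependency. It would be used as the value side of
// KV<String, ParseResult>.
class ParseResult {
    private final String fileName;  // presumably the KV key: the originating file
    private final String content;   // e.g. the toString() of the ContentHandler
    private final Map<String, String> metadata;

    ParseResult(String fileName, String content, Map<String, String> metadata) {
        this.fileName = fileName;
        this.content = content;
        this.metadata = Collections.unmodifiableMap(metadata);
    }

    public String getFileName() { return fileName; }
    public String getContent() { return content; }
    public Map<String, String> getMetadata() { return metadata; }
}
```

Keeping the element a small value class like this (rather than the whole file) is what makes the "documents are small enough" assumption workable for a PCollection.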
> OK. What about the current source on master - should it be marked Experimental till I manage to write something new with the above ideas in mind ? Or is there enough time till 2.2.0 gets released ? Thanks, Sergey > On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sberyoz...@gmail.com> > wrote: > >> Hi Tim >> On 21/09/17 14:33, Allison, Timothy B. wrote: >>> Thank you, Sergey. >>> >>> My knowledge of Apache Beam is limited -- I saw Davor and >> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally >> impressed, but I haven't had a chance to work with it yet. >>> >>> From my perspective, if I understand this thread (and I may not!), >> getting unordered text from _a given file_ is a non-starter for most >> applications. The implementation needs to guarantee order per file, and >> the user has to be able to link the "extract" back to a unique identifier >> for the document. If the current implementation doesn't do those things, >> we need to change it, IMHO. >>> >> Right now the Tika-related reader does not associate a given text fragment >> with the file name, so a function looking at some text and trying to >> find where it came from won't be able to do so. >> >> So I asked how to do it in Beam, how to attach some context to the given >> piece of data. I hope it can be done and if not - then perhaps some >> improvement can be applied. >> >> Re the unordered text - yes - this is what we currently have with Beam + >> TikaIO :-). >> >> The use-case I referred to earlier in this thread (upload PDFs - save >> the possibly unordered text to Lucene with the file name 'attached', let >> users search for the files containing some words - phrases, this works >> OK given that I can see the PDF parser, for example, reporting the lines) can be >> supported OK with the current TikaIO (provided we find a way to 'attach' >> a file name to the flow). >> >> I see though supporting the total ordering can be a big deal in other >> cases.
Eugene, can you please explain how it can be done, is it >> achievable in principle, without the users having to do some custom >> coding ? >> >>> To the question of -- why is this in Beam at all; why don't we let users >> call it if they want it?... >>> >>> No matter how much we do to Tika, it will behave badly sometimes -- >> permanent hangs requiring kill -9 and OOMs to name a few. I imagine folks >> using Beam -- folks likely with large batches of unruly/noisy documents -- >> are more likely to run into these problems than your average >> couple-of-thousand-docs-from-our-own-company user. So, if there are things >> we can do in Beam to prevent developers around the world from having to >> reinvent the wheel for defenses against these problems, then I'd be >> enormously grateful if we could put Tika into Beam. That means: >>> >>> 1) a process-level timeout (because you can't actually kill a thread in >> Java) >>> 2) a process-level restart on OOM >>> 3) avoid trying to reprocess a badly behaving document >>> >>> If Beam automatically handles those problems, then I'd say, y, let users >> write their own code. If there is so much as a single configuration knob >> (and it sounds like Beam is against complex configuration...yay!) to get >> that working in Beam, then I'd say, please integrate Tika into Beam. From >> a safety perspective, it is critical to keep the extraction process >> entirely separate (jvm, vm, m, rack, data center!) from the >> transformation+loading steps. IMHO, very few devs realize this because >> Tika works well lots of the time...which is why it is critical for us to >> make it easy for people to get it right all of the time. >>> >>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch >> mode first in one jvm, and then I kick off another process to do >> transform/loading into Lucene/Solr from the .json files that Tika generates >> for each input file. 
If I were to scale up, I'd want to maintain this >> complete separation of steps. >>> >>> Apologies if I've derailed the conversation or misunderstood this thread. >>> >> Major thanks for your input :-) >> >> Cheers, Sergey >> >>> Cheers, >>> >>> Tim >>> >>> -----Original Message----- >>> From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] >>> Sent: Thursday, September 21, 2017 9:07 AM >>> To: dev@beam.apache.org >>> Cc: Allison, Timothy B. <talli...@mitre.org> >>> Subject: Re: TikaIO concerns >>> >>> Hi All >>> >>> Please welcome Tim, one of Apache Tika leads and practitioners. >>> >>> Tim, thanks for joining in :-). If you have some great Apache Tika >> stories to share (preferably involving cases where the order in which Tika-produced data were dealt with by the >>> consumers did not really matter) then please do so :-). >>> >>> At the moment, even though Tika ContentHandler will emit the ordered >> data, the Beam runtime will have no guarantees that the downstream pipeline >> components will see the data coming in the right order. >>> >>> (FYI, I understand from the earlier comments that the total ordering is >> also achievable but would require the extra API support) >>> >>> Other comments would be welcome too >>> >>> Thanks, Sergey >>> >>> On 21/09/17 10:55, Sergey Beryozkin wrote: >>>> I noticed that the PDF and ODT parsers actually split by lines, not >>>> individual words, and I'm nearly 100% sure I saw Tika reporting individual >>>> lines when it was parsing the text files. The 'min text length' >>>> feature can help with reporting several lines at a time, etc... >>>> >>>> I'm working with this PDF all the time: >>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf >>>> >>>> try it too if you get a chance. >>>> >>>> (and I can imagine not all PDFs/etc represent a 'story' - some can, >>>> for example, contain log-like content too) >>>> >>>> That said, I don't know how a parser for the format N will behave, it >>>> depends on the individual parsers.
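Tim's first defense above - a process-level timeout, given that a hung thread can't actually be killed in Java - might be sketched roughly like this. This is a pure-JDK illustration with a hypothetical helper name, not Beam or Tika code, and it assumes the extraction has been packaged as a separate command to run:

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of a process-level timeout: run the extraction in a
// separate process and destroy it if it exceeds a wall-clock budget. Unlike
// a thread, a child process can always be forcibly terminated.
class ProcessTimeout {
    /** Returns true if the command finished in time, false if it was killed. */
    static boolean runWithTimeout(List<String> command, long timeoutSeconds) throws Exception {
        Process p = new ProcessBuilder(command).start();
        if (p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            return true;  // finished (successfully or not) within the budget
        }
        p.destroyForcibly();  // the process-level equivalent of kill -9
        p.waitFor();          // reap the killed child
        return false;
    }
}
```

A process-level OOM restart falls out of the same structure: the parent notices the child's abnormal exit code and re-launches it, skipping the document that was being processed (Tim's third point).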
>>>> >>>> IMHO it's an equal candidate alongside Text-based bounded IOs... >>>> >>>> I'd like to know though how to make a file name available to the >>>> pipeline which is working with the current text fragment ? >>>> >>>> Going to try and do some measurements and compare the sync vs async >>>> parsing modes... >>>> >>>> Asked the Tika team to support with some more examples... >>>> >>>> Cheers, Sergey >>>> On 20/09/17 22:17, Sergey Beryozkin wrote: >>>>> Hi, >>>>> >>>>> thanks for the explanations, >>>>> >>>>> On 20/09/17 16:41, Eugene Kirpichov wrote: >>>>>> Hi! >>>>>> >>>>>> TextIO returns an unordered soup of lines contained in all files you >>>>>> ask it to read. People usually use TextIO for reading files where 1 >>>>>> line corresponds to 1 independent data element, e.g. a log entry, or >>>>>> a row of a CSV file - so discarding order is ok. >>>>> Just a side note, I'd probably want that to be ordered, though I guess >>>>> it depends... >>>>>> However, there are a number of cases where TextIO is a poor fit: >>>>>> - Cases where discarding order is not ok - e.g. if you're doing >>>>>> natural language processing and the text files contain actual prose, >>>>>> where you need to process a file as a whole. TextIO can't do that. >>>>>> - Cases where you need to remember which file each element came >>>>>> from, e.g. >>>>>> if you're creating a search index for the files: TextIO can't do >>>>>> this either. >>>>>> >>>>>> Both of these issues have been raised in the past against TextIO; >>>>>> however it seems that the overwhelming majority of users of TextIO >>>>>> use it for logs or CSV files or alike, so solving these issues has >>>>>> not been a priority. >>>>>> Currently they are solved in a general form via FileIO.read() which >>>>>> gives you access to reading a full file yourself - people who want >>>>>> more flexibility will be able to use standard Java text-parsing >>>>>> utilities on a ReadableFile, without involving TextIO.
>>>>>> >>>>>> Same applies for XmlIO: it is specifically designed for the narrow >>>>>> use case where the files contain independent data entries, so >>>>>> returning an unordered soup of them, with no association to the >>>>>> original file, is the user's intention. XmlIO will not work for >>>>>> processing more complex XML files that are not simply a sequence of >>>>>> entries with the same tag, and it also does not remember the >>>>>> original filename. >>>>>> >>>>> >>>>> OK... >>>>> >>>>>> However, if my understanding of Tika use cases is correct, it is >>>>>> mainly used for extracting content from complex file formats - for >>>>>> example, extracting text and images from PDF files or Word >>>>>> documents. I believe this is the main difference between it and >>>>>> TextIO - people usually use Tika for complex use cases where the >>>>>> "unordered soup of stuff" abstraction is not useful. >>>>>> >>>>>> My suspicion about this is confirmed by the fact that the crux of >>>>>> the Tika API is ContentHandler >>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true >>>>>> whose >>>>>> documentation says "The order of events in this interface is very >>>>>> important, and mirrors the order of information in the document >> itself." >>>>> All that says is that a (Tika) ContentHandler will be a true SAX >>>>> ContentHandler... >>>>>> >>>>>> Let me give a few examples of what I think is possible with the raw >>>>>> Tika API, but I think is not currently possible with TikaIO - please >>>>>> correct me where I'm wrong, because I'm not particularly familiar >>>>>> with Tika and am judging just based on what I read about it. >>>>>> - User has 100,000 Word documents and wants to convert each of them >>>>>> to text files for future natural language processing.
>>>>>> - User has 100,000 PDF files with financial statements, each >>>>>> containing a bunch of unrelated text and - the main content - a list >>>>>> of transactions in PDF tables. User wants to extract each >>>>>> transaction as a PCollection element, discarding the unrelated text. >>>>>> - User has 100,000 PDF files with scientific papers, and wants to >>>>>> extract text from them, somehow parse author and affiliation from >>>>>> the text, and compute statistics of topics and terminology usage by >>>>>> author name and affiliation. >>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras >>>>>> observing a location over time: they want to extract metadata from >>>>>> each image using Tika, analyze the images themselves using some >>>>>> other library, and detect anomalies in the overall appearance of the >>>>>> location over time as seen from multiple cameras. >>>>>> I believe all of these cases cannot be solved with TikaIO because >>>>>> the resulting PCollection<String> contains no information about >>>>>> which String comes from which document and about the order in which >>>>>> they appear in the document. >>>>> These are good use cases, thanks... I thought you were talking >>>>> about the unordered soup of data produced by TikaIO (and its friends >>>>> TextIO and alike :-)). >>>>> Putting the ordered vs unordered question aside for a sec, why >>>>> exactly can a Tika Reader not make the name of the file it's >>>>> currently reading from available to the pipeline, as some Beam >> pipeline metadata piece ? >>>>> Surely it must be possible with Beam ? If not then I would be >> surprised... >>>>> >>>>>> >>>>>> I am, honestly, struggling to think of a case where I would want to >>>>>> use Tika, but where I *would* be ok with getting an unordered soup >>>>>> of strings. >>>>>> So some examples would be very helpful. >>>>>> >>>>> Yes.
I'll ask Tika developers to help with some examples, but I'll >>>>> give one example where it did not matter to us in what order >>>>> Tika-produced data were available to the downstream layer. >>>>> >>>>> It's a demo the Apache CXF colleague of mine showed at one of the >>>>> ApacheCon NAs, and we had a happy audience: >>>>> >>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search >>>>> >>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put >>>>> into Lucene. We associate a file name with the indexed content and >>>>> then let users find a list of PDF files which contain a given word or >>>>> a few words, details are here >>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131 >>>>> >>>>> I'd say even more involved search engines would not mind supporting a >>>>> case like that :-) >>>>> >>>>> Now there we process one file at a time, and I understand now that >>>>> with TikaIO and N files it's all over the place really as far as the >>>>> ordering is concerned, which file it's coming from, etc. That's why >>>>> TikaReader must be able to associate the file name with a given piece >>>>> of text it's making available to the pipeline. >>>>> >>>>> I'd be happy to support the ParDo way of linking Tika with Beam. >>>>> If it makes things simpler then it would be good, I've just no idea >>>>> at the moment how to start the pipeline without using a >>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned >>>>> earlier - how can one avoid it with ParDo when implementing a 'min >>>>> len chunk' feature, where the ParDo would have to concatenate several >>>>> SAX data pieces first before making a single composite piece available to the >> pipeline ?
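For what it's worth, the 'min len chunk' concatenation can also live inside the ContentHandler itself, before anything reaches Beam at all, which sidesteps the cross-thread synchronization question entirely. A rough, hypothetical sketch (plain SAX from the JDK, no Tika or Beam dependencies; the class name and threshold are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Sketch of the 'min text length' idea: buffer the character chunks a SAX
// parser reports until a minimum size is reached, then emit one composite
// piece. Because the buffering happens inside the handler, on the parsing
// thread, no synchronization with a consumer thread is needed.
class MinLengthHandler extends DefaultHandler {
    private final int minLength;
    private final StringBuilder buffer = new StringBuilder();
    final List<String> pieces = new ArrayList<>();

    MinLengthHandler(int minLength) { this.minLength = minLength; }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        if (buffer.length() >= minLength) {
            pieces.add(buffer.toString());  // emit one composite chunk
            buffer.setLength(0);
        }
    }

    @Override
    public void endDocument() {
        if (buffer.length() > 0) {  // flush whatever remains at end of parse
            pieces.add(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

In a ParDo-based TikaIO, a handler like this could simply be passed to the parser inside @ProcessElement, with the collected pieces output afterwards.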
>>>>> >>>>> >>>>>> Another way to state it: currently, if I wanted to solve all of the >>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika >>>>>> API myself on the resulting ReadableFile. How can we make TikaIO >>>>>> provide a usability improvement over such usage? >>>>>> >>>>> >>>>> >>>>> If you are actually asking, does it really make sense for Beam to >>>>> ship Tika-related code, given that users can just do it themselves, >>>>> I'm not sure. >>>>> >>>>> IMHO it always works better if users have to provide just a few config >>>>> options to an integral part of the framework and see things happening. >>>>> It will bring more users. >>>>> >>>>> Whether the current Tika code (refactored or not) stays with Beam or >>>>> not - I'll let you and the team decide; believe it or not I was >>>>> seriously contemplating, at the last moment, making it all part of the >>>>> Tika project itself and having a bit more flexibility over there with >>>>> tweaking things, but now that it is in the Beam snapshot - I don't >>>>> know - it's not my decision... >>>>> >>>>>> I am confused by your other comment - "Does the ordering matter ? >>>>>> Perhaps >>>>>> for some cases it does, and for some it does not. Maybe it makes >>>>>> sense to support running TikaIO as both the bounded reader/source >>>>>> and ParDo, with getting the common code reused." - because using >>>>>> BoundedReader or ParDo is not related to the ordering issue, only to >>>>>> the issue of asynchronous reading and complexity of implementation. >>>>>> The resulting PCollection will be unordered either way - this needs >>>>>> to be solved separately by providing a different API. >>>>> Right, I see now, so ParDo is not about making Tika-reported data >>>>> available to the downstream pipeline components ordered, only about >>>>> the simpler implementation.
>>>>> Association with the file should be possible I hope, but I understand >>>>> it would be possible to optionally make the data coming out in the >>>>> ordered way as well... >>>>> >>>>> Assuming TikaIO stays, and before trying to re-implement as ParDo, >>>>> let me double check: should we still give some thought to the >>>>> possible performance benefit of the current approach ? As I said, I >>>>> can easily get rid of all that polling code, use a simple Blocking >> queue. >>>>> >>>>> Cheers, Sergey >>>>>> >>>>>> Thanks. >>>>>> >>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin >>>>>> <sberyoz...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hi >>>>>>> >>>>>>> Glad TikaIO getting some serious attention :-), I believe one thing >>>>>>> we both agree upon is that Tika can help Beam in its own unique way. >>>>>>> >>>>>>> Before trying to reply online, I'd like to state that my main >>>>>>> assumption is that TikaIO (as far as the read side is concerned) is >>>>>>> no different to Text, XML or similar bounded reader components. >>>>>>> >>>>>>> I have to admit I don't understand your questions about TikaIO >>>>>>> usecases. >>>>>>> >>>>>>> What are the Text Input or XML input use-cases ? These use cases >>>>>>> are TikaInput cases as well, the only difference is Tika can not >>>>>>> split the individual file into a sequence of sources/etc, >>>>>>> >>>>>>> TextIO can read from the plain text files (possibly zipped), XML - >>>>>>> optimized around reading from the XML files, and I thought I made >>>>>>> it clear (and it is a known fact anyway) Tika was about reading >>>>>>> basically from any file format. >>>>>>> >>>>>>> Where is the difference (apart from what I've already mentioned) ? >>>>>>> >>>>>>> Sergey >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> Replies inline. 
>>>>>>>> >>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin >>>>>>>> <sberyoz...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi All >>>>>>>>> >>>>>>>>> This is my first post to the dev list, I work for Talend, I'm a >>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really >>>>>>>>> great to try and link both projects together, which led me to >>>>>>>>> opening [1] where I typed some early thoughts, followed by PR >>>>>>>>> [2]. >>>>>>>>> >>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful) >>>>>>>>> newer review comments from Eugene pending, so I'd like to >>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then >>>>>>>>> decide, based on the feedback from the experts, what to do next. >>>>>>>>> >>>>>>>>> Apache Tika Parsers report the text content in chunks, via >>>>>>>>> SaxParser events. It's not possible with Tika to take a file and >>>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line >>>>>>>>> by line, the only way is to handle the SAXParser callbacks which >>>>>>>>> report the data chunks. >>>>>>>>> Some >>>>>>>>> parsers may report the complete lines, some individual words, >>>>>>>>> with some being able to report the data only after they completely >>>>>>>>> parse the document. >>>>>>>>> It all depends on the data format. >>>>>>>>> >>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads >>>>>>>>> to parse the files, Beam threads will only collect the data from >>>>>>>>> the internal queue where the internal TikaReader's thread will >>>>>>>>> put the data into (note the data chunks are ordered even though >>>>>>>>> the tests might suggest otherwise). >>>>>>>>> >>>>>>>> I agree that your implementation of reader returns records in >>>>>>>> order >>>>>>>> - but >>>>>>>> Beam PCollections are not ordered.
Nothing in Beam cares about >>>>>>>> the order in which records are produced by a BoundedReader - the >>>>>>>> order produced by your reader is ignored, and when applying any >>>>>>>> transforms to the >>>>>>> PCollection >>>>>>>> produced by TikaIO, it is impossible to recover the order in which >>>>>>>> your reader returned the records. >>>>>>>> >>>>>>>> With that in mind, is PCollection<String>, containing individual >>>>>>>> Tika-detected items, still the right API for representing the >>>>>>>> result of parsing a large number of documents with Tika? >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> The reason I did it was because I thought >>>>>>>>> >>>>>>>>> 1) it would make the individual data chunks available faster to >>>>>>>>> the pipeline - the parser will continue working via the >>>>>>>>> binary/video etc file while the data will already start flowing - >>>>>>>>> I agree there should be some tests data available confirming it - >>>>>>>>> but I'm positive at the moment this approach might yield some >>>>>>>>> performance gains with the large sets. If the file is large, if >>>>>>>>> it has the embedded attachments/videos to deal with, then it may >>>>>>>>> be more effective not to get the Beam thread deal with it... >>>>>>>>> >>>>>>>>> As I said on the PR, this description contains unfounded and >>>>>>>>> potentially >>>>>>>> incorrect assumptions about how Beam runners execute (or may >>>>>>>> execute in >>>>>>> the >>>>>>>> future) a ParDo or a BoundedReader. 
For example, if I understand >>>>>>> correctly, >>>>>>>> you might be assuming that: >>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to >>>>>>> complete >>>>>>>> before processing its outputs with downstream transforms >>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo >>>>>>> *concurrently* >>>>>>>> with downstream processing of its results >>>>>>>> - Passing an element from one thread to another using a >>>>>>>> BlockingQueue is free in terms of performance All of these are >>>>>>>> false at least in some runners, and I'm almost certain that in >>>>>>>> reality, performance of this approach is worse than a ParDo in >>>>>>> most >>>>>>>> production runners. >>>>>>>> >>>>>>>> There are other disadvantages to this approach: >>>>>>>> - Doing the bulk of the processing in a separate thread makes it >>>>>>> invisible >>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform >>>>>>>> profiling capabilities, or the ability to get the current stack >>>>>>>> trace for stuck elements, this approach would make the real >>>>>>>> processing invisible to all of these capabilities, and a user >>>>>>>> would only see that the bulk of the time is spent waiting for the >>>>>>>> next element, but not *why* the next >>>>>>> element >>>>>>>> is taking long to compute. >>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread, >>>>>>>> invisible to Beam, will make it harder for runners to do >>>>>>>> autoscaling, binpacking >>>>>>> and >>>>>>>> other resource management magic (how much of this runners actually >>>>>>>> do is >>>>>>> a >>>>>>>> separate issue), because the runner will have no way of knowing >>>>>>>> how much CPU/IO this particular transform is actually using - all >>>>>>>> the processing happens in a thread about which the runner is >>>>>>>> unaware. 
>>>>>>>> - As far as I can tell, the code also hides exceptions that happen >>>>>>>> in the Tika thread >>>>>>>> - Adding the thread management makes the code much more complex, >>>>>>>> easier >>>>>>> to >>>>>>>> introduce bugs, and harder for others to contribute >>>>>>>> >>>>>>>> >>>>>>>>> 2) As I commented at the end of [2], having an option to >>>>>>>>> concatenate the data chunks first before making them available to >>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would >>>>>>>>> introduce some synchronization issues (though not exactly sure >>>>>>>>> yet) >>>>>>>> What are these issues? >>>>>>>> >>>>>>>>> >>>>>>>>> One of the valid concerns there is that the reader is polling the >>>>>>>>> internal queue so, in theory at least, and perhaps in some rare >>>>>>>>> cases too, we may have a case where the max polling time has been >>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all >>>>>>>>> the file data. I think that it can be solved by either 2a) >>>>>>>>> configuring the max polling time to a very large number which >>>>>>>>> will never be reached for a practical case, or >>>>>>>>> 2b) simply use a blocking queue without the time limits - in the >>>>>>>>> worst case, if TikaParser spins and fails to report the end of >>>>>>>>> the document, then Beam can heal itself if the pipeline blocks. >>>>>>>>> I propose to follow 2b). >>>>>>>> I agree that there should be no way to unintentionally configure >>>>>>>> the transform in a way that will produce silent data loss.
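For reference, option 2b above - a blocking queue with no polling timeout plus an end-of-document marker - might look roughly like this. This is a hypothetical sketch of the queue discipline only (pure JDK, illustrative names), not the actual TikaReader code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Sketch of option 2b: the parsing thread hands chunks to the reader through
// a BlockingQueue with no polling timeout, so no chunk can be silently
// dropped; a sentinel "poison pill" value marks end-of-document.
class QueueHandoff {
    static final String END_OF_DOCUMENT = "\u0000EOD";  // hypothetical sentinel

    /** Consumes chunks until the sentinel arrives; blocks rather than times out. */
    static List<String> drain(BlockingQueue<String> queue) throws InterruptedException {
        List<String> chunks = new ArrayList<>();
        while (true) {
            String chunk = queue.take();  // blocks indefinitely: no silent data loss
            if (END_OF_DOCUMENT.equals(chunk)) {
                return chunks;
            }
            chunks.add(chunk);
        }
    }
}
```

If the parser hangs and never enqueues the sentinel, the consumer blocks rather than truncating the document - which is exactly the "pipeline blocks, Beam can heal itself" behavior 2b describes.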
Another >>>>>>>> reason for not having these tuning knobs is that it goes against >>>>>>>> Beam's "no knobs" >>>>>>>> philosophy, and that in most cases users have no way of figuring >>>>>>>> out a >>>>>>> good >>>>>>>> value for tuning knobs except for manual experimentation, which is >>>>>>>> extremely brittle and typically gets immediately obsoleted by >>>>>>>> running on >>>>>>> a >>>>>>>> new dataset or updating a version of some of the involved >>>>>>>> dependencies >>>>>>> etc. >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Please let me know what you think. >>>>>>>>> My plan so far is: >>>>>>>>> 1) start addressing most of Eugene's comments which would require >>>>>>>>> some minor TikaIO updates >>>>>>>>> 2) work on removing the TikaSource internal code dealing with >>>>>>>>> File patterns which I copied from TextIO at the next stage >>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam >>>>>>>>> users some time to try it with some real complex files and also >>>>>>>>> decide if TikaIO can continue implemented as a >>>>>>>>> BoundedSource/Reader or not >>>>>>>>> >>>>>>>>> Eugene, all, will it work if I start with 1) ? >>>>>>>>> >>>>>>>> Yes, but I think we should start by discussing the anticipated use >>>>>>>> cases >>>>>>> of >>>>>>>> TikaIO and designing an API for it based on those use cases; and >>>>>>>> then see what's the best implementation for that particular API >>>>>>>> and set of anticipated use cases. >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> Thanks, Sergey >>>>>>>>> >>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328 >>>>>>>>> [2] https://github.com/apache/beam/pull/3378 >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >> > -- Sergey Beryozkin Talend Community Coders http://coders.talend.com/