Thanks all for the discussion. It seems we have consensus that both within-document order and association with the original filename are necessary, but currently absent from TikaIO.
*Association with original file:* Sergey - Beam does not *automatically* provide a way to associate an element with the file it originated from: automatically tracking data provenance is a notoriously hard research problem on which many papers have been written, and obvious solutions are very easy to break. See the related discussion at https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E . If you want the elements of your PCollection to contain additional information, the elements themselves must contain that information: elements are self-contained and have no metadata associated with them (beyond the timestamp and windows, which are universal to the whole Beam model).

*Order within a file:* The only way to have any kind of order within a PCollection is to have the elements of the PCollection contain something ordered, e.g. a PCollection<List<Something>> where each List is for one file [I'm assuming Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea. Because of this, I don't think the result of applying Tika to a single file can be encoded as a PCollection element.

Given both of these constraints, I don't think it's possible to create a *general-purpose* TikaIO transform that will be better than manual invocation of Tika as a DoFn on the result of FileIO.readMatches().

However, looking at the examples at https://tika.apache.org/1.16/examples.html - almost all of them involve extracting a single String from each document. This use case, under the assumption that individual documents are small enough, can certainly be simplified, and TikaIO could be a facade for doing just this. E.g.
TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable, so it can be specified at pipeline construction time) and a ContentHandler whose toString() will go into "content". ContentHandler does not implement Serializable, so you cannot specify it at construction time - however, you can let the user specify either its class (if it's a simple handler like a BodyContentHandler) or a lambda for creating the handler (SerializableFunction<Void, ContentHandler>)

Potentially you could also have a simpler facade for Tika.parseToString() - e.g. call it TikaIO.parseAllAsStrings(). Example usage would look like:

    PCollection<KV<String, ParseResult>> parseResults = p
        .apply(FileIO.match().filepattern(...))
        .apply(FileIO.readMatches())
        .apply(TikaIO.parseAllAsStrings())

or:

        .apply(TikaIO.parseAll()
            .withParser(new AutoDetectParser())
            .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO directly in simple cases, for example:

    p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and you'll be able to share the code between parseAll and regular parse.

On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sberyoz...@gmail.com> wrote:
> Hi Tim
> On 21/09/17 14:33, Allison, Timothy B. wrote:
> > Thank you, Sergey.
> >
> > My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet.
> >
> > From my perspective, if I understand this thread (and I may not!), getting unordered text from _a given file_ is a non-starter for most applications.
The implementation needs to guarantee order per file, and > the user has to be able to link the "extract" back to a unique identifier > for the document. If the current implementation doesn't do those things, > we need to change it, IMHO. > > > Right now Tika-related reader does not associate a given text fragment > with the file name, so a function looking at some text and trying to > find where it came from won't be able to do so. > > So I asked how to do it in Beam, how to attach some context to the given > piece of data. I hope it can be done and if not - then perhaps some > improvement can be applied. > > Re the unordered text - yes - this is what we currently have with Beam + > TikaIO :-). > > The use-case I referred to earlier in this thread (upload PDFs - save > the possibly unordered text to Lucene with the file name 'attached', let > users search for the files containing some words - phrases, this works > OK given that I can see PDF parser for ex reporting the lines) can be > supported OK with the current TikaIO (provided we find a way to 'attach' > a file name to the flow). > > I see though supporting the total ordering can be a big deal in other > cases. Eugene, can you please explain how it can be done, is it > achievable in principle, without the users having to do some custom > coding ? > > > To the question of -- why is this in Beam at all; why don't we let users > call it if they want it?... > > > > No matter how much we do to Tika, it will behave badly sometimes -- > permanent hangs requiring kill -9 and OOMs to name a few. I imagine folks > using Beam -- folks likely with large batches of unruly/noisy documents -- > are more likely to run into these problems than your average > couple-of-thousand-docs-from-our-own-company user. So, if there are things > we can do in Beam to prevent developers around the world from having to > reinvent the wheel for defenses against these problems, then I'd be > enormously grateful if we could put Tika into Beam. 
That means: > > > > 1) a process-level timeout (because you can't actually kill a thread in > Java) > > 2) a process-level restart on OOM > > 3) avoid trying to reprocess a badly behaving document > > > > If Beam automatically handles those problems, then I'd say, y, let users > write their own code. If there is so much as a single configuration knob > (and it sounds like Beam is against complex configuration...yay!) to get > that working in Beam, then I'd say, please integrate Tika into Beam. From > a safety perspective, it is critical to keep the extraction process > entirely separate (jvm, vm, m, rack, data center!) from the > transformation+loading steps. IMHO, very few devs realize this because > Tika works well lots of the time...which is why it is critical for us to > make it easy for people to get it right all of the time. > > > > Even in my desktop (gah, y, desktop!) search app, I run Tika in batch > mode first in one jvm, and then I kick off another process to do > transform/loading into Lucene/Solr from the .json files that Tika generates > for each input file. If I were to scale up, I'd want to maintain this > complete separation of steps. > > > > Apologies if I've derailed the conversation or misunderstood this thread. > > > Major thanks for your input :-) > > Cheers, Sergey > > > Cheers, > > > > Tim > > > > -----Original Message----- > > From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] > > Sent: Thursday, September 21, 2017 9:07 AM > > To: dev@beam.apache.org > > Cc: Allison, Timothy B. <talli...@mitre.org> > > Subject: Re: TikaIO concerns > > > > Hi All > > > > Please welcome Tim, one of Apache Tika leads and practitioners. > > > > Tim, thanks for joining in :-). If you have some great Apache Tika > stories to share (preferably involving the cases where it did not really > matter the ordering in which Tika-produced data were dealt with by the > > consumers) then please do so :-). 
> > > > At the moment, even though Tika ContentHandler will emit the ordered > data, the Beam runtime will have no guarantees that the downstream pipeline > components will see the data coming in the right order. > > > > (FYI, I understand from the earlier comments that the total ordering is > also achievable but would require the extra API support) > > > > Other comments would be welcome too > > > > Thanks, Sergey > > > > On 21/09/17 10:55, Sergey Beryozkin wrote: > >> I noticed that the PDF and ODT parsers actually split by lines, not > >> individual words and nearly 100% sure I saw Tika reporting individual > >> lines when it was parsing the text files. The 'min text length' > >> feature can help with reporting several lines at a time, etc... > >> > >> I'm working with this PDF all the time: > >> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf > >> > >> try it too if you get a chance. > >> > >> (and I can imagine not all PDFs/etc representing the 'story' but can > >> be for ex a log-like content too) > >> > >> That said, I don't know how a parser for the format N will behave, it > >> depends on the individual parsers. > >> > >> IMHO it's an equal candidate alongside Text-based bounded IOs... > >> > >> I'd like to know though how to make a file name available to the > >> pipeline which is working with the current text fragment ? > >> > >> Going to try and do some measurements and compare the sync vs async > >> parsing modes... > >> > >> Asked the Tika team to support with some more examples... > >> > >> Cheers, Sergey > >> On 20/09/17 22:17, Sergey Beryozkin wrote: > >>> Hi, > >>> > >>> thanks for the explanations, > >>> > >>> On 20/09/17 16:41, Eugene Kirpichov wrote: > >>>> Hi! > >>>> > >>>> TextIO returns an unordered soup of lines contained in all files you > >>>> ask it to read. People usually use TextIO for reading files where 1 > >>>> line corresponds to 1 independent data element, e.g. 
a log entry, or > >>>> a row of a CSV file - so discarding order is ok. > >>> Just a side note, I'd probably want that be ordered, though I guess > >>> it depends... > >>>> However, there is a number of cases where TextIO is a poor fit: > >>>> - Cases where discarding order is not ok - e.g. if you're doing > >>>> natural language processing and the text files contain actual prose, > >>>> where you need to process a file as a whole. TextIO can't do that. > >>>> - Cases where you need to remember which file each element came > >>>> from, e.g. > >>>> if you're creating a search index for the files: TextIO can't do > >>>> this either. > >>>> > >>>> Both of these issues have been raised in the past against TextIO; > >>>> however it seems that the overwhelming majority of users of TextIO > >>>> use it for logs or CSV files or alike, so solving these issues has > >>>> not been a priority. > >>>> Currently they are solved in a general form via FileIO.read() which > >>>> gives you access to reading a full file yourself - people who want > >>>> more flexibility will be able to use standard Java text-parsing > >>>> utilities on a ReadableFile, without involving TextIO. > >>>> > >>>> Same applies for XmlIO: it is specifically designed for the narrow > >>>> use case where the files contain independent data entries, so > >>>> returning an unordered soup of them, with no association to the > >>>> original file, is the user's intention. XmlIO will not work for > >>>> processing more complex XML files that are not simply a sequence of > >>>> entries with the same tag, and it also does not remember the > >>>> original filename. > >>>> > >>> > >>> OK... > >>> > >>>> However, if my understanding of Tika use cases is correct, it is > >>>> mainly used for extracting content from complex file formats - for > >>>> example, extracting text and images from PDF files or Word > >>>> documents. 
I believe this is the main difference between it and > >>>> TextIO - people usually use Tika for complex use cases where the > >>>> "unordered soup of stuff" abstraction is not useful. > >>>> > >>>> My suspicion about this is confirmed by the fact that the crux of > >>>> the Tika API is ContentHandler > >>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler. > >>>> html?is-external=true > >>>> > >>>> whose > >>>> documentation says "The order of events in this interface is very > >>>> important, and mirrors the order of information in the document > itself." > >>> All that says is that a (Tika) ContentHandler will be a true SAX > >>> ContentHandler... > >>>> > >>>> Let me give a few examples of what I think is possible with the raw > >>>> Tika API, but I think is not currently possible with TikaIO - please > >>>> correct me where I'm wrong, because I'm not particularly familiar > >>>> with Tika and am judging just based on what I read about it. > >>>> - User has 100,000 Word documents and wants to convert each of them > >>>> to text files for future natural language processing. > >>>> - User has 100,000 PDF files with financial statements, each > >>>> containing a bunch of unrelated text and - the main content - a list > >>>> of transactions in PDF tables. User wants to extract each > >>>> transaction as a PCollection element, discarding the unrelated text. > >>>> - User has 100,000 PDF files with scientific papers, and wants to > >>>> extract text from them, somehow parse author and affiliation from > >>>> the text, and compute statistics of topics and terminology usage by > >>>> author name and affiliation. 
> >>>> - User has 100,000 photos in JPEG made by a set of automatic cameras > >>>> observing a location over time: they want to extract metadata from > >>>> each image using Tika, analyze the images themselves using some > >>>> other library, and detect anomalies in the overall appearance of the > >>>> location over time as seen from multiple cameras. > >>>> I believe all of these cases can not be solved with TikaIO because > >>>> the resulting PCollection<String> contains no information about > >>>> which String comes from which document and about the order in which > >>>> they appear in the document. > >>> These are good use cases, thanks... I thought what you were talking > >>> about the unordered soup of data produced by TikaIO (and its friends > >>> TextIO and alike :-)). > >>> Putting the ordered vs unordered question aside for a sec, why > >>> exactly a Tika Reader can not make the name of the file it's > >>> currently reading from available to the pipeline, as some Beam > pipeline metadata piece ? > >>> Surely it can be possible with Beam ? If not then I would be > surprised... > >>> > >>>> > >>>> I am, honestly, struggling to think of a case where I would want to > >>>> use Tika, but where I *would* be ok with getting an unordered soup > >>>> of strings. > >>>> So some examples would be very helpful. > >>>> > >>> Yes. I'll ask Tika developers to help with some examples, but I'll > >>> give one example where it did not matter to us in what order > >>> Tika-produced data were available to the downstream layer. > >>> > >>> It's a demo the Apache CXF colleague of mine showed at one of Apache > >>> Con NAs, and we had a happy audience: > >>> > >>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea > >>> se/samples/jax_rs/search > >>> > >>> > >>> PDF or ODT files uploaded, Tika parses them, and all of that is put > >>> into Lucene. 
We associate a file name with the indexed content and > >>> then let users find a list of PDF files which contain a given word or > >>> few words, details are here > >>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea > >>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal > >>> og.java#L131 > >>> > >>> > >>> I'd say even more involved search engines would not mind supporting a > >>> case like that :-) > >>> > >>> Now there we process one file at a time, and I understand now that > >>> with TikaIO and N files it's all over the place really as far as the > >>> ordering is concerned, which file it's coming from. etc. That's why > >>> TikaReader must be able to associate the file name with a given piece > >>> of text it's making available to the pipeline. > >>> > >>> I'd be happy to support the ParDo way of linking Tika with Beam. > >>> If it makes things simpler then it would be good, I've just no idea > >>> at the moment how to start the pipeline without using a > >>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned > >>> earlier - how can one avoid it with ParDo when implementing a 'min > >>> len chunk' feature, where the ParDo would have to concatenate several > >>> SAX data pieces first before making a single composite piece to the > pipeline ? > >>> > >>> > >>>> Another way to state it: currently, if I wanted to solve all of the > >>>> use cases above, I'd just use FileIO.readMatches() and use the Tika > >>>> API myself on the resulting ReadableFile. How can we make TikaIO > >>>> provide a usability improvement over such usage? > >>>> > >>> > >>> > >>> If you are actually asking, does it really make sense for Beam to > >>> ship Tika related code, given that users can just do it themselves, > >>> I'm not sure. > >>> > >>> IMHO it always works better if users have to provide just few config > >>> options to an integral part of the framework and see things happening. > >>> It will bring more users. 
> >>> > >>> Whether the current Tika code (refactored or not) stays with Beam or > >>> not - I'll let you and the team decide; believe it or not I was > >>> seriously contemplating at the last moment to make it all part of the > >>> Tika project itself and have a bit more flexibility over there with > >>> tweaking things, but now that it is in the Beam snapshot - I don't > >>> know - it's no my decision... > >>> > >>>> I am confused by your other comment - "Does the ordering matter ? > >>>> Perhaps > >>>> for some cases it does, and for some it does not. May be it makes > >>>> sense to support running TikaIO as both the bounded reader/source > >>>> and ParDo, with getting the common code reused." - because using > >>>> BoundedReader or ParDo is not related to the ordering issue, only to > >>>> the issue of asynchronous reading and complexity of implementation. > >>>> The resulting PCollection will be unordered either way - this needs > >>>> to be solved separately by providing a different API. > >>> Right I see now, so ParDo is not about making Tika reported data > >>> available to the downstream pipeline components ordered, only about > >>> the simpler implementation. > >>> Association with the file should be possible I hope, but I understand > >>> it would be possible to optionally make the data coming out in the > >>> ordered way as well... > >>> > >>> Assuming TikaIO stays, and before trying to re-implement as ParDo, > >>> let me double check: should we still give some thought to the > >>> possible performance benefit of the current approach ? As I said, I > >>> can easily get rid of all that polling code, use a simple Blocking > queue. > >>> > >>> Cheers, Sergey > >>>> > >>>> Thanks. 
> >>>> > >>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin > >>>> <sberyoz...@gmail.com> > >>>> wrote: > >>>> > >>>>> Hi > >>>>> > >>>>> Glad TikaIO getting some serious attention :-), I believe one thing > >>>>> we both agree upon is that Tika can help Beam in its own unique way. > >>>>> > >>>>> Before trying to reply online, I'd like to state that my main > >>>>> assumption is that TikaIO (as far as the read side is concerned) is > >>>>> no different to Text, XML or similar bounded reader components. > >>>>> > >>>>> I have to admit I don't understand your questions about TikaIO > >>>>> usecases. > >>>>> > >>>>> What are the Text Input or XML input use-cases ? These use cases > >>>>> are TikaInput cases as well, the only difference is Tika can not > >>>>> split the individual file into a sequence of sources/etc, > >>>>> > >>>>> TextIO can read from the plain text files (possibly zipped), XML - > >>>>> optimized around reading from the XML files, and I thought I made > >>>>> it clear (and it is a known fact anyway) Tika was about reading > >>>>> basically from any file format. > >>>>> > >>>>> Where is the difference (apart from what I've already mentioned) ? > >>>>> > >>>>> Sergey > >>>>> > >>>>> > >>>>> > >>>>> On 19/09/17 23:29, Eugene Kirpichov wrote: > >>>>>> Hi, > >>>>>> > >>>>>> Replies inline. > >>>>>> > >>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin > >>>>>> <sberyoz...@gmail.com> > >>>>>> wrote: > >>>>>> > >>>>>>> Hi All > >>>>>>> > >>>>>>> This is my first post the the dev list, I work for Talend, I'm a > >>>>>>> Beam novice, Apache Tika fan, and thought it would be really > >>>>>>> great to try and link both projects together, which led me to > >>>>>>> opening [1] where I typed some early thoughts, followed by PR > >>>>>>> [2]. 
> >>>>>>> > >>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful) > >>>>>>> newer review comments from Eugene pending, so I'd like to > >>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then > >>>>>>> decide, based on the feedback from the experts, what to do next. > >>>>>>> > >>>>>>> Apache Tika Parsers report the text content in chunks, via > >>>>>>> SaxParser events. It's not possible with Tika to take a file and > >>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line > >>>>>>> by line, the only way is to handle the SAXParser callbacks which > >>>>>>> report the data chunks. > >>>>>>> Some > >>>>>>> parsers may report the complete lines, some individual words, > >>>>>>> with some being able report the data only after the completely > >>>>>>> parse the document. > >>>>>>> All depends on the data format. > >>>>>>> > >>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads > >>>>>>> to parse the files, Beam threads will only collect the data from > >>>>>>> the internal queue where the internal TikaReader's thread will > >>>>>>> put the data into (note the data chunks are ordered even though > >>>>>>> the tests might suggest otherwise). > >>>>>>> > >>>>>> I agree that your implementation of reader returns records in > >>>>>> order > >>>>>> - but > >>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about > >>>>>> the order in which records are produced by a BoundedReader - the > >>>>>> order produced by your reader is ignored, and when applying any > >>>>>> transforms to the > >>>>> PCollection > >>>>>> produced by TikaIO, it is impossible to recover the order in which > >>>>>> your reader returned the records. > >>>>>> > >>>>>> With that in mind, is PCollection<String>, containing individual > >>>>>> Tika-detected items, still the right API for representing the > >>>>>> result of parsing a large number of documents with Tika? 
> >>>>>> > >>>>>> > >>>>>>> > >>>>>>> The reason I did it was because I thought > >>>>>>> > >>>>>>> 1) it would make the individual data chunks available faster to > >>>>>>> the pipeline - the parser will continue working via the > >>>>>>> binary/video etc file while the data will already start flowing - > >>>>>>> I agree there should be some tests data available confirming it - > >>>>>>> but I'm positive at the moment this approach might yield some > >>>>>>> performance gains with the large sets. If the file is large, if > >>>>>>> it has the embedded attachments/videos to deal with, then it may > >>>>>>> be more effective not to get the Beam thread deal with it... > >>>>>>> > >>>>>>> As I said on the PR, this description contains unfounded and > >>>>>>> potentially > >>>>>> incorrect assumptions about how Beam runners execute (or may > >>>>>> execute in > >>>>> the > >>>>>> future) a ParDo or a BoundedReader. For example, if I understand > >>>>> correctly, > >>>>>> you might be assuming that: > >>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to > >>>>> complete > >>>>>> before processing its outputs with downstream transforms > >>>>>> - Beam runners can not run a @ProcessElement call of a ParDo > >>>>> *concurrently* > >>>>>> with downstream processing of its results > >>>>>> - Passing an element from one thread to another using a > >>>>>> BlockingQueue is free in terms of performance All of these are > >>>>>> false at least in some runners, and I'm almost certain that in > >>>>>> reality, performance of this approach is worse than a ParDo in > >>>>> most > >>>>>> production runners. > >>>>>> > >>>>>> There are other disadvantages to this approach: > >>>>>> - Doing the bulk of the processing in a separate thread makes it > >>>>> invisible > >>>>>> to Beam's instrumentation. 
If a Beam runner provided per-transform > >>>>>> profiling capabilities, or the ability to get the current stack > >>>>>> trace for stuck elements, this approach would make the real > >>>>>> processing invisible to all of these capabilities, and a user > >>>>>> would only see that the bulk of the time is spent waiting for the > >>>>>> next element, but not *why* the next > >>>>> element > >>>>>> is taking long to compute. > >>>>>> - Likewise, offloading all the CPU and IO to a separate thread, > >>>>>> invisible to Beam, will make it harder for runners to do > >>>>>> autoscaling, binpacking > >>>>> and > >>>>>> other resource management magic (how much of this runners actually > >>>>>> do is > >>>>> a > >>>>>> separate issue), because the runner will have no way of knowing > >>>>>> how much CPU/IO this particular transform is actually using - all > >>>>>> the processing happens in a thread about which the runner is > >>>>>> unaware. > >>>>>> - As far as I can tell, the code also hides exceptions that happen > >>>>>> in the Tika thread > >>>>>> - Adding the thread management makes the code much more complex, > >>>>>> easier > >>>>> to > >>>>>> introduce bugs, and harder for others to contribute > >>>>>> > >>>>>> > >>>>>>> 2) As I commented at the end of [2], having an option to > >>>>>>> concatenate the data chunks first before making them available to > >>>>>>> the pipeline is useful, and I guess doing the same in ParDo would > >>>>>>> introduce some synchronization issues (though not exactly sure > >>>>>>> yet) > >>>>>>> > >>>>>> What are these issues? > >>>>>> > >>>>>> > >>>>>>> > >>>>>>> One of valid concerns there is that the reader is polling the > >>>>>>> internal queue so, in theory at least, and perhaps in some rare > >>>>>>> cases too, we may have a case where the max polling time has been > >>>>>>> reached, the parser is still busy, and TikaIO fails to report all > >>>>>>> the file data. 
I think that it can be solved by either 2a) > >>>>>>> configuring the max polling time to a very large number which > >>>>>>> will never be reached for a practical case, or > >>>>>>> 2b) simply use a blocking queue without the time limits - in the > >>>>>>> worst case, if TikaParser spins and fails to report the end of > >>>>>>> the document, then, Beam can heal itself if the pipeline blocks. > >>>>>>> I propose to follow 2b). > >>>>>>> > >>>>>> I agree that there should be no way to unintentionally configure > >>>>>> the transform in a way that will produce silent data loss. Another > >>>>>> reason for not having these tuning knobs is that it goes against > >>>>>> Beam's "no knobs" > >>>>>> philosophy, and that in most cases users have no way of figuring > >>>>>> out a > >>>>> good > >>>>>> value for tuning knobs except for manual experimentation, which is > >>>>>> extremely brittle and typically gets immediately obsoleted by > >>>>>> running on > >>>>> a > >>>>>> new dataset or updating a version of some of the involved > >>>>>> dependencies > >>>>> etc. > >>>>>> > >>>>>> > >>>>>>> > >>>>>>> > >>>>>>> Please let me know what you think. > >>>>>>> My plan so far is: > >>>>>>> 1) start addressing most of Eugene's comments which would require > >>>>>>> some minor TikaIO updates > >>>>>>> 2) work on removing the TikaSource internal code dealing with > >>>>>>> File patterns which I copied from TextIO at the next stage > >>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam > >>>>>>> users some time to try it with some real complex files and also > >>>>>>> decide if TikaIO can continue to be implemented as a > >>>>>>> BoundedSource/Reader or not > >>>>>>> > >>>>>>> Eugene, all, will it work if I start with 1) ? 
> >>>>>>> > >>>>>> Yes, but I think we should start by discussing the anticipated use > >>>>>> cases > >>>>> of > >>>>>> TikaIO and designing an API for it based on those use cases; and > >>>>>> then see what's the best implementation for that particular API > >>>>>> and set of anticipated use cases. > >>>>>> > >>>>>> > >>>>>>> > >>>>>>> Thanks, Sergey > >>>>>>> > >>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328 > >>>>>>> [2] https://github.com/apache/beam/pull/3378 > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> >
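The baseline that recurs throughout this thread - manually invoking Tika in a DoFn on the result of FileIO.readMatches(), with each element carrying its own filename - can be sketched roughly as below. This is illustrative only: TikaParseFn is a made-up name, not proposed TikaIO code, and the sketch assumes the Beam Java SDK and Tika are on the classpath.

```java
import java.io.InputStream;
import java.nio.channels.Channels;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

/** Sketch: parse each matched file with Tika, pairing extracted text with its filename. */
class TikaParseFn extends DoFn<FileIO.ReadableFile, KV<String, String>> {
  @ProcessElement
  public void process(ProcessContext c) throws Exception {
    FileIO.ReadableFile file = c.element();
    String filename = file.getMetadata().resourceId().toString();
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
    Metadata metadata = new Metadata();
    try (InputStream is = Channels.newInputStream(file.open())) {
      new AutoDetectParser().parse(is, handler, metadata);
    }
    // The element itself carries the filename - Beam attaches no per-element
    // metadata beyond timestamp and windows, so the association must live in the value.
    c.output(KV.of(filename, handler.toString()));
  }
}

// Usage (sketch):
//   p.apply(FileIO.match().filepattern(...))
//    .apply(FileIO.readMatches())
//    .apply(ParDo.of(new TikaParseFn()));
```

Note that this handles only the filename association; ordering within a file would still require the element to carry something ordered, e.g. a position index emitted alongside each text chunk.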