Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "TikaBatchOverview" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaBatchOverview?action=diff&rev1=5&rev2=6

  ## page was renamed from TikaBatch
- = Documentation for the planned tika-batch module =
+ = Documentation for the tika-batch module =
- '''This module is in development and not yet available in trunk'''
+ This module is currently available in trunk and will be available in Tika 1.8. The code has been under development for a while, but there may be surprises. Please test early and often and submit issues when you find them. tika-batch is available as a standalone module, and it is integrated into tika-app.
  == The Need ==
  William Palmer [[http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite|documents]] what many integrators of Tika face -- Tika works very well on most documents, but it can run into problems. As Nick Burch [[http://www.slideshare.net/gagravarr/1s-and-0s|notes]] on slides 47-55, even if Tika fails catastrophically on a small percentage of documents, a small percentage of a lot of documents is still a lot of documents, and "you need to plan for failures". Some types of catastrophic failures include:

@@ -39, +39 @@

  ==== StatusReporter ====
  A !StatusReporter runs in a separate thread. It has visibility into the crawler and the consumers, and it periodically reports on how many files have been processed, how many exceptions, and so on.
- '''StaleChecker'''
+ ==== StaleChecker ====
- This periodically checks for stale consumers. It will cause the !BatchProcess to shut down if it finds more than the allowed number of stale consumers. In practice, I've seen a single stale consumer tie up a huge amount of resources. I recommend allowing 0 stale consumers.
+ This is an inner class within !BatchProcess that periodically checks for stale consumers. It will cause the !BatchProcess to shut down if it finds a stale consumer. In earlier versions of the code, the user could specify the maximum number of stale threads before !BatchProcess shutdown, however, in practice, a single stale consumer can tie up a huge amount of resources. For now, !BatchProcess will shutdown if it finds one stale consumer.
  === ProcessDriver ===
- This initiates the !BatchProcess and monitors it to make sure that it is still alive when it should be. If the !BatchProcess sends a restart signal via stderr to the !ProcessDriver, the !ProcessDriver will restart the !BatchProcess.
+ This initiates the !BatchProcess and monitors it to make sure that it is still alive when it should be. If the !BatchProcess sends a restart signal via stderr or a restart exit code to the !ProcessDriver, the !ProcessDriver will restart the !BatchProcess.
  == File System (FS) Batch, Step 1 ==
  The initial use case for tika-batch is to process a directory of files recursively and generate an output file for each input file. The output directory has the same structure/hierarchy as the input folder, and each output file has a file suffix appended to it depending on the !ContentHandler (".xml", ".txt", ".json", etc).
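
The input-to-output mapping described for the FS batch use case can be illustrated with a small sketch. This is not the tika-batch implementation: the class `OutputPathSketch` and the method `toOutputPath` are hypothetical names, and the sketch assumes the suffix is appended to the full file name, so the original extension is kept.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Hypothetical helper, not part of tika-batch: mirrors an input file's
 * relative path under an output root and appends a handler-specific suffix.
 */
class OutputPathSketch {

    static Path toOutputPath(Path inputRoot, Path outputRoot, Path inputFile, String suffix) {
        // Path of the input file relative to the input root, e.g. sub/report.pdf
        Path relative = inputRoot.relativize(inputFile);
        // Append the suffix to the whole file name (assumption: the original extension is kept)
        String outputName = relative.getFileName().toString() + suffix;
        // Re-root the relative path under the output root, then swap in the new file name
        return outputRoot.resolve(relative).resolveSibling(outputName);
    }

    public static void main(String[] args) {
        Path out = toOutputPath(
                Paths.get("input"),
                Paths.get("output"),
                Paths.get("input", "sub", "report.pdf"),
                ".xml");
        // Prints the mirrored output path using the platform's path separator
        System.out.println(out);
    }
}
```

Under these assumptions, an input file at `input/sub/report.pdf` processed with an XML handler would map to `sub/report.pdf.xml` beneath the output root, so the output tree keeps the same hierarchy as the input tree.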
