Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "TikaBatchOverview" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaBatchOverview?action=diff&rev1=5&rev2=6

  ## page was renamed from TikaBatch
- = Documentation for the planned tika-batch module =
+ = Documentation for the tika-batch module =
- '''This module is in development and not yet available in trunk'''
+ This module is currently available in trunk and will be available in Tika 
1.8.  The code has been under development for a while, but there may be 
surprises.  Please test early and often and submit issues when you find them.  
tika-batch is available as a standalone module, and it is integrated into 
tika-app.
  
  == The Need ==
  William Palmer 
[[http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite|documents]]
 what many integrators of Tika face -- Tika works very well on most documents, 
but it can run into problems. As Nick Burch 
[[http://www.slideshare.net/gagravarr/1s-and-0s|notes]] on slides 47-55, even 
if Tika fails catastrophically on a small percentage of documents, a small 
percentage of a lot of documents is still a lot of documents, and "you need to 
plan for failures".  Some types of catastrophic failures include:
@@ -39, +39 @@

  ==== StatusReporter ====
  A !StatusReporter runs in a separate thread.  It has visibility into the 
crawler and the consumers, and it periodically reports on how many files have 
been processed, how many exceptions, and so on.
  
- '''StaleChecker'''
+ ==== StaleChecker ====
  
- This periodically checks for stale consumers.  It will cause the 
!BatchProcess to shut down if it finds more than the allowed number of stale 
consumers.  In practice, I've seen a single stale consumer tie up a huge amount 
of resources.  I recommend allowing 0 stale consumers.
+ This is an inner class within !BatchProcess that periodically checks for 
stale consumers.  It will cause the !BatchProcess to shut down if it finds a 
stale consumer.  In earlier versions of the code, the user could specify the 
maximum number of stale threads before !BatchProcess shutdown.  In practice, 
however, a single stale consumer can tie up a huge amount of resources.  For 
now, !BatchProcess will shut down if it finds one stale consumer.
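
The stale-consumer rule described above might be sketched roughly as follows.  This is only an illustration, not the tika-batch code: the ConsumerView interface, the method names, and the idea of measuring staleness as time spent on the current file are all assumptions made for the example.

```java
import java.util.List;

// Illustrative sketch only; none of these names come from tika-batch.
final class StaleCheckSketch {

    /** Hypothetical view of a consumer: start time of its current file, or -1 if idle. */
    interface ConsumerView {
        long currentFileStartMillis();
    }

    /** A consumer is treated as stale if it has spent longer than the threshold on one file. */
    static boolean isStale(ConsumerView consumer, long nowMillis, long staleThresholdMillis) {
        long started = consumer.currentFileStartMillis();
        return started >= 0 && nowMillis - started > staleThresholdMillis;
    }

    /** Mirrors the rule in the text: a single stale consumer is enough to request shutdown. */
    static boolean shouldShutDown(List<? extends ConsumerView> consumers,
                                  long nowMillis, long staleThresholdMillis) {
        for (ConsumerView consumer : consumers) {
            if (isStale(consumer, nowMillis, staleThresholdMillis)) {
                return true;
            }
        }
        return false;
    }
}
```

A periodic checker thread would call shouldShutDown at some interval and trigger the shutdown path when it returns true; the interval and threshold values are left open here.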
  
  === ProcessDriver ===
- This initiates the !BatchProcess and monitors it to make sure that it is 
still alive when it should be.  If the !BatchProcess sends a restart signal via 
stderr to the !ProcessDriver, the !ProcessDriver will restart the !BatchProcess.
+ This initiates the !BatchProcess and monitors it to make sure that it is 
still alive when it should be.  If the !BatchProcess sends a restart signal via 
stderr or a restart exit code to the !ProcessDriver, the !ProcessDriver will 
restart the !BatchProcess.
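
As a rough sketch of the restart-on-exit-code behavior, assuming a hypothetical exit code value and a hypothetical cap on restarts (neither is taken from the tika-batch source), the driver loop could look something like this.  The stderr restart signal is left out to keep the example small.

```java
import java.util.function.IntSupplier;

// Illustrative sketch only; the constant and method names are assumptions.
final class RestartLoopSketch {

    /** Hypothetical value; the real restart exit code is whatever tika-batch defines. */
    static final int RESTART_EXIT_CODE = 254;

    /**
     * Launches the child process (abstracted as a supplier that blocks and
     * returns the child's exit code) and relaunches it while it keeps asking
     * for a restart, up to maxRestarts relaunches.  Returns the launch count.
     */
    static int runWithRestarts(IntSupplier launchAndWait, int maxRestarts) {
        int launches = 0;
        int exitCode;
        do {
            exitCode = launchAndWait.getAsInt();
            launches++;
        } while (exitCode == RESTART_EXIT_CODE && launches <= maxRestarts);
        return launches;
    }
}
```

In a real driver, the supplier would wrap something like starting a process with ProcessBuilder and calling waitFor() on it; it is abstracted here so the loop logic can be read on its own.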
  
  == File System (FS) Batch, Step 1 ==
  The initial use case for tika-batch is to process a directory of files 
recursively and generate an output file for each input file.  The output 
directory has the same structure/hierarchy as the input folder, and each output 
file has a file suffix appended to it depending on the !ContentHandler (".xml", 
".txt", ".json", etc).
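
The input-to-output mapping described above can be illustrated with a small sketch.  The class and method names are hypothetical, and the actual tika-batch code may compute the path differently; the only things taken from the text are that the directory hierarchy is mirrored and that a suffix is appended to the file name.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch only; not the tika-batch implementation.
final class FsOutputPathSketch {

    /**
     * Mirrors inputFile's position under inputRoot into outputRoot and
     * appends the suffix to the file name (for example ".xml").
     */
    static Path outputPathFor(Path inputRoot, Path outputRoot, Path inputFile, String suffix) {
        // Path of the input file relative to the input root, e.g. "a/b.pdf"
        Path relative = inputRoot.relativize(inputFile);
        // Same relative location under the output root
        Path target = outputRoot.resolve(relative);
        // Append the suffix rather than replacing the original extension
        return target.resolveSibling(target.getFileName().toString() + suffix);
    }
}
```

For instance, with an input root of "in", an output root of "out", and a suffix of ".xml", the file "in/a/b.pdf" would map to "out/a/b.pdf.xml".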
