Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "TikaBatch" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaBatch?action=diff&rev1=1&rev2=2

  = Documentation for the planned tika-batch module =
  '''This module is in development and not yet available in trunk'''
  
  == The Need ==
  William Palmer [[http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite|documents]] what many integrators of Tika face -- Tika works very well on most documents, but it can run into problems. As Nick Burch [[http://www.slideshare.net/gagravarr/1s-and-0s|notes]] on slides 47-55, even if Tika fails catastrophically on a small percentage of documents, a small percentage of a lot of documents is still a lot of documents, and "you need to plan for failures".  Some types of catastrophic failures include:
  
   * Permanent hangs (runaway parsers)
   * Out-of-Memory Errors
   * Memory leaks
  
  Running Tika efficiently and making it robust against these problems are 
non-trivial issues, and it will be helpful to have a framework that the 
community can use, fix and improve so that each integrator doesn't have to 
reinvent these solutions.
  
  == The Basic Design ==
  The basic design of the tika-batch module is intended for conventional processing -- if anything can be reused/modified for Hadoop, please contribute!
  
  The overall design follows a producer/consumer pattern.  The producer and consumers share an !ArrayBlockingQueue.
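  For readers new to the pattern, here is a generic, minimal sketch of one producer and several consumers sharing an !ArrayBlockingQueue.  The class, the string "resources" and the poison-pill shutdown are purely illustrative; this is not the tika-batch API.

{{{
import java.util.concurrent.ArrayBlockingQueue;

public class ProducerConsumerSketch {

    private static final String POISON = "<<done>>";   // sentinel telling a consumer to stop

    public static void main(String[] args) throws InterruptedException {
        // Bounded queue shared by the single producer (the crawler) and all consumers.
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        int numConsumers = 4;
        Thread[] consumers = new Thread[numConsumers];
        for (int i = 0; i < numConsumers; i++) {
            consumers[i] = new Thread(() -> {
                try {
                    while (true) {
                        String resource = queue.take();   // blocks until work is available
                        if (POISON.equals(resource)) {
                            break;                        // crawler is finished
                        }
                        // ... parse the resource here ...
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumers[i].start();
        }

        // "Crawler": put lightweight pointers (here just names) on the queue.
        for (int i = 0; i < 1000; i++) {
            queue.put("file-" + i + ".doc");              // blocks if the queue is full
        }
        // One poison pill per consumer so each one exits cleanly.
        for (int i = 0; i < numConsumers; i++) {
            queue.put(POISON);
        }
        for (Thread t : consumers) {
            t.join();
        }
    }
}
}}}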
  

  Some other implementations of a !ResourceCrawler might include a directory 
listener or a database exporter.  Anything else?
  
  ==== ResourceConsumers ====
  A !ResourceConsumer pulls a resource from the queue and consumes it. A resource should be a lightweight pointer to a file (not the actual bytes!) that supplies an !InputStream and a Metadata object.
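  A hypothetical !FileResource interface along those lines might look like the sketch below; the name and method signatures are assumptions for illustration, not the committed API.

{{{
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;

/**
 * Illustrative sketch only: a lightweight handle to a file.
 * It carries no bytes itself; it only knows how to open a stream
 * and what metadata (e.g. file name, length) is already known.
 */
public interface FileResource {

    /** A unique identifier, e.g. the relative path under the input root. */
    String getResourceId();

    /** Metadata known before parsing, such as the file name and length. */
    Metadata getMetadata();

    /** Opens the underlying bytes on demand; the consumer closes the stream. */
    InputStream openInputStream() throws IOException;
}
}}}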
  
  ==== Interrupter ====
  An interrupter runs in a separate thread and allows users to ask the !BatchProcess to shut down gracefully.
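  The mechanism by which users ask for the shutdown isn't specified here; a minimal sketch, assuming the interrupter simply watches standard input and flips a flag that the rest of the process polls, could look like this:

{{{
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Illustrative only: waits for the user to press ENTER on the console,
 * then asks the batch process to stop pulling new work.
 */
public class Interrupter implements Runnable {

    private final AtomicBoolean shutdownRequested;

    public Interrupter(AtomicBoolean shutdownRequested) {
        this.shutdownRequested = shutdownRequested;
    }

    @Override
    public void run() {
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(System.in))) {
            reader.readLine();              // blocks until the user hits ENTER
            shutdownRequested.set(true);    // crawler/consumers poll this flag
        } catch (IOException e) {
            // stdin closed; nothing to do
        }
    }
}
}}}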
  
  ==== StatusReporter ====
  A !StatusReporter runs in a separate thread.  It has visibility into the 
crawler and the consumers, and it periodically reports on how many files have 
been processed, how many exceptions, and so on.
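  A sketch of the idea, assuming the counts are exposed as simple atomic counters (the names and the 30-second interval are illustrative):

{{{
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: logs progress every 30 seconds. */
public class StatusReporter {

    private final AtomicLong filesProcessed = new AtomicLong();
    private final AtomicLong handledExceptions = new AtomicLong();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(
                () -> System.out.println(
                        "processed: " + filesProcessed.get()
                                + ", exceptions: " + handledExceptions.get()),
                30, 30, TimeUnit.SECONDS);
    }

    public void incrementProcessed() { filesProcessed.incrementAndGet(); }
    public void incrementExceptions() { handledExceptions.incrementAndGet(); }

    public void stop() { scheduler.shutdownNow(); }
}
}}}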
  
  ==== StaleChecker ====
  This periodically checks for stale consumers.  It will cause the !BatchProcess to shut down if it finds more than the allowed number of stale consumers.  In practice, I've seen a single stale consumer tie up a huge amount of resources.  I recommend allowing 0 stale consumers.
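  A sketch of the check, assuming a consumer counts as "stale" when it has spent longer than a configurable timeout on its current file (the staleness definition and all names here are assumptions):

{{{
import java.util.List;

/** Illustrative only: counts consumers stuck on one file for too long. */
public class StaleChecker implements Runnable {

    /** Minimal view of a consumer for this sketch. */
    public interface ConsumerView {
        /** Milliseconds spent on the current file, or 0 if idle. */
        long millisOnCurrentFile();
    }

    private final List<ConsumerView> consumers;
    private final long staleTimeoutMillis;
    private final int maxAllowedStale;           // recommended: 0
    private volatile boolean shutdownRequested = false;

    public StaleChecker(List<ConsumerView> consumers,
                        long staleTimeoutMillis, int maxAllowedStale) {
        this.consumers = consumers;
        this.staleTimeoutMillis = staleTimeoutMillis;
        this.maxAllowedStale = maxAllowedStale;
    }

    public boolean isShutdownRequested() { return shutdownRequested; }

    @Override
    public void run() {
        while (!shutdownRequested) {
            int stale = 0;
            for (ConsumerView c : consumers) {
                if (c.millisOnCurrentFile() > staleTimeoutMillis) {
                    stale++;
                }
            }
            if (stale > maxAllowedStale) {
                shutdownRequested = true;        // the batch process watches this flag
                return;
            }
            try {
                Thread.sleep(60_000);            // check once a minute
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
}}}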
  
  === ProcessDriver ===
  This initiates the !BatchProcess and monitors it to make sure that it is 
still alive when it should be.  If the !BatchProcess sends a restart signal via 
stderr to the !ProcessDriver, the !ProcessDriver will restart the !BatchProcess.
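  A sketch of that driver loop, assuming the !BatchProcess runs as a separate JVM and asks for a restart by printing a known token on stderr (the token, the main class and the jar name are illustrative):

{{{
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/** Illustrative only: keeps relaunching the child while it asks to be restarted. */
public class ProcessDriverSketch {

    private static final String RESTART_TOKEN = "RESTART";   // assumed signal

    public static void main(String[] args) throws IOException, InterruptedException {
        boolean restart = true;
        while (restart) {
            restart = false;
            Process child = new ProcessBuilder(
                    "java", "-cp", "tika-batch.jar", "SomeBatchProcessMain")
                    .start();
            // A real driver would also drain stdout so the child never blocks on a full pipe.
            try (BufferedReader err = new BufferedReader(
                    new InputStreamReader(child.getErrorStream()))) {
                String line;
                while ((line = err.readLine()) != null) {
                    if (line.contains(RESTART_TOKEN)) {
                        restart = true;      // child wants a clean restart
                    }
                }
            }
            child.waitFor();
        }
    }
}
}}}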
  
  == File System (FS) Batch, Step 1 ==
  The initial use case for tika-batch is to process a directory of files recursively and generate an output file for each input file.  The output directory has the same structure/hierarchy as the input folder, and each output file has a file suffix appended to it depending on the !ContentHandler (".xml", ".txt", ".json", etc).
  
  === FS Resource Crawlers ===
  For FSBatch, the directory crawler starts with a root directory and crawls 
all files.  Bells and whistles include:
   * The directory crawler can start with a root directory and a file list, and 
it will "crawl" all of the files on the list relative to the root directory.  
This is very useful for testing or for processing a subset of documents.
   * The directory crawler uses a Tika !DocumentSelector to determine whether or not to add a file to the queue.  The only metadata available to the directory crawler at this point in the processing is the file name and the length of the file in bytes.  The user can specify a regex for files to include (based on filename) and a regex for files to exclude (based on filename); the user can also specify a max bytes limit (see the sketch after this list).
   * The directory crawler also has the idea of a start directory.  This is a 
child directory of the root directory.  This allows users another way to 
process a subset of a directory structure.
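  A sketch of such a selector, assuming the crawler exposes the file name and length through Tika's Metadata object before calling select() (the metadata keys and the class name are assumptions):

{{{
import java.util.regex.Pattern;
import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

/** Illustrative only: include/exclude by filename regex plus a max-size cutoff. */
public class FileNameAndSizeSelector implements DocumentSelector {

    private final Pattern include;        // may be null: include everything
    private final Pattern exclude;        // may be null: exclude nothing
    private final long maxBytes;          // <= 0 means no limit

    public FileNameAndSizeSelector(Pattern include, Pattern exclude, long maxBytes) {
        this.include = include;
        this.exclude = exclude;
        this.maxBytes = maxBytes;
    }

    @Override
    public boolean select(Metadata metadata) {
        String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
        if (name == null) {
            return false;
        }
        if (include != null && !include.matcher(name).find()) {
            return false;
        }
        if (exclude != null && exclude.matcher(name).find()) {
            return false;
        }
        if (maxBytes > 0) {
            String length = metadata.get(Metadata.CONTENT_LENGTH);
            if (length != null && Long.parseLong(length) > maxBytes) {
                return false;
            }
        }
        return true;
    }
}
}}}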
  
  I've also added a strawman driver that runs multiple threads but kicks off a single app.jar process for every file.
  === FS Resource Consumers ===
  The tika-batch package includes an abstract !ResourceConsumer class that 
handles much of the multi-threading burden.  Concrete classes of resource 
consumer only have to implement processFileResource(FileResource fileResource). 
 Concrete classes should also handle all exceptions that they want to handle 
and make appropriate calls to incrementHandledExceptions().
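  A sketch of what a concrete consumer might look like under that contract, re-using the hypothetical !FileResource interface sketched earlier; only the method names come from the description above, the abstract class body and the void return type are guesses:

{{{
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToTextContentHandler;

/** Sketch of the abstract class described above; details are guesses. */
abstract class ResourceConsumer {
    private final AtomicInteger handledExceptions = new AtomicInteger();
    protected void incrementHandledExceptions() { handledExceptions.incrementAndGet(); }
    public abstract void processFileResource(FileResource fileResource);
}

/** Illustrative only: extracts plain text from each FileResource. */
class TextExtractingConsumer extends ResourceConsumer {

    private final AutoDetectParser parser = new AutoDetectParser();

    @Override
    public void processFileResource(FileResource fileResource) {
        Metadata metadata = fileResource.getMetadata();
        try (InputStream is = fileResource.openInputStream()) {
            ToTextContentHandler handler = new ToTextContentHandler();
            parser.parse(is, handler, metadata, new ParseContext());
            // ... hand handler.toString() to an output stream here ...
        } catch (Exception e) {
            // treat any parse problem as "handled" and keep going
            incrementHandledExceptions();
        }
    }
}
}}}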
  
  There are two consumers currently.
   * One handles traditional Tika processing with the !ToTextContentHandler, the !ToHtmlContentHandler or the !ToXMLContentHandler.  The user can specify a write limit and whether or not to process documents recursively.
   * The other offers handling by the (to be added) !RecursiveParserWrapper.  For each input file, the output is a json-formatted list of Metadata items, with a special key for the content.
  
  FSResourceConsumers rely on a !ContentHandlerFactory to get the 
user-specified handler and an !OutputStreamFactory to get the !FileOutputStream 
to write to.
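  A hypothetical !ContentHandlerFactory could be as small as the sketch below (the class shape and the enum of handler types are assumptions for illustration):

{{{
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import org.xml.sax.ContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.ToTextContentHandler;
import org.apache.tika.sax.ToXMLContentHandler;

/** Illustrative only: hands each consumer a fresh handler of the configured type. */
public class SimpleContentHandlerFactory {

    public enum HandlerType { TEXT, HTML, XML }

    private final HandlerType type;

    public SimpleContentHandlerFactory(HandlerType type) {
        this.type = type;
    }

    public ContentHandler getNewContentHandler(OutputStream os)
            throws UnsupportedEncodingException {
        switch (type) {
            case HTML:
                return new ToHTMLContentHandler(os, "UTF-8");
            case XML:
                return new ToXMLContentHandler(os, "UTF-8");
            default:
                return new ToTextContentHandler(os, "UTF-8");
        }
    }
}
}}}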
  

  The !OutputStreamFactory calculates the output (target) file location and name, builds the requisite parent directories and returns the !OutputStream to the consumer.
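  A minimal sketch of that calculation with java.nio, assuming a ".json" suffix and ignoring for the moment the existing-file policies listed next (paths are illustrative):

{{{
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FSOutputStreamSketch {

    public static void main(String[] args) throws IOException {
        Path inputRoot = Paths.get("input");
        Path outputRoot = Paths.get("output");
        Path inputFile = inputRoot.resolve("reports/2014/q1.doc");

        // Mirror the input hierarchy under the output root and append the suffix:
        // output/reports/2014/q1.doc.json
        Path relative = inputRoot.relativize(inputFile);
        Path outputFile = outputRoot.resolve(relative.toString() + ".json");

        // Build the requisite parent directories, then hand the stream to the consumer.
        Files.createDirectories(outputFile.getParent());
        try (OutputStream os = Files.newOutputStream(outputFile)) {
            os.write("extracted text would go here".getBytes("UTF-8"));
        }
    }
}
}}}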
  
  If a target file exists, this will do one of three things:
   * Skip it -- return a null !OutputStream. The consumers know to avoid 
parsing a file if the returned !OutputStream is null.
   * Rename the file -- append e.g. (1) to the end of the file name (before the suffix), incrementing until the name no longer collides with an existing file (a sketch of this appears after the list).
   * Overwrite the existing file.
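  A sketch of the rename strategy mentioned in the second bullet (the helper name and the paths are illustrative):

{{{
import java.nio.file.Files;
import java.nio.file.Path;

public final class TargetNameUtil {

    private TargetNameUtil() {}

    /**
     * If "report.doc.json" exists, try "report.doc(1).json", "report.doc(2).json", ...
     * until a name is free.  Illustrative only; not atomic against concurrent writers.
     */
    public static Path uncollidedTarget(Path target, String suffix) {
        if (!Files.exists(target)) {
            return target;
        }
        String fullName = target.getFileName().toString();
        String base = fullName.substring(0, fullName.length() - suffix.length());
        int i = 1;
        Path candidate = target.resolveSibling(base + "(" + i + ")" + suffix);
        while (Files.exists(candidate)) {
            i++;
            candidate = target.resolveSibling(base + "(" + i + ")" + suffix);
        }
        return candidate;
    }
}
}}}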
  
  == TODO ==
  Any design recommendations at this point?  See [[http://issues.apache.org/jira/i#browse/TIKA-1330|TIKA-1330]].
  
  === Tika Server Client ===
  === Tika Batch Hadoop ===
  Anyone want to contribute?
  
