Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "ErrorsAndExceptions" page has been changed by NickBurch:
https://wiki.apache.org/tika/ErrorsAndExceptions

Comment:
Document errors, exceptions, and things to work out

New page:
This page is an in-progress attempt to document best practice for 
[[http://tika.apache.org/1.3/api/org/apache/tika/parser/Parser.html|Parser]] 
authors on how to handle problematic files.

= Allowed Responses =
The [[http://tika.apache.org/1.3/api/org/apache/tika/parser/Parser.html#parse%28java.io.InputStream,%20org.xml.sax.ContentHandler,%20org.apache.tika.metadata.Metadata,%20org.apache.tika.parser.ParseContext%29|Parser contract]] is that a Tika parser will populate the 
[[http://tika.apache.org/1.3/api/org/apache/tika/metadata/Metadata.html|Metadata]] object and send XML events to the 
[[http://download.oracle.com/javase/1.5.0/docs/api/org/xml/sax/ContentHandler.html|ContentHandler]], or else throw one of:
 * IOException - if the document stream could not be read
 * SAXException - if the SAX events could not be processed
 * !TikaException - if the document could not be parsed
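
The three failure modes above can be sketched from the caller's side. This is a minimal illustration that uses only JDK classes, not Tika itself: the `TikaException` class below is a local stand-in for `org.apache.tika.exception.TikaException`, and `ToyParser`, its simplified two-argument `parse` method, and the "TOY1" file header are hypothetical names invented for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Local stand-in (assumption) for org.apache.tika.exception.TikaException,
// so the sketch compiles without Tika on the classpath.
class TikaException extends Exception {
    TikaException(String message) { super(message); }
}

// Hypothetical parser for an invented format whose files start with "TOY1".
class ToyParser {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    void parse(InputStream stream, ContentHandler handler)
            throws IOException, SAXException, TikaException {
        // Reading may throw IOException (readAllBytes needs Java 9+)
        byte[] data = stream.readAllBytes();
        String text = new String(data, StandardCharsets.UTF_8);
        if (!text.startsWith("TOY1")) {
            // The stream was readable, but the document cannot be parsed
            throw new TikaException("Not a valid TOY1 file");
        }
        char[] body = text.substring(4).toCharArray();
        // Any of the handler calls below may throw SAXException
        handler.startDocument();
        handler.startElement(XHTML, "body", "body", new AttributesImpl());
        handler.characters(body, 0, body.length);
        handler.endElement(XHTML, "body", "body");
        handler.endDocument();
    }
}

class ParserContractSketch {
    // Runs the parser and reports which of the allowed outcomes occurred.
    static String describe(InputStream stream) {
        StringBuilder out = new StringBuilder();
        ContentHandler handler = new DefaultHandler() {
            @Override public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            new ToyParser().parse(stream, handler);
            return "text: " + out;
        } catch (IOException e) {
            return "unreadable stream";
        } catch (SAXException e) {
            return "SAX events not processed";
        } catch (TikaException e) {
            return "unparseable document";
        }
    }
}
```

With these stand-ins, `ParserContractSketch.describe` returns "text: hello" for a stream containing "TOY1hello" and "unparseable document" for a stream containing "junk". Real Tika callers would use the four-argument `parse` method linked above.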

= Suggested Responses =
== Corrupt File ==
If the file is corrupted in some way and cannot be processed, a !TikaException 
should be thrown (see the 
[[http://tika.apache.org/1.3/api/org/apache/tika/parser/Parser.html#parse%28java.io.InputStream,%20org.xml.sax.ContentHandler,%20org.apache.tika.metadata.Metadata,%20org.apache.tika.parser.ParseContext%29|Parser contract]]).

== File cannot be read ==
If an IO problem occurs when reading the document, an IOException should be 
thrown (see the 
[[http://tika.apache.org/1.3/api/org/apache/tika/parser/Parser.html#parse%28java.io.InputStream,%20org.xml.sax.ContentHandler,%20org.apache.tika.metadata.Metadata,%20org.apache.tika.parser.ParseContext%29|Parser contract]]).

== "Empty" File (No Text) ==
If there is no text in the file, either because it's empty (e.g. a 0 byte text 
file) or because it's a format that doesn't have text (e.g. an image), then ???

''TBC - should the body be opened then immediately closed, or something else?''
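
To make the first option in the question above concrete, here is a rough sketch of what "body opened then immediately closed" looks like as SAX events. It only illustrates that one option and does not settle the question. It uses JDK SAX classes only, and `EmptyBodySketch`, `emitEmptyDocument` and `recordEvents` are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class EmptyBodySketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Emits a document whose body is opened and immediately closed,
    // so the handler sees valid structure but no character data.
    static void emitEmptyDocument(ContentHandler handler) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    // Records element and text events so the result can be inspected.
    static List<String> recordEvents() throws SAXException {
        List<String> events = new ArrayList<>();
        emitEmptyDocument(new DefaultHandler() {
            @Override public void startElement(String uri, String local,
                                               String qName, Attributes atts) {
                events.add("start:" + local);
            }
            @Override public void endElement(String uri, String local, String qName) {
                events.add("end:" + local);
            }
            @Override public void characters(char[] ch, int start, int length) {
                events.add("text");
            }
        });
        return events;
    }
}
```

Here `recordEvents()` yields the four events start:html, start:body, end:body, end:html, with no text event in between.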

== File is password protected ==
[[http://tika.apache.org/1.3/api/org/apache/tika/exception/EncryptedDocumentException.html|EncryptedDocumentException]] 
(a subtype of !TikaException) should be thrown if the file is password 
protected and either no password or an incorrect password is given.

(To supply a password, a !PasswordProvider should be placed on the !ParseContext)
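
The password lookup and the exception ordering can be sketched as follows. All of the classes in this block are local stand-ins (assumptions) named after their Tika counterparts, so that the sketch compiles without Tika: the real `PasswordProvider.getPassword` takes a Metadata argument, which is omitted here, and `ProtectedToyParser`, `PasswordSketch` and the password "secret" are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Local stand-ins (assumptions) for the Tika classes of the same names.
class TikaException extends Exception {
    TikaException(String message) { super(message); }
}
class EncryptedDocumentException extends TikaException {
    EncryptedDocumentException(String message) { super(message); }
}
interface PasswordProvider {
    String getPassword(); // simplified: the real method takes a Metadata
}
class ParseContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();
    <T> void set(Class<T> key, T value) { entries.put(key, value); }
    <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
}

// Hypothetical parser for a document protected by the password "secret".
class ProtectedToyParser {
    String parse(ParseContext context) throws TikaException {
        PasswordProvider provider = context.get(PasswordProvider.class);
        if (provider == null) {
            throw new EncryptedDocumentException("Password required, none given");
        }
        if (!"secret".equals(provider.getPassword())) {
            throw new EncryptedDocumentException("Incorrect password");
        }
        return "decrypted text";
    }
}

class PasswordSketch {
    static String tryParse(ParseContext context) {
        try {
            return new ProtectedToyParser().parse(context);
        } catch (EncryptedDocumentException e) {
            // Caught before TikaException because it is the more specific type
            return "encrypted: " + e.getMessage();
        } catch (TikaException e) {
            return "failed: " + e.getMessage();
        }
    }
}
```

Because the encrypted case is a subtype of the general parse failure, a caller that wants to prompt for a password has to catch it first, as `tryParse` does. A caller that only catches the general exception still handles it, just without knowing a password would help.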

== Parser can't handle File ==
If the file is in a sub-format that the parser can't handle (e.g. the parser 
supports v2 and v3, the document is v1, and all share the same mimetype), or 
uses some options that mean the parser can't sensibly handle it, then 

''TBC - should this be an exception, or treated as an empty file?''

== Document Structure is Broken ==
If something is very broken with the file or its structure, so that it will 
be impossible to output valid XML for it, then a SAXException is probably the 
right thing to throw.
