Chris A. Mattmann created TIKA-1638:
---------------------------------------
Summary: Make ExternalParser actually work
Key: TIKA-1638
URL: https://issues.apache.org/jira/browse/TIKA-1638
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.9
Several issues in ExternalParser currently prevent it from functioning. They are
enumerated below:
* the class org.apache.tika.parser.external.CompositeExternalParser needs to be
added to the META-INF/services/org.apache.tika.parser.Parser file
* the ExternalParserConfigReader class incorrectly tokenizes the error check
codes, which are separated by ",": the StringTokenizer it uses relies on the
default delimiter set, which doesn't include ","
* before adding Parsers, the ExternalParserConfigReader runs a check in which it
simply takes the given check command String and wraps it in a String[].
This causes the check to fail if the command includes spaces (which most
will, even according to its own documentation). The command needs to be split
on whitespace, e.g., with .split(" "), in order for the check to work and for
ExternalParsers to actually be created and added.
* the ExternalParser needs to split its command (similar to the
ExternalParserConfigReader) if it includes whitespace (which most commands do)
in order for the command to be successfully executed.
* exception handling needs to be added around the exec call that runs the
external command.
* any Threads started in, e.g., extractMetadata, sendInput, etc., need to be
started and then joined, so that they actually complete before the function
moves on. As it stands, metadata is sometimes extracted and sometimes not,
because the extraction is done by threads that aren't forced to complete
before the parser moves on, parses, and returns.
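
To illustrate the two tokenizing problems above, here is a minimal sketch. It is not the Tika code: the class ConfigTokenizingSketch and both method names are hypothetical, and the sample values ("126,127", "ffmpeg -version") are made up for the example. It also splits on runs of whitespace ("\\s+") rather than the plain .split(" ") mentioned above, so that repeated spaces don't produce empty arguments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper for illustration only; not part of Tika.
class ConfigTokenizingSketch {

    // Parses a comma-separated list of error codes such as "126,127".
    // The "," delimiter is passed explicitly because StringTokenizer's
    // default delimiters are whitespace characters only, so "126,127"
    // would otherwise come back as one single token.
    static List<Integer> parseErrorCodes(String codes) {
        List<Integer> result = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(codes, ",");
        while (tokens.hasMoreTokens()) {
            result.add(Integer.parseInt(tokens.nextToken().trim()));
        }
        return result;
    }

    // Splits a command line into program and arguments on whitespace,
    // instead of wrapping the whole String in a one-element String[].
    // Assumption: no quoted arguments containing spaces.
    static String[] splitCommand(String command) {
        return command.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(parseErrorCodes("126,127"));
        System.out.println(Arrays.toString(splitCommand("ffmpeg -version")));
    }
}
```

A naive whitespace split like this breaks arguments that are quoted and contain spaces; handling those would need a real command-line tokenizer.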
I have a patch that fixes all of this. Forthcoming.
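
In the meantime, the exec and thread points can be sketched roughly as follows. This is not the patch and not the ExternalParser code: the class JoinedReaderSketch and its methods startReader and run are hypothetical names, and the sketch assumes the command needs no input and that its output fits in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration only; not Tika's ExternalParser.
class JoinedReaderSketch {

    // Starts a thread that copies everything from 'in' into 'sink'
    // and returns the started thread so the caller can join it.
    static Thread startReader(InputStream in, ByteArrayOutputStream sink) {
        Thread reader = new Thread(() -> {
            byte[] chunk = new byte[4096];
            int n;
            try {
                while ((n = in.read(chunk)) != -1) {
                    sink.write(chunk, 0, n);
                }
            } catch (IOException e) {
                // Sketch only: keep whatever was read before the failure.
            }
        });
        reader.start();
        return reader;
    }

    // Runs an already-split command and returns its standard output.
    static String run(String[] cmd) throws IOException, InterruptedException {
        Process process;
        try {
            process = Runtime.getRuntime().exec(cmd);
        } catch (IOException e) {
            // exec throws IOException when the program cannot be started.
            throw new IOException("Failed to start command: " + cmd[0], e);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteArrayOutputStream err = new ByteArrayOutputStream();
        try {
            Thread outReader = startReader(process.getInputStream(), out);
            Thread errReader = startReader(process.getErrorStream(), err);
            process.getOutputStream().close(); // assumption: no input to send
            // Join both readers so the output is complete before returning;
            // without the joins the buffers may be empty or partial.
            outReader.join();
            errReader.join();
            process.waitFor();
        } finally {
            process.destroy();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Both readers are started before either is joined, so a command that fills its stderr pipe cannot stall while stdout is still being drained.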