Chris A. Mattmann created TIKA-1638:
---------------------------------------

             Summary: Make ExternalParser actually work
                 Key: TIKA-1638
                 URL: https://issues.apache.org/jira/browse/TIKA-1638
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Chris A. Mattmann
            Assignee: Chris A. Mattmann
             Fix For: 1.9


Several issues in ExternalParser currently prevent it from functioning. They
are enumerated below:

* the class org.apache.tika.parser.external.CompositeExternalParser needs to be 
added to the META-INF/services/org.apache.tika.parser.Parser file
* the ExternalParserConfigReader class incorrectly tokenizes the error check
codes, which are separated by ",": the StringTokenizer used has a default
delimiter set that doesn't include ","
* the ExternalParserConfigReader does a check before adding Parsers in which it
simply takes the given String check command and wraps it in a String[].
This causes the check to fail if the command includes spaces (which most
will, even going by its documentation). The command needs to be split on
whitespace, e.g., with .split(" "), in order for this to work and for
ExternalParsers to actually be created and added.
* the ExternalParser needs to split its command (similar to the 
ExternalParserConfigReader) if it includes whitespace (which most commands do) 
in order for the command to be successfully executed.
* exception handling needs to be added to the exec command when running the 
external command.
* any Threads started in, e.g., extractMetadata, sendInput, etc., need to be
started and then joined, so that they actually finish before the function
moves on. As it stands, metadata is sometimes extracted and sometimes not,
because the work is done by threads that aren't forced to complete before the
method goes on to parse and return.
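
To make the tokenizing, command-splitting and thread-joining items above
concrete, here is a minimal sketch of the three techniques in isolation. It is
not the forthcoming patch and not Tika code: the class ExternalParserSketch and
its method names are hypothetical, the in-memory stream stands in for an
external process's output, and the whitespace split is deliberately naive (it
does not honor quoted arguments).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration class; none of these names exist in Tika.
final class ExternalParserSketch {

    // Tokenize comma-separated error codes such as "1,2,126".
    // The delimiter is passed explicitly because StringTokenizer's default
    // delimiter set is whitespace only (" \t\n\r\f") and does not include ",".
    static List<Integer> parseErrorCodes(String codes) {
        List<Integer> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(codes, ",");
        while (st.hasMoreTokens()) {
            result.add(Integer.parseInt(st.nextToken().trim()));
        }
        return result;
    }

    // Split a command line into the String[] form that Runtime.exec and
    // ProcessBuilder accept. Wrapping the whole string in a one-element
    // array would treat "prog -flag" as the name of a single executable.
    static String[] splitCommand(String command) {
        return command.trim().split("\\s+");
    }

    // Read a stream on a background thread, then join that thread so the
    // collected text is complete (and visible) before this method returns.
    static String readInBackground(final InputStream stream) {
        final StringBuilder collected = new StringBuilder();
        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(stream, StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    collected.append(line).append('\n');
                }
            } catch (IOException e) {
                // Sketch only: a real parser would report this failure.
            }
        });
        reader.start();
        try {
            reader.join(); // without this, the result may be empty or partial
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
        }
        return collected.toString();
    }
}
```

Without the join() call, readInBackground could return before the reader
thread has appended anything, which matches the "sometimes extracted, sometimes
not" behavior described above.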

I have a patch which fixes all this. Forthcoming.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
