Adrian Bird created TIKA-4739:
---------------------------------

             Summary: tika-4.0.0-alpha1 - configuration file issues
                 Key: TIKA-4739
                 URL: https://issues.apache.org/jira/browse/TIKA-4739
             Project: Tika
          Issue Type: Bug
            Reporter: Adrian Bird


I've run into several configuration issues and have collected them all here.

*1.* Error in Tika-App Integration Test 20
The [test|https://tika.apache.org/docs/4.0.0-SNAPSHOT/advanced/integration-testing/tika-app.html#_test_20_create_custom_config_file] has a custom tika-config.json file. When I tried it, I got the following error:
{code:java}
Exception in thread "main" 
com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized 
field "timeoutMillis" (class org.apache.tika.pipes.core.PipesConfig), not 
marked as ignorable (25 known properties: "sleepOnStartupTimeoutMillis", 
"shutdownClientAfterMillis", "numClients", "emitWithinMillis", 
"configStoreParams", "emitStrategy", "heartbeatIntervalMs", 
"startupTimeoutMillis", "numEmitters", "staleFetcherTimeoutSeconds", 
"maxFilesProcessedPerProcess", "useSharedServer", "queueSize", 
"socketTimeoutMs", "parseMode", "stopOnlyOnFatal", "tempDirectory", 
"onParseException", "forkedJvmArgs", "maxWaitForClientMillis", "javaPath", 
"staleFetcherDelaySeconds", "configStoreType", "emitMaxEstimatedBytes", 
"emitIntermediateResults")
 at [Source: UNKNOWN; byte offset: #UNKNOWN] (through reference chain: 
org.apache.tika.pipes.core.PipesConfig["timeoutMillis"]) {code}
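
As a quick pre-flight check, something like the following sketch could flag unsupported keys before Tika is started. The set of property names is copied from the error message above. The function name `unknown_pipes_keys` and the sample settings are hypothetical, and the sketch assumes the PipesConfig settings have already been pulled out of tika-config.json as a plain dict.

```python
# Property names copied from the UnrecognizedPropertyException message above.
KNOWN_PIPES_PROPERTIES = {
    "sleepOnStartupTimeoutMillis", "shutdownClientAfterMillis", "numClients",
    "emitWithinMillis", "configStoreParams", "emitStrategy",
    "heartbeatIntervalMs", "startupTimeoutMillis", "numEmitters",
    "staleFetcherTimeoutSeconds", "maxFilesProcessedPerProcess",
    "useSharedServer", "queueSize", "socketTimeoutMs", "parseMode",
    "stopOnlyOnFatal", "tempDirectory", "onParseException", "forkedJvmArgs",
    "maxWaitForClientMillis", "javaPath", "staleFetcherDelaySeconds",
    "configStoreType", "emitMaxEstimatedBytes", "emitIntermediateResults",
}


def unknown_pipes_keys(pipes_settings):
    """Return the keys of a settings dict that are not known properties, sorted."""
    return sorted(k for k in pipes_settings if k not in KNOWN_PIPES_PROPERTIES)


# Hypothetical settings; the values are placeholders, not recommended defaults.
sample = {"numClients": 2, "timeoutMillis": 60000}
print(unknown_pipes_keys(sample))
```

The list only reflects the build that produced the error above, so it would need to be refreshed for other versions.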
*2.* parsers '_exclude' doesn't seem to work
Using the config file from Test 20 above, with the error fixed by using 
'startupTimeoutMillis' instead of 'timeoutMillis', I tried excluding a parser. 
I really wanted to exclude Tesseract, but decided PDF was an easier option.
I removed the 'pdf-parser' section from the config and added this:
{code:java}
    {
      "default-parser": {
        "_exclude": ["pdf-parser"]
      }
    },{code}
When I ran Tika, it produced the same output as before and still processed my 
PDF file.

*2a.* There is a documentation example that has 'exclude' rather than 
'_exclude': 
[vlm-pdf-parsing.json|https://github.com/apache/tika/blob/main/docs/modules/ROOT/examples/vlm-pdf-parsing.json]

*3.* The JSON Configuration example in [Getting Started with Tika Pipes|https://tika.apache.org/docs/4.0.0-SNAPSHOT/pipes/getting-started.html#_json_configuration] doesn't work.
When I try the example using the JSON Configuration, I get the following:
{code:java}
INFO  [pool-2-thread-1] 08:52:11,748 
org.apache.tika.pipes.core.server.FetchHandler Couldn't initialize fetcher for 
fetch id=MyTestFile.pdf
org.apache.tika.pipes.api.fetcher.FetcherNotFoundException: Can't find fetcher 
for id=fsf. Available: []{code}
I assume it is because there is no 'pipes-iterator' in the configuration and it 
is picking up a default.

In my tika-config.json I changed the ids to 'fsf' and 'fse' and got the same 
error.

I noticed that the structure of 'fetchers' and 'emitters' in this example 
differs from the one in 1. above.
This example uses an array whose entries contain an 'id' key/value pair, 
whereas the one in 1. above uses a map with the id as the key.
I changed the structure to match the one in 1. above and it worked (if I left 
the 'id' key in there, I got an error saying 'id' wasn't valid).
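
The restructuring I did by hand can be sketched roughly as follows. This is only an illustration: the function name `array_to_map`, the flat entry shape, and the '/data/input' path are assumptions, and the real Tika entries may nest their settings differently. It moves each 'id' value out of the entry to become the map key, so the 'id' key no longer appears inside the entry.

```python
import json


def array_to_map(entries):
    """Convert [{"id": "x", ...}, ...] into {"x": {...}, ...}, dropping 'id'."""
    result = {}
    for entry in entries:
        body = dict(entry)    # shallow copy so the input list is not modified
        key = body.pop("id")  # raises KeyError if an entry has no 'id'
        if key in result:
            raise ValueError(f"duplicate id: {key}")
        result[key] = body
    return result


# Hypothetical array-style 'fetchers' section; the path is a placeholder.
array_style = [
    {"id": "fsf", "basePath": "/data/input"},
]
print(json.dumps(array_to_map(array_style), indent=2))
```

The same conversion would apply to the 'emitters' section.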

I also noticed that many test files in the repository use the format shown in 
the Getting Started section.

*My questions are:*
a. What structure(s) of 'fetchers' and 'emitters' are supported?
b. What should the example configuration be?

*3a.* There is a note below the command to run the config file: ??'The -i and -o flags override the basePath values in the config when used with tika-app.'?? 
I'm not seeing this; the values used are the 'basePath' values from the config. 
If neither the '-i' directory given on the command line nor the one in the 
config file exists, I get this message about the value in the config file: 
Exception in thread "main" java.lang.RuntimeException: 
java.lang.IllegalArgumentException: "basePath" directory does not exist: 
L:\Apache-Tika\batch-inputxxx

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
