[ https://issues.apache.org/jira/browse/TIKA-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083840#comment-18083840 ]
ASF GitHub Bot commented on TIKA-4739:
--------------------------------------
tballison commented on code in PR #2837:
URL: https://github.com/apache/tika/pull/2837#discussion_r3310963697
##########
tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/AbstractComponentManager.java:
##########
@@ -115,6 +115,21 @@ protected Map<String, ExtensionConfig> validateAndCollectConfigs(
Map<String, ExtensionConfig> configs = new HashMap<>();
if (configNode != null && !configNode.isNull()) {
+ // Strict shape check. The section must be a JSON object keyed by
+ // instance ID — e.g. {"my-fetcher": {"file-system-fetcher": {...}}}.
+ // Without this, an array like
+ // "fetchers": [{"file-system-fetcher": {"id": "my-fetcher", ...}}]
+ // would be silently walked past (JsonNode.fields() on an ArrayNode
+ // returns an empty iterator), leaving the manager with no registered
+ // components and the user with an "Available: []" error at lookup
+ // time instead of at load time.
+ if (!configNode.isObject()) {
+ throw new TikaConfigException(
+ "Invalid '" + getConfigKey() + "' configuration: expected a JSON "
+ + "object keyed by instance ID, e.g. {\"my-id\": {\"type-name\": "
+ + "{...config...}}}. Got " + configNode.getNodeType() + ". "
+ + "(Array-style configurations are not supported.)");
Review Comment:
I don't think the ignite configs should have any fetcher/emitter configs.
Let's strip those out.
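
For readers skimming the hunk above, the fail-fast idea can be sketched in plain Java without Jackson. This is only an illustration under stated assumptions, not the code in the PR: `ShapeCheckSketch`, `ConfigShapeException`, and `collectConfigs` are hypothetical names, the section is modelled as a plain `Map` or `List` rather than a `JsonNode`, and the exception is unchecked here for brevity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ShapeCheckSketch {

    /** Hypothetical stand-in for TikaConfigException (unchecked here for brevity). */
    static class ConfigShapeException extends RuntimeException {
        ConfigShapeException(String message) {
            super(message);
        }
    }

    /**
     * Hypothetical helper: returns the section's entries keyed by instance ID,
     * or fails immediately if the section is not a map.
     */
    static Map<String, Object> collectConfigs(String configKey, Object section) {
        Map<String, Object> configs = new LinkedHashMap<>();
        if (section == null) {
            return configs; // a missing section means no components are configured
        }
        if (!(section instanceof Map)) {
            // Reject at load time instead of silently collecting nothing.
            String got = (section instanceof List) ? "ARRAY" : section.getClass().getSimpleName();
            throw new ConfigShapeException("Invalid '" + configKey
                    + "' configuration: expected an object keyed by instance ID. Got " + got + ".");
        }
        for (Map.Entry<?, ?> entry : ((Map<?, ?>) section).entrySet()) {
            configs.put(String.valueOf(entry.getKey()), entry.getValue());
        }
        return configs;
    }

    public static void main(String[] args) {
        // Map keyed by instance ID: accepted.
        Object keyed = Map.of("my-fetcher", Map.of("file-system-fetcher", Map.of()));
        System.out.println(collectConfigs("fetchers", keyed).keySet());

        // Array-style section: rejected when the config is loaded.
        Object arrayStyle = List.of(Map.of("file-system-fetcher", Map.of("id", "my-fetcher")));
        try {
            collectConfigs("fetchers", arrayStyle);
        } catch (ConfigShapeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is the ordering: the shape is validated before any entries are collected, so an array-style section produces an error when the config is loaded rather than an empty registry that only surfaces at lookup time.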
> tika-4.0.0-alpha1 - configuration file issues
> ---------------------------------------------
>
> Key: TIKA-4739
> URL: https://issues.apache.org/jira/browse/TIKA-4739
> Project: Tika
> Issue Type: Bug
> Reporter: Adrian Bird
> Priority: Major
>
> I've got some issues with the configuration and I've put them all in here.
> *1.* Error in Tika-App Integration Test 20
> The
> [test|https://tika.apache.org/docs/4.0.0-SNAPSHOT/advanced/integration-testing/tika-app.html#_test_20_create_custom_config_file]
> has a custom tika-config.json file. When I tried it I got the following
> error:
> {code:java}
> Exception in thread "main"
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException:
> Unrecognized field "timeoutMillis" (class
> org.apache.tika.pipes.core.PipesConfig), not marked as ignorable (25 known
> properties: "sleepOnStartupTimeoutMillis", "shutdownClientAfterMillis",
> "numClients", "emitWithinMillis", "configStoreParams", "emitStrategy",
> "heartbeatIntervalMs", "startupTimeoutMillis", "numEmitters",
> "staleFetcherTimeoutSeconds", "maxFilesProcessedPerProcess",
> "useSharedServer", "queueSize", "socketTimeoutMs", "parseMode",
> "stopOnlyOnFatal", "tempDirectory", "onParseException", "forkedJvmArgs",
> "maxWaitForClientMillis", "javaPath", "staleFetcherDelaySeconds",
> "configStoreType", "emitMaxEstimatedBytes", "emitIntermediateResults")
> at [Source: UNKNOWN; byte offset: #UNKNOWN] (through reference chain:
> org.apache.tika.pipes.core.PipesConfig["timeoutMillis"]) {code}
> *2.* parsers '_exclude' doesn't seem to work
> Using the config file from Test 20 above, and fixing the issue by using
> 'startupTimeoutMillis' I tried excluding a parser. I really wanted to do it
> for Tesseract but decided an easier option was PDF.
> I removed the 'pdf-parser' section from the config and did this:
> {code:java}
> {
> "default-parser": {
> "_exclude": ["pdf-parser"]
> }
> },{code}
> When I ran Tika it produced the same output as previously and processed my
> PDF file.
> *2a.* There is a documentation example that has 'exclude' rather than
> '_exclude'
> [vlm-pdf-parsing.json|https://github.com/apache/tika/blob/main/docs/modules/ROOT/examples/vlm-pdf-parsing.json]
> *3.* [Getting Started with Tika
> Pipes|https://tika.apache.org/docs/4.0.0-SNAPSHOT/pipes/getting-started.html#_json_configuration]
> JSON Configuration example doesn't work.
> When I try the example using the JSON Configuration I get the following:
> {code:java}
> INFO [pool-2-thread-1] 08:52:11,748
> org.apache.tika.pipes.core.server.FetchHandler Couldn't initialize fetcher
> for fetch id=MyTestFile.pdf
> org.apache.tika.pipes.api.fetcher.FetcherNotFoundException: Can't find
> fetcher for id=fsf. Available: []{code}
> I assume it is because there is no 'pipes-iterator' in the configuration and
> it is picking up a default.
> In my tika-config.json I changed the ids to 'fsf' and 'fse' and got the same
> error.
> I noticed that the structure of the 'fetchers' and 'emitters' is different in
> this example and the one in 1. above.
> This has an array with an 'id' key/value pair and the one in 1. above has a
> map with the 'id' being the key.
> I changed the structure to reflect what is in 1. above and it worked (if I
> left the 'id' key in there I got an error saying 'id' wasn't valid).
> I noticed a lot of test files in the repository that have the format listed
> in the Getting Started section.
> *My questions are:*
> a. what structure(s) of the 'fetchers' and 'emitters' are supported?
> b. what should the example configuration be?
> *3a.* There is a note below the command to run the config file: ??'The -i and
> -o flags override the basePath values in the config when used with
> tika-app.'??
> I'm not seeing this. The values used are from the 'basePath'. If neither the
> '-i' value on the command line nor the one in the config file exists, I get
> this message about the value in the config file:
> Exception in thread "main" java.lang.RuntimeException:
> java.lang.IllegalArgumentException: "basePath" directory does not exist:
> L:\Apache-Tika\batch-inputxxx
>