[ https://issues.apache.org/jira/browse/TIKA-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083839#comment-18083839 ]
ASF GitHub Bot commented on TIKA-4739:
--------------------------------------
tballison commented on code in PR #2837:
URL: https://github.com/apache/tika/pull/2837#discussion_r3310961065
##########
tika-pipes/tika-async-cli/src/main/java/org/apache/tika/async/cli/PluginsWriter.java:
##########
@@ -43,50 +43,35 @@ public PluginsWriter(SimpleAsyncConfig simpleAsyncConfig, Path pluginsConfig) {
}
void write(Path output) throws IOException {
- Path baseInput = StringUtils.isBlank(simpleAsyncConfig.getInputDir()) ? Paths.get(".").toAbsolutePath() : Paths.get(simpleAsyncConfig.getInputDir());
- Path baseOutput = StringUtils.isBlank(simpleAsyncConfig.getOutputDir()) ? null : Paths.get(simpleAsyncConfig.getOutputDir());
- if (Files.isRegularFile(baseInput)) {
+ boolean userConfigProvided = !StringUtils.isBlank(simpleAsyncConfig.getTikaConfig());
+ boolean inputExplicit = !StringUtils.isBlank(simpleAsyncConfig.getInputDir());
+ boolean outputExplicit = !StringUtils.isBlank(simpleAsyncConfig.getOutputDir());
+
+ // Resolve baseInput. If -i is explicit, use it. If not and the user
+ // didn't supply --config, fall back to '.' so the template's
+ // FETCHER_BASE_PATH placeholder gets a sane default. If --config is
+ // supplied and -i isn't, baseInput stays null so we don't trample the
+ // user's own basePath values.
+ Path baseInput = null;
+ if (inputExplicit) {
+ baseInput = Paths.get(simpleAsyncConfig.getInputDir());
+ } else if (!userConfigProvided) {
Review Comment:
good catch
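
For readers following the hunk above, the resolution rule described in its code comment can be sketched as a standalone helper. This is only a minimal sketch, not the code in PR #2837: the class `BaseInputSketch`, the method `resolveBaseInput`, and its two string parameters are hypothetical stand-ins for the `SimpleAsyncConfig` getters, and the plain `.` fallback is taken from the comment text because the branch body is cut off in the quoted diff.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class BaseInputSketch {

    // Hypothetical stand-in for StringUtils.isBlank so the sketch has no dependencies.
    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /**
     * Resolves the base input path following the rule in the diff comment:
     * an explicit -i wins; with neither -i nor --config, fall back to ".";
     * with --config but no -i, return null so the user's basePath is left alone.
     */
    static Path resolveBaseInput(String inputDir, String tikaConfig) {
        boolean userConfigProvided = !isBlank(tikaConfig);
        boolean inputExplicit = !isBlank(inputDir);
        if (inputExplicit) {
            return Paths.get(inputDir);
        }
        if (!userConfigProvided) {
            // Assumption: the real branch body is not visible in the quoted hunk.
            return Paths.get(".");
        }
        return null;
    }

    public static void main(String[] args) {
        // Explicit -i, no --config.
        System.out.println(resolveBaseInput("docs", null));
        // --config supplied, no -i: the sketch leaves the path unset.
        System.out.println(resolveBaseInput(null, "tika-config.json"));
    }
}
```

Returning null in the third case mirrors the stated intent of not trampling the user's own basePath values; any caller of such a helper would need to skip the placeholder substitution when it gets null back.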
> tika-4.0.0-alpha1 - configuration file issues
> ---------------------------------------------
>
> Key: TIKA-4739
> URL: https://issues.apache.org/jira/browse/TIKA-4739
> Project: Tika
> Issue Type: Bug
> Reporter: Adrian Bird
> Priority: Major
>
> I've got some issues with the configuration and I've put them all in here.
> *1.* Error in Tika-App Integration Test 20
> The
> [test|https://tika.apache.org/docs/4.0.0-SNAPSHOT/advanced/integration-testing/tika-app.html#_test_20_create_custom_config_file]
> has a custom tika-config.json file. When I tried it I got the following
> error:
> {code:java}
> Exception in thread "main"
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException:
> Unrecognized field "timeoutMillis" (class
> org.apache.tika.pipes.core.PipesConfig), not marked as ignorable (25 known
> properties: "sleepOnStartupTimeoutMillis", "shutdownClientAfterMillis",
> "numClients", "emitWithinMillis", "configStoreParams", "emitStrategy",
> "heartbeatIntervalMs", "startupTimeoutMillis", "numEmitters",
> "staleFetcherTimeoutSeconds", "maxFilesProcessedPerProcess",
> "useSharedServer", "queueSize", "socketTimeoutMs", "parseMode",
> "stopOnlyOnFatal", "tempDirectory", "onParseException", "forkedJvmArgs",
> "maxWaitForClientMillis", "javaPath", "staleFetcherDelaySeconds",
> "configStoreType", "emitMaxEstimatedBytes", "emitIntermediateResults")
> at [Source: UNKNOWN; byte offset: #UNKNOWN] (through reference chain:
> org.apache.tika.pipes.core.PipesConfig["timeoutMillis"]) {code}
> *2.* parsers '_exclude' doesn't seem to work
> Using the config file from Test 20 above, and fixing the issue by using
> 'startupTimeoutMillis' I tried excluding a parser. I really wanted to do it
> for Tesseract but decided an easier option was PDF.
> I removed the 'pdf-parser' section from the config and did this:
> {code:java}
> {
> "default-parser": {
> "_exclude": ["pdf-parser"]
> }
> },{code}
> When I ran Tika it produced the same output as previously and processed my
> PDF file.
> *2a.* There is a documentation example that has 'exclude' rather than
> '_exclude'
> [vlm-pdf-parsing.json|https://github.com/apache/tika/blob/main/docs/modules/ROOT/examples/vlm-pdf-parsing.json]
> *3.* [Getting Started with Tika
> Pipes|https://tika.apache.org/docs/4.0.0-SNAPSHOT/pipes/getting-started.html#_json_configuration]
> JSON Configuration example doesn't work.
> When I try the example using the JSON Configuration I get the following:
> {code:java}
> INFO [pool-2-thread-1] 08:52:11,748
> org.apache.tika.pipes.core.server.FetchHandler Couldn't initialize fetcher
> for fetch id=MyTestFile.pdf
> org.apache.tika.pipes.api.fetcher.FetcherNotFoundException: Can't find
> fetcher for id=fsf. Available: []{code}
> I assume it is because there is no 'pipes-iterator' in the configuration and
> it is picking up a default.
> In my tika-config.json I changed the ids to 'fsf' and 'fse' and got the same
> error.
> I noticed that the structure of the 'fetchers' and 'emitters' is different in
> this example and the one in 1. above.
> This has an array with an 'id' key/value pair and the one in 1. above has a
> map with the 'id' being the key.
> I changed the structure to reflect what is in 1. above and it worked (if I
> left the 'id' key in there I got an error saying 'id' wasn't valid).
> I noticed a lot of test files in the repository that have the format listed
> in the Getting Started section.
> *My questions are:*
> a. what structure(s) of the 'fetchers' and 'emitters' are supported?
> b. what should the example configuration be?
> *3a.* There is a note below the command to run the config file: ??'The -i and
> -o flags override the basePath values in the config when used with
> tika-app.'??
> I'm not seeing this. The values used are those from 'basePath'. If neither the
> directory given by '-i' on the command line nor the one in the config file
> exists, I get this message about the value in the config file:
> Exception in thread "main" java.lang.RuntimeException:
> java.lang.IllegalArgumentException: "basePath" directory does not exist:
> L:\Apache-Tika\batch-inputxxx
>
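
The two shapes contrasted in item 3 of the quoted issue (an array whose entries carry an 'id' field, versus a map keyed by the id) can be illustrated with plain Java collections. This is a sketch under stated assumptions and says nothing about which shape Tika actually accepts: the class `FetcherShapeSketch` and the method `keyById` are hypothetical, and the sample values are placeholders. Only the names `id` and `basePath` come from the issue text.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FetcherShapeSketch {

    /**
     * Converts the array-style shape (each entry has an "id" field) into the
     * map-style shape (the id becomes the key and is removed from the entry).
     */
    static Map<String, Map<String, Object>> keyById(List<Map<String, Object>> entries) {
        Map<String, Map<String, Object>> byId = new LinkedHashMap<>();
        for (Map<String, Object> entry : entries) {
            Object id = entry.get("id");
            if (id == null) {
                throw new IllegalArgumentException("entry has no 'id': " + entry);
            }
            Map<String, Object> copy = new LinkedHashMap<>(entry);
            // The reporter saw an error when 'id' was left inside the map-style entry.
            copy.remove("id");
            byId.put(id.toString(), copy);
        }
        return byId;
    }

    public static void main(String[] args) {
        // Placeholder entry; the path value is illustrative only.
        List<Map<String, Object>> arrayStyle = List.of(
                Map.of("id", "fsf", "basePath", "/path/to/input"));
        System.out.println(keyById(arrayStyle));
    }
}
```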