[ https://issues.apache.org/jira/browse/TIKA-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083839#comment-18083839 ]

ASF GitHub Bot commented on TIKA-4739:
--------------------------------------

tballison commented on code in PR #2837:
URL: https://github.com/apache/tika/pull/2837#discussion_r3310961065


##########
tika-pipes/tika-async-cli/src/main/java/org/apache/tika/async/cli/PluginsWriter.java:
##########
@@ -43,50 +43,35 @@ public PluginsWriter(SimpleAsyncConfig simpleAsyncConfig, Path pluginsConfig) {
     }
 
     void write(Path output) throws IOException {
-        Path baseInput = StringUtils.isBlank(simpleAsyncConfig.getInputDir())
-                ? Paths.get(".").toAbsolutePath()
-                : Paths.get(simpleAsyncConfig.getInputDir());
-        Path baseOutput = StringUtils.isBlank(simpleAsyncConfig.getOutputDir())
-                ? null
-                : Paths.get(simpleAsyncConfig.getOutputDir());
-        if (Files.isRegularFile(baseInput)) {
+        boolean userConfigProvided = !StringUtils.isBlank(simpleAsyncConfig.getTikaConfig());
+        boolean inputExplicit = !StringUtils.isBlank(simpleAsyncConfig.getInputDir());
+        boolean outputExplicit = !StringUtils.isBlank(simpleAsyncConfig.getOutputDir());
+
+        // Resolve baseInput. If -i is explicit, use it. If not and the user
+        // didn't supply --config, fall back to '.' so the template's
+        // FETCHER_BASE_PATH placeholder gets a sane default. If --config is
+        // supplied and -i isn't, baseInput stays null so we don't trample the
+        // user's own basePath values.
+        Path baseInput = null;
+        if (inputExplicit) {
+            baseInput = Paths.get(simpleAsyncConfig.getInputDir());
+        } else if (!userConfigProvided) {

Review Comment:
   good catch
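
For readers following the thread, the rule described in the diff comment above can be sketched as a standalone method. This is only an illustration of the three cases in that comment, not the code from the PR: the class name `BaseInputSketch` and the helper `resolveBaseInput` are hypothetical, a plain null/blank check stands in for `StringUtils.isBlank`, and the fallback is assumed to be a relative `.` because the hunk is cut off before the fallback branch is shown.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class BaseInputSketch {

    // Hypothetical helper (not part of Tika) mirroring the three cases
    // described in the diff comment.
    static Path resolveBaseInput(String inputDir, String tikaConfig) {
        boolean inputExplicit = inputDir != null && !inputDir.trim().isEmpty();
        boolean userConfigProvided = tikaConfig != null && !tikaConfig.trim().isEmpty();

        if (inputExplicit) {
            // -i was given: always use it.
            return Paths.get(inputDir);
        }
        if (!userConfigProvided) {
            // Neither -i nor --config: assumed fallback to the current directory.
            return Paths.get(".");
        }
        // --config without -i: leave unset so the user's basePath values survive.
        return null;
    }

    public static void main(String[] args) {
        System.out.println(resolveBaseInput("docs", null));
        System.out.println(resolveBaseInput(null, null));
        System.out.println(resolveBaseInput(null, "tika-config.json"));
    }
}
```

The third case is the one relevant to item 3a of the issue: when only a config file is supplied, the sketch returns null, so a caller would leave the config's own `basePath` untouched.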





> tika-4.0.0-alpha1 - configuration file issues
> ---------------------------------------------
>
>                 Key: TIKA-4739
>                 URL: https://issues.apache.org/jira/browse/TIKA-4739
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Adrian Bird
>            Priority: Major
>
> I've got some issues with the configuration and I've put them all in here. 
> *1.* Error in Tika-App Integration Test 20
> The 
> [test|https://tika.apache.org/docs/4.0.0-SNAPSHOT/advanced/integration-testing/tika-app.html#_test_20_create_custom_config_file]
>  has a custom tika-config.json file. When I tried it I got the following 
> error:
> {code:java}
> Exception in thread "main" 
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: 
> Unrecognized field "timeoutMillis" (class 
> org.apache.tika.pipes.core.PipesConfig), not marked as ignorable (25 known 
> properties: "sleepOnStartupTimeoutMillis", "shutdownClientAfterMillis", 
> "numClients", "emitWithinMillis", "configStoreParams", "emitStrategy", 
> "heartbeatIntervalMs", "startupTimeoutMillis", "numEmitters", 
> "staleFetcherTimeoutSeconds", "maxFilesProcessedPerProcess", 
> "useSharedServer", "queueSize", "socketTimeoutMs", "parseMode", 
> "stopOnlyOnFatal", "tempDirectory", "onParseException", "forkedJvmArgs", 
> "maxWaitForClientMillis", "javaPath", "staleFetcherDelaySeconds", 
> "configStoreType", "emitMaxEstimatedBytes", "emitIntermediateResults")
>  at [Source: UNKNOWN; byte offset: #UNKNOWN] (through reference chain: 
> org.apache.tika.pipes.core.PipesConfig["timeoutMillis"]) {code}
> *2.* parsers '_exclude' doesn't seem to work
> Using the config file from Test 20 above, and fixing the issue by using 
> 'startupTimeoutMillis' I tried excluding a parser. I really wanted to do it 
> for Tesseract but decided an easier option was PDF.
> I removed the 'pdf-parser' section from the config and did this:
> {code:java}
>     {
>       "default-parser": {
>         "_exclude": ["pdf-parser"]
>       }
>     },{code}
> When I ran Tika it produced the same output as previously and processed my 
> PDF file.
> *2a.* There is a documentation example that has 'exclude' rather than 
> '_exclude' 
> [vlm-pdf-parsing.json|https://github.com/apache/tika/blob/main/docs/modules/ROOT/examples/vlm-pdf-parsing.json]
> *3.* [Getting Started with Tika 
> Pipes|https://tika.apache.org/docs/4.0.0-SNAPSHOT/pipes/getting-started.html#_json_configuration]
>  JSON Configuration example doesn't work.
> When I try the example using the JSON Configuration I get the following:
> {code:java}
> INFO  [pool-2-thread-1] 08:52:11,748 
> org.apache.tika.pipes.core.server.FetchHandler Couldn't initialize fetcher 
> for fetch id=MyTestFile.pdf
> org.apache.tika.pipes.api.fetcher.FetcherNotFoundException: Can't find 
> fetcher for id=fsf. Available: []{code}
> I assume it is because there is no 'pipes-iterator' in the configuration and 
> it is picking up a default.
> In my tika-config.json I changed the ids to 'fsf' and 'fse' and got the same 
> error.
> I noticed that the structure of the 'fetchers' and 'emitters' is different in 
> this example and the one in 1. above.
> This has an array with an 'id' key/value pair and the one in 1. above has a 
> map with the 'id' being the key.
> I changed the structure to reflect what is in 1. above and it worked (if I 
> left the 'id' key in there I got an error saying 'id' wasn't valid).
> I noticed a lot of test files in the repository that have the format listed 
> in the Getting Started section.
> *My questions are:*
> a. what structure(s) of the 'fetchers' and 'emitters' are supported?
> b. what should the example configuration be?
> *3a.* There is a note below the command to run the config file: ??'The -i and 
> -o flags override the basePath values in the config when used with 
> tika-app.'?? 
> I'm not seeing this. The values used are from the 'basePath'. If neither the 
> '-i' value on the command line nor the value in the config file exists, I get 
> this message about the value in the config file: 
> Exception in thread "main" java.lang.RuntimeException: 
> java.lang.IllegalArgumentException: "basePath" directory does not exist: 
> L:\Apache-Tika\batch-inputxxx
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
