[ 
https://issues.apache.org/jira/browse/TIKA-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085211#comment-18085211
 ] 

ASF GitHub Bot commented on TIKA-4746:
--------------------------------------

Copilot commented on code in PR #2852:
URL: https://github.com/apache/tika/pull/2852#discussion_r3334170529


##########
tika-plugins-core/src/main/java/org/apache/tika/plugins/ThreadSafeUnzipper.java:
##########
@@ -71,6 +71,19 @@ public static void unzipPlugin(Path source) throws IOException {
             return;
         }
 
+        // Destination exists but has no completion marker. Possible causes:
+        // a previous extraction was killed mid-stream, the marker was deleted
+        // out from under us, or something other than our extractor put files
+        // there. Without this cleanup the subsequent Files.move() below will
+        // fail with DirectoryNotEmptyException on every run until a human
+        // manually removes the directory. Treat the half-extracted state as
+        // garbage and rebuild.
+        if (Files.exists(destination)) {
+            LOG.warn("destination {} exists without a completion marker; "
+                    + "treating as stale partial extraction and removing", destination);
+            deleteRecursively(destination);
+        }

Review Comment:
   If deleteRecursively(destination) fails to fully remove the stale 
destination (e.g., due to Windows file locks), the subsequent Files.move() will 
keep failing and the code will throw a misleading timeout from 
waitForExtractionComplete(). Consider verifying deletion succeeded and failing 
fast with a clear IOException when it did not.
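
One way to act on this suggestion, as a minimal sketch rather than the PR's actual code: wrap the delete, re-check that the path is gone, and throw an IOException that names the directory. The class name and `removeStaleDestination` are hypothetical, and the `deleteRecursively` shown here is a stand-in written for this example; the real helper in ThreadSafeUnzipper may behave differently (for instance, it may swallow per-file failures, which is exactly the case the re-check is meant to catch).

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical illustration class; not part of the Tika code base.
final class StaleDestinationCleanup {

    // Stand-in for the PR's deleteRecursively(); assumed, not copied from Tika.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return; // nothing to delete
        }
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc; // propagate a failure from iterating the directory
                }
                Files.delete(dir); // directory is empty by now
                return FileVisitResult.CONTINUE;
            }
        });
    }

    // Delete the stale destination and fail fast if anything is left behind.
    static void removeStaleDestination(Path destination) throws IOException {
        try {
            deleteRecursively(destination);
        } catch (IOException e) {
            throw new IOException("could not remove stale partial extraction at "
                    + destination + "; remove it manually and retry", e);
        }
        // Re-check explicitly in case the delete helper reported success
        // while locked files (e.g. on Windows) were left in place.
        if (Files.exists(destination)) {
            throw new IOException("stale partial extraction at " + destination
                    + " still exists after cleanup; remove it manually and retry");
        }
    }
}
```

With a check like this, a locked file surfaces as an immediate, descriptive IOException at the cleanup step instead of a later timeout from waitForExtractionComplete().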



##########
docs/modules/ROOT/pages/using-tika/server/index.adoc:
##########
@@ -24,91 +24,203 @@ This section covers running Apache Tika as a REST server via `tika-server`.
 Tika Server provides a RESTful HTTP interface for parsing documents and extracting
 content. It can be deployed as a standalone service or in a containerized environment.
 
+In Tika 4.x, all parsing happens in forked child processes via the Tika Pipes
+infrastructure — the request-handling process never loads parser libraries directly.
+This provides process isolation (a parser crash or OOM cannot take down the server)
+at the cost of requiring a Pipes configuration. See
+xref:migration-to-4x/migrating-tika-server-4x.adoc[Migrating Tika Server to 4.x]
+for the full breaking-change list when upgrading from 3.x.

Review Comment:
   The overview claims *all* parsing happens in forked child processes and the 
request-handling process never loads parser libraries. However, some endpoints 
(e.g., `/meta`) still parse in-process via 
TikaResource.createParser()/TikaResource.parse(). This should be qualified to 
avoid misleading readers about isolation guarantees.



##########
docs/modules/ROOT/pages/pipes/troubleshooting.adoc:
##########
@@ -192,6 +192,40 @@ response-body bytes for HTTP-style fetchers (configurable via
 log catches the thrown exception. Lower `maxErrMsgSize` 

> tika-4.0.0-alpha1 - General Documentation Comments
> --------------------------------------------------
>
>                 Key: TIKA-4746
>                 URL: https://issues.apache.org/jira/browse/TIKA-4746
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Adrian Bird
>            Priority: Major
>
> Here are some comments/thoughts etc. from looking at the updated, and 
> unchanged, documentation. Some comments may not be valid if associated code 
> changes have also been made (which I haven't checked). 
> I've only really looked at the Tika App / Pipes / File System combination, 
> although I have skimmed most of the others.
> On the web documentation pages there is a static 'Contents' table in the top 
> right that doesn't move when you scroll down. It is missing on the following 
> pages:
> - using-tika/index.adoc
> - configuration/index.adoc
> - maintainers/index.adoc
> Also, these pages don't open when you click the closed triangle image.
> General point - sometimes you use 'Solr' and 'Kafka' and sometimes 'Apache 
> Solr' and 'Apache Kafka'. Should they all be one or the other?
> using-tika/cli/index.adoc
> - Command Line Options - this lists a subset of the options. Shouldn't it
> list all of them, i.e. cover the same list that `--help` outputs?
> - Tika Pipes processing (the first one) - I think this could be removed as it 
> is covered in detail later.
> - Extract Markdown from a file - I think `Extract Markdown from a file` would 
> fit better after `Extract metadata as JSON`.
> - How Pipes mode is activated (2nd bullet) - with the released code I get an 
> exception if I specify both `--input` and `--output`.
> - How Pipes mode is activated (2nd bullet) - some of the options are not in 
> the Batch Options list below.
> - Tika Pipes Options - I would expect this list to match what is output when 
> doing `--help`.
> - Tika Pipes Examples - the formatting is different for these examples and 
> the ones above
> pipes/index.adoc
> - question - why is there a section on Emitters in this page, rather than in 
> the Emitters page?
> pipes/getting-started.adoc
> - JSON Configuration - 1st Note - EMIT_INTERMEDIATE_RESULTS is also a 
> placeholder token
> - JSON Configuration example - there should be a '=' in '--config 
> tika-config.json'
> pipes/iterators.adoc
> - why does it say 'they are not wrapped in a baseConfig block'? This is the
> only mention of 'baseConfig' in the documentation.
> pipes/configuration.adoc
> - Filesystem-to-filesystem pipeline - EMIT_INTERMEDIATE_RESULTS is also a 
> placeholder token
> pipes/parse-modes.adoc
> - Content Handler Types - this mentions 'ContentHandlerFactory' and 
> 'parseContext' which seem like Java names and not JSON Config names.
> - CLI Usage - should it be "The tika-app pipes processor ..." rather than 
> 'batch'
> pipes/unpack-config.adoc
> - Quick Start - 'ParseMode.UNPACK' doesn't reflect what is in the config.
> - Configuration Options - this should say that these options are defined 
> within the 'unpack-config' key.
> - Enabling Frictionless Output - is 'UnpackConfig' ok here or should it be
> 'unpack-config'?
> - CLI Usage - I don't see the '--unpack' option in the `--help` output
>  
> pipes/timeouts.adoc
> - CLI Usage - the output from '--help' doesn't seem to show that '--fork' 
> etc. can be used in Pipes mode.
> pipes/troubleshooting.adoc
> - Log levels and sensitive data - I didn't see any documentation about 
> logging in general and setting log levels.
> pipes/plugins/filesystem.adoc
> - Complete Pipeline Example - EMIT_INTERMEDIATE_RESULTS is also a placeholder 
> token
> - File System Reporter (file-system-reporter) - Configuration - statusFile -
> Does this have to be an absolute path and not a relative path? If so, it would
> be worth saying.
> - Status file schema - counts - in my file I'm seeing 'statusCounts' and not 
> 'counts'
> - Status file schema - timestamp - in my file I'm seeing 'lastUpdate' rather 
> than 'timestamp'
> - Status file schema - in my file I'm also seeing 'started'
> configuration/index.adoc
> - I don't see a general overview of the configuration structure here (I know 
> Pipes configuration is covered elsewhere). If a user new to Tika comes here 
> and is starting with V4 they need more of an overview than is currently here 
> e.g. it should cover the top level keys in a JSON config file.
> - There are no links to the VML Parsers, External Parser and Tess4J OCR pages.
> configuration/digesters.adoc
> - Supported Algorithms - the output from '--help' does not mention the last 
> three - should it?
>  
> migration-to-4x/serialization-4x.adoc
> - Friendly Naming Convention - not really specific to this page, but how do
> users know what the friendly names are? The '--list-*' options all
> produce class names.
> advanced/index.adoc 
> - it looks like all the topic entries are geared towards using the Java API 
> and aren't available through the CLI and JSON configuration, with the 
> exception of 'Setting Limits'. Is it worth adding text to this effect?
>  
> advanced/language-detection.adoc 
> - Overriding Model Selection - says 'Or via Tika’s JSON configuration
> mechanism if you are using SelfConfiguring component loading' - how can it be
> specified in a JSON config file?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
