[ https://issues.apache.org/jira/browse/TIKA-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085211#comment-18085211 ]
ASF GitHub Bot commented on TIKA-4746:
--------------------------------------
Copilot commented on code in PR #2852:
URL: https://github.com/apache/tika/pull/2852#discussion_r3334170529
##########
tika-plugins-core/src/main/java/org/apache/tika/plugins/ThreadSafeUnzipper.java:
##########
@@ -71,6 +71,19 @@ public static void unzipPlugin(Path source) throws IOException {
return;
}
+ // Destination exists but has no completion marker. Possible causes:
+ // a previous extraction was killed mid-stream, the marker was deleted
+ // out from under us, or something other than our extractor put files
+ // there. Without this cleanup the subsequent Files.move() below will
+ // fail with DirectoryNotEmptyException on every run until a human
+ // manually removes the directory. Treat the half-extracted state as
+ // garbage and rebuild.
+ if (Files.exists(destination)) {
+ LOG.warn("destination {} exists without a completion marker; "
+ + "treating as stale partial extraction and removing", destination);
+ deleteRecursively(destination);
+ }
Review Comment:
If deleteRecursively(destination) fails to fully remove the stale
destination (e.g., due to Windows file locks), the subsequent Files.move() will
keep failing and the code will throw a misleading timeout from
waitForExtractionComplete(). Consider verifying deletion succeeded and failing
fast with a clear IOException when it did not.
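A minimal sketch of what such a check could look like. The class name, the helper name `removeStaleDestination`, and the message text are hypothetical and not taken from the PR. The PR's own `deleteRecursively` is not shown in this hunk, so the sketch walks the tree itself with `Files.walkFileTree`:

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical illustration only; not the code in ThreadSafeUnzipper.
final class StaleDestinationCleanup {

    // Removes a stale destination directory and fails fast, with a message
    // naming the directory, if anything could not be deleted.
    static void removeStaleDestination(Path destination) throws IOException {
        if (!Files.exists(destination)) {
            return; // nothing to clean up
        }
        try {
            Files.walkFileTree(destination, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                        throws IOException {
                    Files.delete(file); // throws if the file is locked
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                        throws IOException {
                    if (exc != null) {
                        throw exc;
                    }
                    Files.delete(dir); // directory is empty at this point
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            throw new IOException("could not remove stale plugin directory "
                    + destination + "; release any file locks or delete it manually", e);
        }
        // Verify the result instead of assuming the walk removed everything.
        if (Files.exists(destination)) {
            throw new IOException("stale plugin directory " + destination
                    + " still exists after cleanup");
        }
    }
}
```

Used in place of the bare `deleteRecursively(destination)` call, a helper like this would report a locked file as an immediate IOException naming the directory, rather than as a later timeout.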
##########
docs/modules/ROOT/pages/using-tika/server/index.adoc:
##########
@@ -24,91 +24,203 @@ This section covers running Apache Tika as a REST server via `tika-server`.
Tika Server provides a RESTful HTTP interface for parsing documents and extracting
content. It can be deployed as a standalone service or in a containerized environment.
+In Tika 4.x, all parsing happens in forked child processes via the Tika Pipes
+infrastructure — the request-handling process never loads parser libraries directly.
+This provides process isolation (a parser crash or OOM cannot take down the server)
+at the cost of requiring a Pipes configuration. See
+xref:migration-to-4x/migrating-tika-server-4x.adoc[Migrating Tika Server to 4.x]
+for the full breaking-change list when upgrading from 3.x.
Review Comment:
The overview claims *all* parsing happens in forked child processes and the
request-handling process never loads parser libraries. However, some endpoints
(e.g., `/meta`) still parse in-process via
TikaResource.createParser()/TikaResource.parse(). This should be qualified to
avoid misleading readers about isolation guarantees.
##########
docs/modules/ROOT/pages/pipes/troubleshooting.adoc:
##########
@@ -192,6 +192,40 @@ response-body bytes for HTTP-style fetchers (configurable via
log catches the thrown exception. Lower `maxErrMsgSize`
> tika-4.0.0-alpha1 - General Documentation Comments
> --------------------------------------------------
>
> Key: TIKA-4746
> URL: https://issues.apache.org/jira/browse/TIKA-4746
> Project: Tika
> Issue Type: Bug
> Affects Versions: 4.0.0
> Reporter: Adrian Bird
> Priority: Major
>
> Here are some comments/thoughts etc. from looking at the updated, and
> unchanged, documentation. Some comments may not be valid if associated code
> changes have also been made (which I haven't checked).
> I've only really looked at the Tika App / Pipes / File System combination,
> although I have skimmed most of the others.
> On the web documentation pages there is a static 'Contents' table in the top
> right that doesn't move when you scroll down. It is missing on the following
> pages:
> - using-tika/index.adoc
> - configuration/index.adoc
> - maintainers/index.adoc
> Also, these pages don't open when you click the closed triangle image.
> General point - sometimes you use 'Solr' and 'Kafka' and sometimes 'Apache
> Solr' and 'Apache Kafka'. Should they all be one or the other?
> using-tika/cli/index.adoc
> - Command Line Options - this lists a subset of the options. Shouldn't it
> list all of them, i.e. cover the same list that is output when doing `--help`?
> - Tika Pipes processing (the first one) - I think this could be removed as it
> is covered in detail later.
> - Extract Markdown from a file - I think `Extract Markdown from a file` would
> fit better after `Extract metadata as JSON`.
> - How Pipes mode is activated (2nd bullet) - with the released code I get an
> exception if I specify both `--input` and `--output`.
> - How Pipes mode is activated (2nd bullet) - some of the options are not in
> the Batch Options list below.
> - Tika Pipes Options - I would expect this list to match what is output when
> doing `--help`.
> - Tika Pipes Examples - the formatting is different for these examples and
> the ones above
> pipes/index.adoc
> - question - why is there a section on Emitters in this page, rather than in
> the Emitters page?
> pipes/getting-started.adoc
> - JSON Configuration - 1st Note - EMIT_INTERMEDIATE_RESULTS is also a
> placeholder token
> - JSON Configuration example - there should be a '=' in '--config
> tika-config.json'
> pipes/iterators.adoc
> - why does it say 'they are not wrapped in a baseConfig block'? This is the
> only mention of 'baseConfig' in the documentation.
> pipes/configuration.adoc
> - Filesystem-to-filesystem pipeline - EMIT_INTERMEDIATE_RESULTS is also a
> placeholder token
> pipes/parse-modes.adoc
> - Content Handler Types - this mentions 'ContentHandlerFactory' and
> 'parseContext' which seem like Java names and not JSON Config names.
> - CLI Usage - should it be "The tika-app pipes processor ..." rather than
> 'batch'
> pipes/unpack-config.adoc
> - Quick Start - 'ParseMode.UNPACK' doesn't reflect what is in the config.
> - Configuration Options - this should say that these options are defined
> within the 'unpack-config' key.
> - Enabling Frictionless Output - is 'UnpackConfig' ok here or should it be
> 'unpack-config'?
> - CLI Usage - I don't see the '--unpack' option in the `--help` output
>
> pipes/timeouts.adoc
> - CLI Usage - the output from '--help' doesn't seem to show that '--fork'
> etc. can be used in Pipes mode.
> pipes/troubleshooting.adoc
> - Log levels and sensitive data - I didn't see any documentation about
> logging in general and setting log levels.
> pipes/plugins/filesystem.adoc
> - Complete Pipeline Example - EMIT_INTERMEDIATE_RESULTS is also a placeholder
> token
> - File System Reporter (file-system-reporter) - Configuration - statusFile -
> Does this have to be an absolute path and not a relative path? If so it would
> be worth saying so.
> - Status file schema - counts - in my file I'm seeing 'statusCounts' and not
> 'counts'
> - Status file schema - timestamp - in my file I'm seeing 'lastUpdate' rather
> than 'timestamp'
> - Status file schema - in my file I'm also seeing 'started'
> configuration/index.adoc
> - I don't see a general overview of the configuration structure here (I know
> Pipes configuration is covered elsewhere). If a user new to Tika comes here
> and is starting with V4 they need more of an overview than is currently here
> e.g. it should cover the top level keys in a JSON config file.
> - There are no links to the VML Parsers, External Parser and Tess4J OCR pages.
> configuration/digesters.adoc
> - Supported Algorithms - the output from '--help' does not mention the last
> three - should it?
>
> migration-to-4x/serialization-4x.adoc
> - Friendly Naming Convention - not really specific to this page, but how do
> users know what the friendly names are? The '--list-*' options all
> produce class names.
> advanced/index.adoc
> - it looks like all the topic entries are geared towards using the Java API
> and aren't available through the CLI and JSON configuration, with the
> exception of 'Setting Limits'. Is it worth adding text to this effect?
>
> advanced/language-detection.adoc
> - Overriding Model Selection - says 'Or via Tika’s JSON configuration
> mechanism if you are using SelfConfiguring component loading' - how can it be
> specified in a JSON config file?
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)