[
https://issues.apache.org/jira/browse/TIKA-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085193#comment-18085193
]
ASF GitHub Bot commented on TIKA-4746:
--------------------------------------
Copilot commented on code in PR #2852:
URL: https://github.com/apache/tika/pull/2852#discussion_r3333606622
##########
tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java:
##########
@@ -815,7 +839,8 @@ private void usage() {
out.println(" -l or --language Output only language");
out.println(" -d or --detect Detect document type");
out.println(" --digest=X Include digest X (md2, md5, sha1,");
- out.println(" sha256, sha384, sha512");
+ out.println(" sha256, sha384, sha512,");
+ out.println(" sha3_256, sha3_384, sha3_512)");
Review Comment:
The help text advertises SHA3 digests, but `--digest` wires up
`CommonsDigesterFactory`, and `CommonsDigester` throws
`UnsupportedOperationException` for `SHA3_*` algorithms. This makes
`--digest=sha3_256` fail at runtime, so the usage text is currently misleading.
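One way to keep the usage text from drifting is to build it from the same list that validates the argument. The following is a minimal sketch only: `DigestHelp`, `isSupported`, and `usageLine` are hypothetical names, not part of Tika, and the hard-coded list is an assumption based on the algorithms the existing help text already names.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Tika: the help text and the argument
// check share one list, so the usage output cannot advertise more than
// the check accepts.
class DigestHelp {
    // Assumption: the algorithms the CLI's digester accepts today.
    static final List<String> SUPPORTED =
            List.of("md2", "md5", "sha1", "sha256", "sha384", "sha512");

    // Case-insensitive membership check against the shared list.
    static boolean isSupported(String name) {
        return name != null && SUPPORTED.contains(name.toLowerCase(Locale.ROOT));
    }

    // Renders the usage line from the shared list.
    static String usageLine() {
        return "--digest=X Include digest X (" + String.join(", ", SUPPORTED) + ")";
    }

    public static void main(String[] args) {
        System.out.println(usageLine());
        System.out.println(isSupported("sha3_256")); // false
    }
}
```

With this shape, adding SHA3 support later would mean changing one list rather than two places.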
##########
docs/modules/ROOT/pages/using-tika/cli/index.adoc:
##########
@@ -95,22 +118,150 @@ java -jar tika-app.jar [option...] [file|port...]
|Option |Description
|`-x` or `--xml`
-|Output XHTML (default)
+|Output XHTML content (default)
|`-h` or `--html`
-|Output HTML
+|Output HTML content
|`-t` or `--text`
-|Output plain text
+|Output plain text content (body)
|`--md`
-|Output Markdown
+|Output Markdown content (body)
+
+|`-T` or `--text-main`
+|Output plain text — main content only, via the boilerpipe handler
+
+|`-A` or `--text-all`
+|Output all text content
|`-m` or `--metadata`
|Output metadata only
|`-j` or `--json`
-|Output JSON metadata
+|Output metadata in JSON
+
+|`-y` or `--xmp`
+|Output metadata in XMP
+
+|`-J` or `--jsonRecursive`
+|Output metadata and content from all embedded files. Combine with `-x`/`-h`/`-t`/`-m` to choose the content type (default: `-x`).
+
+|`-r` or `--pretty-print`
+|For JSON, XML, and XHTML output, add newlines and whitespace for readability.
+
+|`-e<X>` or `--encoding=<X>`
+|Use output encoding `<X>` (e.g. `UTF-8`).
+|===
+
+=== Detection and Language
+
+[cols="1,3"]
+|===
+|Option |Description
+
+|`-d` or `--detect`
+|Detect the document type and print the media type.
+
+|`-l` or `--language`
+|Detect and print only the language.
+|===
+
+=== Content Options
+
+[cols="1,3"]
+|===
+|Option |Description
+
+|`-p<X>` or `--password=<X>`
+|Use document password `<X>` (for encrypted PDFs, OOXML, etc.).
+
+|`--digest=<X>`
+|Include a digest of the parsed bytes. Supported: `md2`, `md5`, `sha1`, `sha256`, `sha384`, `sha512`, `sha3_256`, `sha3_384`, `sha3_512`. See xref:configuration/digesters.adoc[Digesters] for the underlying providers.
Review Comment:
This option table claims `--digest` supports `sha3_*`, but tika-app's
`--digest` implementation uses `CommonsDigesterFactory` and will throw for
`SHA3_*`. Please remove SHA3 algorithms here (or qualify them as requiring a
bouncy-castle digester configured via JSON).
##########
tika-parsers/tika-parsers-ml/tika-vlm/src/test/resources/config-examples/claude-vlm-basic.json:
##########
@@ -0,0 +1,10 @@
+{
+ "parsers": [
+ {
+ "claude-vlm-parser": {
+ "apiKey": "sk-ant-your-key-here",
+ "model": "claude-sonnet-4-20250514"
Review Comment:
This example uses an API key placeholder that matches a real Anthropic key
prefix (`sk-ant-...`). This can trigger secret scanning and is easy to mistake
for an actual credential. Use a clearly non-key placeholder (e.g.,
`YOUR_ANTHROPIC_API_KEY`).
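A small guard in the test suite could catch key-like placeholders in the example configs before a secret scanner does. This is a sketch under stated assumptions: `PlaceholderCheck` and `looksLikeKey` are hypothetical names, and the only rule applied is the `sk-ant-` prefix mentioned above, which is much cruder than what real scanners do.

```java
// Hypothetical guard, not part of Tika: flags placeholder values that
// start with the key prefix named in the review comment.
class PlaceholderCheck {
    // Assumption: the prefix alone is treated as "key-like".
    static final String KEY_PREFIX = "sk-ant-";

    static boolean looksLikeKey(String value) {
        return value != null && value.startsWith(KEY_PREFIX);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeKey("sk-ant-your-key-here"));   // true
        System.out.println(looksLikeKey("YOUR_ANTHROPIC_API_KEY")); // false
    }
}
```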
##########
tika-parsers/tika-parsers-ml/tika-vlm/src/test/resources/config-examples/claude-vlm-full.json:
##########
@@ -0,0 +1,20 @@
+{
+ "parsers": [
+ {
+ "claude-vlm-parser": {
+ "baseUrl": "https://api.anthropic.com",
+ "model": "claude-sonnet-4-20250514",
+ "prompt": "Extract all visible text from this image. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the image. Only return the extracted text.",
+ "maxTokens": 4096,
+ "timeoutSeconds": 300,
+ "apiKey": "sk-ant-your-key-here",
+ "inlineContent": true,
Review Comment:
This example uses an API key placeholder that matches a real Anthropic key
prefix (`sk-ant-...`). This can trigger secret scanning and is easy to mistake
for an actual credential. Use a clearly non-key placeholder (e.g.,
`YOUR_ANTHROPIC_API_KEY`).
##########
tika-parsers/tika-parsers-ml/tika-vlm/src/test/resources/config-examples/vlm-pdf-parsing.json:
##########
@@ -0,0 +1,16 @@
+{
+ "parsers": [
+ {
+ "default-parser": {
+ "exclude": ["pdf-parser"]
+ }
+ },
+ {
+ "claude-vlm-parser": {
+ "apiKey": "sk-ant-your-key-here",
+ "model": "claude-sonnet-4-20250514",
+ "prompt": "Extract all text from this document. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the document. Only return the extracted text."
Review Comment:
This example uses an API key placeholder that matches a real Anthropic key
prefix (`sk-ant-...`). This can trigger secret scanning and is easy to mistake
for an actual credential. Use a clearly non-key placeholder (e.g.,
`YOUR_ANTHROPIC_API_KEY`).
##########
tika-plugins-core/src/main/java/org/apache/tika/plugins/ThreadSafeUnzipper.java:
##########
@@ -71,6 +71,19 @@ public static void unzipPlugin(Path source) throws IOException {
return;
}
+ // Destination exists but has no completion marker. Possible causes:
+ // a previous extraction was killed mid-stream, the marker was deleted
+ // out from under us, or something other than our extractor put files
+ // there. Without this cleanup the subsequent Files.move() below will
+ // fail with DirectoryNotEmptyException on every run until a human
+ // manually removes the directory. Treat the half-extracted state as
+ // garbage and rebuild.
+ if (Files.exists(destination)) {
+ LOG.warn("destination {} exists without a completion marker; "
+ + "treating as stale partial extraction and removing", destination);
+ deleteRecursively(destination);
+ }
Review Comment:
This new stale-extraction cleanup behavior (deleting an existing destination
directory when the completion marker is missing) is a significant behavioral
change and isn't covered by existing tests in `tika-plugins-core`. Adding a
unit test that creates a destination dir without the marker and asserts
`unzipPlugin()` cleans it up and successfully re-extracts would help prevent
regressions (especially around Windows/DirectoryNotEmptyException scenarios).
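The shape of such a test could look roughly like the sketch below. It does not call the real `ThreadSafeUnzipper`: `StaleExtractionSketch`, `removeIfStale`, `makeStaleDestination`, and the marker file name are all hypothetical stand-ins that mimic the cleanup step in the diff, so that the scenario (destination present, marker absent) can be exercised in isolation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Stand-in for the cleanup step in the diff; not the real ThreadSafeUnzipper.
class StaleExtractionSketch {
    // Assumption: the marker file name is made up for this sketch.
    static final String MARKER = ".extraction-complete";

    // Removes destination if it exists without a marker; returns true if removed.
    static boolean removeIfStale(Path destination) {
        try {
            if (!Files.exists(destination)
                    || Files.exists(destination.resolve(MARKER))) {
                return false;
            }
            try (Stream<Path> walk = Files.walk(destination)) {
                // Reverse order deletes children before their parent directory.
                walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a half-extracted directory (files present, no marker) in a temp dir.
    static Path makeStaleDestination() {
        try {
            Path root = Files.createTempDirectory("plugin-test");
            Path dest = root.resolve("my-plugin");
            Files.createDirectories(dest.resolve("lib"));
            Files.writeString(dest.resolve("lib").resolve("partial.jar"), "truncated");
            return dest;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real test would follow the same arrange step, then call `unzipPlugin()` and assert that the destination was rebuilt with its completion marker in place.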
> tika-4.0.0-alpha1 - General Documentation Comments
> --------------------------------------------------
>
> Key: TIKA-4746
> URL: https://issues.apache.org/jira/browse/TIKA-4746
> Project: Tika
> Issue Type: Bug
> Affects Versions: 4.0.0
> Reporter: Adrian Bird
> Priority: Major
>
> Here are some comments/thoughts etc. from looking at the updated, and
> unchanged, documentation. Some comments may not be valid if associated code
> changes have also been made (which I haven't checked).
> I've only really looked at the Tika App / Pipes / File System combination,
> although I have skimmed most of the others.
> On the web documentation pages there is a static 'Contents' table in the top
> right that doesn't move when you scroll down. It is missing on the following
> pages:
> - using-tika/index.adoc
> - configuration/index.adoc
> - maintainers/index.adoc
> Also, these pages don't open when you click the closed triangle image.
> General point - sometimes you use 'Solr' and 'Kafka' and sometimes 'Apache
> Solr' and 'Apache Kafka'. Should they all be one or the other?
> using-tika/cli/index.adoc
> - Command Line Options - this lists a subset of the options. Shouldn't it
> list all of them, i.e. cover the same list that `--help` outputs?
> - Tika Pipes processing (the first one) - I think this could be removed as it
> is covered in detail later.
> - Extract Markdown from a file - I think this example would fit better
> after `Extract metadata as JSON`.
> - How Pipes mode is activated (2nd bullet) - with the released code I get an
> exception if I specify both `--input` and `--output`.
> - How Pipes mode is activated (2nd bullet) - some of the options are not in
> the Batch Options list below.
> - Tika Pipes Options - I would expect this list to match what is output when
> doing `--help`.
> - Tika Pipes Examples - the formatting of these examples differs from that
> of the ones above.
> pipes/index.adoc
> - question - why is there a section on Emitters in this page, rather than in
> the Emitters page?
> pipes/getting-started.adoc
> - JSON Configuration - 1st Note - EMIT_INTERMEDIATE_RESULTS is also a
> placeholder token
> - JSON Configuration example - there should be a '=' in '--config
> tika-config.json'
> pipes/iterators.adoc
> - why does it say 'they are not wrapped in a baseConfig block.' This is the
> only mention of 'baseConfig' in the documentation.
> pipes/configuration.adoc
> - Filesystem-to-filesystem pipeline - EMIT_INTERMEDIATE_RESULTS is also a
> placeholder token
> pipes/parse-modes.adoc
> - Content Handler Types - this mentions 'ContentHandlerFactory' and
> 'parseContext' which seem like Java names and not JSON Config names.
> - CLI Usage - should it be "The tika-app pipes processor ..." rather than
> 'batch'?
> pipes/unpack-config.adoc
> - Quick Start - 'ParseMode.UNPACK' doesn't reflect what is in the config.
> - Configuration Options - this should say that these options are defined
> within the 'unpack-config' key.
> - Enabling Frictionless Output - is 'UnpackConfig' OK here, or should it be
> 'unpack-config'?
> - CLI Usage - I don't see the '--unpack' option in the `--help` output
>
> pipes/timeouts.adoc
> - CLI Usage - the output from '--help' doesn't seem to show that '--fork'
> etc. can be used in Pipes mode.
> pipes/troubleshooting.adoc
> - Log levels and sensitive data - I didn't see any documentation about
> logging in general and setting log levels.
> pipes/plugins/filesystem.adoc
> - Complete Pipeline Example - EMIT_INTERMEDIATE_RESULTS is also a placeholder
> token
> - File System Reporter (file-system-reporter) - Configuration - statusFile -
> Does this have to be an absolute path and not a relative path? If so, it would
> be worth saying.
> - Status file schema - counts - in my file I'm seeing 'statusCounts' and not
> 'counts'
> - Status file schema - timestamp - in my file I'm seeing 'lastUpdate' rather
> than 'timestamp'
> - Status file schema - in my file I'm also seeing 'started'
> configuration/index.adoc
> - I don't see a general overview of the configuration structure here (I know
> Pipes configuration is covered elsewhere). If a user new to Tika comes here
> and is starting with V4 they need more of an overview than is currently here
> e.g. it should cover the top level keys in a JSON config file.
> - There are no links to the VLM Parsers, External Parser and Tess4J OCR pages.
> configuration/digesters.adoc
> - Supported Algorithms - the output from '--help' does not mention the last
> three - should it?
>
> migration-to-4x/serialization-4x.adoc
> - Friendly Naming Convention - not really specific to this page, but how do
> users know what the friendly names are? The '--list-*' options all
> produce class names.
> advanced/index.adoc
> - it looks like all the topic entries are geared towards using the Java API
> and aren't available through the CLI and JSON configuration, with the
> exception of 'Setting Limits'. Is it worth adding text to this effect?
>
> advanced/language-detection.adoc
> - Overriding Model Selection - says 'Or via Tika’s JSON configuration
> mechanism if you are using SelfConfiguring component loading' - how can it be
> specified in a JSON config file?
>