[
https://issues.apache.org/jira/browse/TIKA-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092678#comment-18092678
]
Nicholas DiPiazza commented on TIKA-4735:
-----------------------------------------
h2. Steps to Reproduce
h3. Bug 1: -h conflicts with --help
Run the CLI with +-h+:
{code:bash}
java -jar tika-async-cli.jar -h
{code}
*Before fix:* prints help and exits, because the +-h+ short option was bound to +--help+.
*After fix:* +-h+ is no longer bound; use +--help+ instead.
----
h3. Bug 2: --content-only outputs JSON instead of raw content
*Setup:*
{code:bash}
mkdir -p /tmp/tika-demo/{input,output}
printf '<mock><write element="p">Hello TIKA-4735</write></mock>' > /tmp/tika-demo/input/test.xml
{code}
Run with +--content-only+:
{code:bash}
java -jar tika-async-cli.jar \
--inputDir /tmp/tika-demo/input \
--outputDir /tmp/tika-demo/output \
--handler m --content-only
cat /tmp/tika-demo/output/test.xml.md
{code}
*Before fix:*
{code:json}
[{"X-TIKA:content":"Hello TIKA-4735","Content-Type":"application/mock+xml",...}]
{code}
*After fix:*
{noformat}
Hello TIKA-4735
{noformat}
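To check the result without reading the file by eye, something like the following could be used. This is only a minimal sketch: +is_json_wrapped+ is a made-up helper name, not part of Tika, and it assumes the wrapper always starts with +[{+ as in the before-fix output above.

{code:bash}
# Hypothetical helper (not part of Tika).
# Exit status: 0 = file starts with the "[{" JSON wrapper,
#              1 = file starts with anything else (e.g. raw content),
#              2 = file is missing or unreadable.
is_json_wrapped() {
  [ -r "$1" ] || return 2
  # Drop all whitespace, then compare the first two bytes with "[{".
  case "$(tr -d '[:space:]' < "$1" | head -c 2)" in
    '[{') return 0 ;;
    *)    return 1 ;;
  esac
}

# Example:
#   is_json_wrapped /tmp/tika-demo/output/test.xml.md && echo "still JSON-wrapped"
{code}

A status of 2 is kept separate from 1 so that a missing output file is not mistaken for raw content.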
----
h3. Root Causes Fixed
# *ParseHandler* resolved the effective +ParseMode+ (falling back to the config
default) but never wrote it back into +ParseContext+. +EmitHandler+ reads
+ParseContext.get(ParseMode.class)+ with no fallback, so it always saw null and
emitted JSON.
# *EmitHandler.shouldEmit()* returned false for files under the DYNAMIC
threshold (100 KB), so the result went back to the parent process as a
passback. The parent's +AsyncEmitter+ calls +emitter.emit(List)+, which
serialises to JSON, bypassing the +emitContentOnly()+ StreamEmitter path entirely.
Both issues are fixed in PR: https://github.com/apache/tika/pull/2919
> tika-4.0.0-alpha1 - batch output contains JSON wrapper and metadata with
> --content-only
> ---------------------------------------------------------------------------------------
>
> Key: TIKA-4735
> URL: https://issues.apache.org/jira/browse/TIKA-4735
> Project: Tika
> Issue Type: Bug
> Affects Versions: 4.0.0
> Environment: Windows 11 with Java 17
> Reporter: Adrian Bird
> Priority: Major
>
> The [Basic Batch Usage
> Documentation|https://tika.apache.org/docs/4.0.0-SNAPSHOT/using-tika/cli/index.html#_basic_batch_usage]
> has this example:
> {noformat}
> java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output -h m --content-only
> {noformat}
> and description:
> This produces .md files in the output directory containing just the extracted
> markdown content — no JSON wrappers, no metadata fields.
> The example doesn't work because -h means help. -h is listed in the options
> section.
> The help that was produced just lists '--handler' for the option.
> My actual issue is with the output of the batch processing. My example:
> {noformat}
> %JAVA_HOME%\bin\java -jar %TIKA_JAR% -i Input -o Output --handler m --content-only
> {noformat}
> creates a .md file but it has a JSON wrapper and metadata fields and the
> content isn't plain text.
> I get a JSON wrapper and metadata for all the --handler formats.
> Also, if I remove the --content-only argument I get a .json file and not a
> .md file.
>