[ 
https://issues.apache.org/jira/browse/TIKA-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092678#comment-18092678
 ] 

Nicholas DiPiazza commented on TIKA-4735:
-----------------------------------------

h2. Steps to Reproduce

h3. Bug 1: -h conflicts with --help

Run the CLI with +-h+:
{code:bash}
java -jar tika-async-cli.jar -h
{code}
*Before fix:* prints help and exits (the -h short option was bound to --help)
*After fix:* -h is no longer bound; use --help instead

----

h3. Bug 2: --content-only outputs JSON instead of raw content

*Setup:*
{code:bash}
mkdir -p /tmp/tika-demo/{input,output}
printf '<mock><write element="p">Hello TIKA-4735</write></mock>' > /tmp/tika-demo/input/test.xml
{code}

Run with +--content-only+:
{code:bash}
java -jar tika-async-cli.jar \
  --inputDir /tmp/tika-demo/input \
  --outputDir /tmp/tika-demo/output \
  --handler m --content-only

cat /tmp/tika-demo/output/test.xml.md
{code}

*Before fix:*
{code:json}
[{"X-TIKA:content":"Hello TIKA-4735","Content-Type":"application/mock+xml",...}]
{code}

*After fix:*
{noformat}
Hello TIKA-4735
{noformat}
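
A quick way to tell the two outcomes apart is sketched below. It is only a heuristic and rests on one assumption: a wrapped result contains the +"X-TIKA:content"+ key shown in the "Before fix" output. The function name +looks_like_raw_content+ is a hypothetical helper for this demo, not part of Tika.
{code:bash}
# Hypothetical helper (not part of Tika).
# Exit status: 0 = looks like raw content, 1 = JSON wrapper found, 2 = file missing.
looks_like_raw_content() {
  [ -f "$1" ] || return 2
  # Assumption: wrapped output always carries the "X-TIKA:content" key.
  if grep -q '"X-TIKA:content"' "$1"; then
    return 1
  fi
  return 0
}

# Check the file written by the demo run above.
status=0
looks_like_raw_content /tmp/tika-demo/output/test.xml.md || status=$?
case "$status" in
  0) echo "raw content" ;;
  1) echo "JSON wrapper still present" ;;
  *) echo "output file not found" ;;
esac
{code}
A document whose extracted text itself contains the string +"X-TIKA:content"+ would be misclassified, so this is meant for the demo file only.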

----

h3. Root Causes Fixed

# *ParseHandler* resolved the effective +ParseMode+ (falling back to the config default) but never wrote it back into +ParseContext+. +EmitHandler+ reads +ParseContext.get(ParseMode.class)+ with no fallback, so it always saw null and emitted JSON.
# *EmitHandler.shouldEmit()* returned false for files under the DYNAMIC threshold (100 KB), so the result went back to the parent process as a passback. The parent's +AsyncEmitter+ calls +emitter.emit(List)+, which serialises JSON and bypasses the +emitContentOnly()+ StreamEmitter path entirely.

Both issues are fixed in PR: https://github.com/apache/tika/pull/2919

> tika-4.0.0-alpha1 - batch output contains JSON wrapper and metadata with 
> --content-only
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-4735
>                 URL: https://issues.apache.org/jira/browse/TIKA-4735
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>         Environment: Windows 11 with Java 17
>            Reporter: Adrian Bird
>            Priority: Major
>
> The [Basic Batch Usage Documentation|https://tika.apache.org/docs/4.0.0-SNAPSHOT/using-tika/cli/index.html#_basic_batch_usage] has this example:
> {noformat}
> java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output -h m --content-only
> {noformat}
> and description:
> This produces .md files in the output directory containing just the extracted 
> markdown content — no JSON wrappers, no metadata fields.
> The example doesn't work because -h means help. -h is listed in the options 
> section.
> The help that was produced just lists '--handler' for the option.
> My actual issue is with the output of the batch processing. My example:
> {noformat}
> %JAVA_HOME%\bin\java -jar %TIKA_JAR%  -i Input -o Output --handler m --content-only
> {noformat}
> creates a .md file but it has a JSON wrapper and metadata fields and the 
> content isn't plain text.
> I get a JSON wrapper and metadata for all the --handler formats.
> Also, if I remove the --content-only argument I get a .json file and not a 
> .md file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
