[ 
https://issues.apache.org/jira/browse/TIKA-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092709#comment-18092709
 ] 

ASF GitHub Bot commented on TIKA-4735:
--------------------------------------

nddipiazza commented on PR #2919:
URL: https://github.com/apache/tika/pull/2919#issuecomment-4846126048

   Thanks @tballison, I've addressed both points:
   
   1. **ParseHandler change**: noted that it's belt-and-suspenders. I kept it
   since it makes the invariant explicit, but I agree it's not the
   load-bearing fix.
   
   2. **Tests**: removed the 4 tests that didn't exercise the changed code
   (the arg-parsing tests in `AsyncCliParserTest` and the duplicate
   config-generation tests in `TikaConfigAsyncWriterTest`). Added
   `AsyncProcessorTest#testContentOnlyDynamicEmitStrategy`, which uses
   `emitStrategy=DYNAMIC` with a 10 MB threshold so that the small test file
   goes through the passback path, which is exactly the scenario where the
   bug manifested. Without the `EmitHandler` fix, that test would write JSON;
   with the fix, it writes raw text.




> tika-4.0.0-alpha1 - batch output contains JSON wrapper and metadata with 
> --content-only
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-4735
>                 URL: https://issues.apache.org/jira/browse/TIKA-4735
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>         Environment: Windows 11 with Java 17
>            Reporter: Adrian Bird
>            Priority: Major
>
> The [Basic Batch Usage 
> Documentation|https://tika.apache.org/docs/4.0.0-SNAPSHOT/using-tika/cli/index.html#_basic_batch_usage]
>  has this example:
> {noformat}
> java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output -h m 
> --content-only{noformat}
> and description:
> This produces .md files in the output directory containing just the extracted 
> markdown content — no JSON wrappers, no metadata fields.
> The example doesn't work because -h means help. -h is listed in the options 
> section.
> The help that was produced just lists '--handler' for the option.
> My actual issue is with the output of the batch processing. My example:
> {noformat}
> %JAVA_HOME%\bin\java -jar %TIKA_JAR%  -i Input -o Output --handler m 
> --content-only{noformat}
> creates a .md file but it has a JSON wrapper and metadata fields and the 
> content isn't plain text.
> I get a JSON wrapper and metadata for all the --handler formats.
> Also, if I remove the --content-only argument I get a .json file and not a 
> .md file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
