[ 
https://issues.apache.org/jira/browse/TIKA-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056650#comment-18056650
 ] 

Tim Allison commented on TIKA-4650:
-----------------------------------

Benchmarks run by Claude.

Complete results, including Tika 3.x:
h3. ZIP Parser Benchmark Results
h4. DefaultHandler Mode
||Branch||Small (10 entries)||Medium (1000 entries)||Large (5000 entries)||
|Tika 3.x|11.395 ms|728 ms|4059 ms|
|Main (4.x)|7.558 ms|625 ms|3589 ms|
|Feature (4.x)|7.790 ms|580 ms|3378 ms|
h4. RecursiveParserWrapper Mode
||Branch||Small (10 entries)||Medium (1000 entries)||Large (5000 entries)||
|Tika 3.x|12.810 ms|842 ms|4170 ms|
|Main (4.x)|7.485 ms|622 ms|3645 ms|
|Feature (4.x)|8.444 ms|595 ms|3453 ms|
h4. Performance Comparison: Feature (4.x) vs Tika 3.x
||Mode||Small||Medium||Large||
|DefaultHandler|32% faster|20% faster|17% faster|
|RecursiveParserWrapper|34% faster|29% faster|17% faster|
h4. Key Findings
 * On medium and large ZIPs, the feature branch (4.x), with full metadata extraction and integrity checking, outperforms both Tika 3.x and main (4.x)
 * Tika 4.x main is already 12-42% faster than 3.x (likely due to the Java 17 baseline and other improvements)
 * On medium and large ZIPs, the new ZipParser adds metadata extraction and integrity checking with *no performance penalty*
 * Small ZIPs show minimal overhead relative to main (under 1 ms) from the new architecture
 * Larger ZIPs benefit from the ZipFile-based approach with detector hints

Summary: The feature branch is 17-34% faster than Tika 3.x across all sizes and 
4-7% faster than main (4.x) on medium/large ZIPs, while adding metadata 
extraction and integrity checking capabilities. On small ZIPs it is slightly 
slower than main (by less than 1 ms).
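
For anyone who wants to re-derive the percentages from the timing tables, here is a minimal sketch of the arithmetic. The class and method names are hypothetical and not part of Tika; "percent faster" is assumed to mean the reduction in elapsed time relative to the baseline, which matches the figures in the comparison table.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: recomputes the comparison
// figures from the DefaultHandler timings reported above.
final class SpeedupSketch {

    /** Percent reduction in elapsed time of the candidate relative to the baseline. */
    static double percentFaster(double baselineMs, double candidateMs) {
        if (baselineMs <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (baselineMs - candidateMs) / baselineMs * 100.0;
    }

    public static void main(String[] args) {
        String[] sizes = {"Small", "Medium", "Large"};
        // DefaultHandler mode, milliseconds, copied from the table.
        double[] tika3x = {11.395, 728, 4059};
        double[] main4x = {7.558, 625, 3589};
        double[] feature = {7.790, 580, 3378};

        for (int i = 0; i < sizes.length; i++) {
            System.out.printf(Locale.ROOT,
                    "%s: %.1f%% vs 3.x, %.1f%% vs main%n",
                    sizes[i],
                    percentFaster(tika3x[i], feature[i]),
                    percentFaster(main4x[i], feature[i]));
        }
    }
}
```

A negative value means the candidate was slower than the baseline, which is what the small-ZIP comparison against main produces.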

> Improve zip parsing in 4.x
> --------------------------
>
>                 Key: TIKA-4650
>                 URL: https://issues.apache.org/jira/browse/TIKA-4650
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Zip parsing has a number of quirks that require special processing. Over time 
> those have accreted in the PackageParser. Further, there's not great 
> coordination between the zip detector and the zip parser...there are some 
> areas where we could streamline the detect+parse steps.
> Let's create a standalone zip parser and improve the coordination between 
> detection and parsing for zip files.


