[
https://issues.apache.org/jira/browse/TIKA-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078429#comment-18078429
]
Nicholas DiPiazza commented on TIKA-4723:
-----------------------------------------
h2. tika-grpc package size analysis (TIKA-4723)
Measured on the current main branch after running {noformat}mvn clean package -DskipTests{noformat} for tika-grpc.
h3. Size comparison: tika-grpc vs tika-server-standard
|| Module || Artifact || Size ||
| tika-grpc | tika-grpc-4.0.0-SNAPSHOT.zip | *400 MB* |
| tika-server-standard | tika-server-standard-4.0.0-SNAPSHOT-bin.zip | *69 MB* |
tika-grpc is *~6x larger* than tika-server-standard.
h3. What is inside the tika-grpc ZIP?
The ZIP has two directories:
|| Directory || Size || Contents ||
| tika-grpc/ | 200 MB | Runtime JARs (lib/) |
| plugins/ | 213 MB | 12 pf4j plugin ZIPs |
h3. Top offenders in lib/ (200 MB total)
|| JAR || Size || Root cause ||
| rocksdbjni-10.2.1.jar | *70 MB* | ignite-runner → ignite-vault →
ignite-storage-rocksdb. Contains 14+ platform native .so/.dll/.jnilib files
(Linux x86, aarch64, ppc64le, s390x, riscv64, macOS, Windows, musl variants). |
| grpc-netty-shaded-1.81.0.jar | 11 MB | grpc transport |
| bcprov-jdk18on-1.84.jar | 8.6 MB | BouncyCastle, pulled in by Ignite |
| calcite-core-1.40.0.jar | 8.3 MB | Apache Calcite SQL engine, pulled in by
ignite-sql-engine |
| fastutil-core-8.5.16.jar | 6.4 MB | fast collections, pulled in by
ignite-sql-engine |
| guava-33.6.0-jre.jar | 3.0 MB | guava |
| poi-5.5.1.jar | 2.9 MB | tika-parsers-standard-package |
| micronaut-openapi-4.10.0.jar | 2.8 MB | ignite-rest →
micronaut-http-server-netty → micronaut-openapi |
| proto-google-common-protos-2.64.1.jar | 2.8 MB | gRPC protos |
| All ignite-\*.jar combined | ~12.5 MB | ignite-sql-engine, ignite-raft,
ignite-runner, etc. (50+ JARs) |
| lombok-1.18.46.jar | 2.0 MB | tika-serialization declares it as
{{<scope>compile</scope>}} instead of {{provided}} |
h3. Top offenders in plugins/ (213 MB total)
|| Plugin ZIP || Size || Main cause ||
| tika-pipes-microsoft-graph | *83 MB* | microsoft-graph-6.61.0.jar (58 MB) +
netty-quic native libs for 4 platforms |
| tika-pipes-gcs | 46 MB | Google Cloud SDK |
| tika-pipes-az-blob | 29 MB | Azure SDK |
| tika-pipes-solr | 19 MB | Solr client |
| tika-pipes-s3 | 13 MB | AWS SDK |
| tika-pipes-kafka | 10 MB | Kafka client |
| (remaining 6 plugins) | 13 MB | |
h3. Root cause analysis
The single biggest issue is that *tika-pipes-config-store-ignite pulls in a
full embedded Ignite server node* (via {{ignite-runner}}).
{{IgniteConfigStore}} itself uses {{IgniteClient}} (thin client), but
{{IgniteStoreServer}} uses {{IgniteServer}} (full server node) which
transitively requires:
* {{ignite-storage-rocksdb}} → {{rocksdbjni}} (70 MB of multi-platform native
binaries)
* {{ignite-sql-engine}} → calcite-core, fastutil-core, avatica-core, janino,
jts-core (~19 MB)
* {{ignite-rest}} → micronaut-http-server-netty → micronaut-openapi (~6 MB)
* 50+ additional ignite server JARs (~12.5 MB)
Estimated savings if the embedded Ignite server is split out: *~110 MB from lib/* (70 + 19 + 6 + 12.5 ≈ 108 MB)
h3. Proposed remediation plan
*Priority 1 — Split embedded Ignite server out of the main bundle (~110 MB
savings)*
Create a separate module (e.g., {{tika-pipes-config-store-ignite-server}}) for
{{IgniteStoreServer}}. The {{tika-grpc}} core would then depend only on
{{ignite-client}} + {{ignite-api}} for thin-client access. The embedded server
is needed only when running without an external Ignite cluster, so it can ship
as an optional plugin or run as a separate Docker sidecar.
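A rough sketch of how the dependencies might be divided after the split. The server module name is the hypothetical one suggested above, and the {{ignite.version}} property and {{org.apache.ignite}} coordinates are assumptions to be checked against the actual poms:
{code:xml}
<!-- tika-pipes-config-store-ignite/pom.xml: thin-client access only -->
<dependency>
  <groupId>org.apache.ignite</groupId>
  <artifactId>ignite-client</artifactId>
  <version>${ignite.version}</version> <!-- assumed property -->
</dependency>
<dependency>
  <groupId>org.apache.ignite</groupId>
  <artifactId>ignite-api</artifactId>
  <version>${ignite.version}</version>
</dependency>

<!-- tika-pipes-config-store-ignite-server/pom.xml (hypothetical new module):
     the only place that pulls in the full embedded server node -->
<dependency>
  <groupId>org.apache.ignite</groupId>
  <artifactId>ignite-runner</artifactId>
  <version>${ignite.version}</version>
</dependency>
{code}
With this layout, rocksdbjni, calcite-core and the ignite-rest chain would only reach a distribution that explicitly includes the server module.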
*Priority 2 — Fix lombok scope (~2 MB savings)*
Change {{<scope>compile</scope>}} to {{<scope>provided</scope>}} for lombok in
{{tika-serialization/pom.xml}}. Lombok is an annotation processor and should
never be a runtime dependency.
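The change would look roughly like this, assuming the lombok version is managed in a parent pom (add a {{<version>}} element if it is not):
{code:xml}
<!-- tika-serialization/pom.xml -->
<dependency>
  <groupId>org.projectlombok</groupId>
  <artifactId>lombok</artifactId>
  <!-- provided: on the compile classpath, not packaged into lib/ -->
  <scope>provided</scope>
</dependency>
{code}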
*Priority 3 — Make plugins optional in the distribution (~213 MB savings in
full)*
Instead of bundling all 12 plugins in the ZIP unconditionally, consider:
* Shipping a slim ZIP that includes only the file-system and http plugins by default
* Offering individual plugin downloads
* Making the assembly configurable via a Maven profile (e.g., {{-Pslim}} vs
{{-Pfull}})
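One possible shape for the profile option, assuming the ZIP is built with {{maven-assembly-plugin}}. The descriptor file names and the {{assembly.descriptor}} property are hypothetical:
{code:xml}
<properties>
  <!-- default to the full distribution; file names are placeholders -->
  <assembly.descriptor>src/main/assembly/full.xml</assembly.descriptor>
</properties>

<profiles>
  <profile>
    <id>slim</id>
    <properties>
      <!-- descriptor listing only the file-system and http plugins -->
      <assembly.descriptor>src/main/assembly/slim.xml</assembly.descriptor>
    </properties>
  </profile>
  <profile>
    <id>full</id>
    <properties>
      <assembly.descriptor>src/main/assembly/full.xml</assembly.descriptor>
    </properties>
  </profile>
</profiles>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptors>
          <descriptor>${assembly.descriptor}</descriptor>
        </descriptors>
      </configuration>
    </plugin>
  </plugins>
</build>
{code}
Running {{mvn package -Pslim}} would then select the slim descriptor, while a plain build keeps today's behavior.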
*Priority 4 — Platform-specific rocksdbjni (if Ignite server must stay)*
If the embedded Ignite server cannot be split out immediately, use the
OS-classified artifact ({{rocksdbjni:jar:linux64}}) to include only the native
lib for the deployment platform, reducing rocksdbjni from 70 MB to ~15 MB.
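Because rocksdbjni arrives transitively, one way to do this is to exclude the all-platform JAR and re-add the classified one. This is a sketch: the {{ignite.version}} property is an assumption, and it should be confirmed that a {{linux64}} classifier is published for 10.2.1:
{code:xml}
<dependency>
  <groupId>org.apache.ignite</groupId>
  <artifactId>ignite-runner</artifactId>
  <version>${ignite.version}</version> <!-- assumed property -->
  <exclusions>
    <!-- drop the multi-platform JAR that comes in transitively -->
    <exclusion>
      <groupId>org.rocksdb</groupId>
      <artifactId>rocksdbjni</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- re-add only the Linux x86_64 native library -->
<dependency>
  <groupId>org.rocksdb</groupId>
  <artifactId>rocksdbjni</artifactId>
  <version>10.2.1</version>
  <classifier>linux64</classifier>
</dependency>
{code}
The resulting bundle would only run on the platform matching the classifier, which fits a Docker-only deployment but not a general-purpose ZIP.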
*Priority 5 — Exclude micronaut-openapi from ignite-rest scope*
Add an {{<exclusion>}} for {{io.micronaut.openapi:micronaut-openapi}} in the
{{ignite-runner}} transitive dep chain (~3 MB savings).
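A minimal sketch of the exclusion, assuming {{ignite-runner}} is declared directly in the module and that nothing in ignite-rest needs the openapi classes at runtime (worth verifying with a smoke test):
{code:xml}
<dependency>
  <groupId>org.apache.ignite</groupId>
  <artifactId>ignite-runner</artifactId>
  <version>${ignite.version}</version> <!-- assumed property -->
  <exclusions>
    <exclusion>
      <groupId>io.micronaut.openapi</groupId>
      <artifactId>micronaut-openapi</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}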
h3. Summary of estimated savings
|| Action || Estimated savings ||
| Split IgniteStoreServer into separate module | ~110 MB from lib/ |
| Fix lombok to provided scope | ~2 MB from lib/ |
| Slim default plugin set | 100–200 MB from plugins/ |
| Platform-specific rocksdbjni (fallback if Priority 1 is not done) | ~55 MB from lib/ |
| Exclude micronaut-openapi | ~3 MB from lib/ |
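Priority 1 alone would cut the tika-grpc ZIP from roughly 400 MB to roughly
290 MB. Getting close to the 69 MB of tika-server-standard would also require
slimming the default plugin set (Priority 3).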
> Slim down grpc?
> ---------------
>
> Key: TIKA-4723
> URL: https://issues.apache.org/jira/browse/TIKA-4723
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> For 4.0.0-beta, we should figure out if we can slim down tika-grpc mostly
> just for environmental reasons. It currently weighs in at 648MB.
> If we said we only support it in Docker, we could strip out some native libs.
> Other options? Claude, copilot and/or gemini, please help us save the
> environment!