[ 
https://issues.apache.org/jira/browse/TIKA-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078429#comment-18078429
 ] 

Nicholas DiPiazza commented on TIKA-4723:
-----------------------------------------

h2. tika-grpc package size analysis (TIKA-4723)

Measured on the current main branch after running {{mvn clean package -DskipTests}} for tika-grpc.

h3. Size comparison: tika-grpc vs tika-server-standard

|| Module || Artifact || Size ||
| tika-grpc | tika-grpc-4.0.0-SNAPSHOT.zip | *400 MB* |
| tika-server-standard | tika-server-standard-4.0.0-SNAPSHOT-bin.zip | *69 MB* |

tika-grpc is *~6x larger* than tika-server-standard.

h3. What is inside the tika-grpc ZIP?

The ZIP has two directories:

|| Directory || Size || Contents ||
| tika-grpc/ | 200 MB | Runtime JARs (lib/) |
| plugins/ | 213 MB | 12 pf4j plugin ZIPs |

h3. Top offenders in lib/ (200 MB total)

|| JAR || Size || Root cause ||
| rocksdbjni-10.2.1.jar | *70 MB* | ignite-runner → ignite-vault → ignite-storage-rocksdb. Contains 14+ platform native .so/.dll/.jnilib files (Linux x86, aarch64, ppc64le, s390x, riscv64, macOS, Windows, musl variants). |
| grpc-netty-shaded-1.81.0.jar | 11 MB | grpc transport |
| bcprov-jdk18on-1.84.jar | 8.6 MB | BouncyCastle, pulled in by Ignite |
| calcite-core-1.40.0.jar | 8.3 MB | Apache Calcite SQL engine, pulled in by ignite-sql-engine |
| fastutil-core-8.5.16.jar | 6.4 MB | fast collections, pulled in by ignite-sql-engine |
| guava-33.6.0-jre.jar | 3.0 MB | guava |
| poi-5.5.1.jar | 2.9 MB | tika-parsers-standard-package |
| micronaut-openapi-4.10.0.jar | 2.8 MB | ignite-rest → micronaut-http-server-netty → micronaut-openapi |
| proto-google-common-protos-2.64.1.jar | 2.8 MB | gRPC protos |
| All ignite-\*.jar combined | ~12.5 MB | ignite-sql-engine, ignite-raft, ignite-runner, etc. (50+ JARs) |
| lombok-1.18.46.jar | 2.0 MB | tika-serialization declares it as {{<scope>compile</scope>}} instead of {{provided}} |

h3. Top offenders in plugins/ (213 MB total)

|| Plugin ZIP || Size || Main cause ||
| tika-pipes-microsoft-graph | *83 MB* | microsoft-graph-6.61.0.jar (58 MB) + netty-quic native libs for 4 platforms |
| tika-pipes-gcs | 46 MB | Google Cloud SDK |
| tika-pipes-az-blob | 29 MB | Azure SDK |
| tika-pipes-solr | 19 MB | Solr client |
| tika-pipes-s3 | 13 MB | AWS SDK |
| tika-pipes-kafka | 10 MB | Kafka client |
| (remaining 6 plugins) | 13 MB | |

h3. Root cause analysis

The single biggest issue is that *tika-pipes-config-store-ignite pulls in a 
full embedded Ignite server node* (via {{ignite-runner}}). 
{{IgniteConfigStore}} itself uses {{IgniteClient}} (thin client), but 
{{IgniteStoreServer}} uses {{IgniteServer}} (full server node) which 
transitively requires:
* {{ignite-storage-rocksdb}} → {{rocksdbjni}} (70 MB of multi-platform native 
binaries)
* {{ignite-sql-engine}} → calcite-core, fastutil-core, avatica-core, janino, 
jts-core (~19 MB)
* {{ignite-rest}} → micronaut-http-server-netty → micronaut-openapi (~6 MB)
* 50+ additional ignite server JARs (~12.5 MB)

Total savings if the embedded Ignite server is split out: *~110 MB from lib/* (70 + 19 + 6 + 12.5 ≈ 108 MB)

h3. Proposed remediation plan

*Priority 1 — Split embedded Ignite server out of the main bundle (~110 MB 
savings)*
Create a separate module (e.g., {{tika-pipes-config-store-ignite-server}}) for 
{{IgniteStoreServer}}. The {{tika-grpc}} core would then depend only on 
{{ignite-client}} + {{ignite-api}} for thin-client access. The embedded server 
is needed only when running without an external Ignite cluster, so it can ship 
as an optional plugin or a separate Docker sidecar.
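
A minimal sketch of how the dependency split might look. The {{org.apache.ignite}} groupId and the {{ignite.version}} property are assumptions, and the exact set of modules to touch would need to be checked against the real POMs.

{code:xml}
<!-- Existing thin-client module (sketch): no ignite-runner here -->
<dependencies>
  <dependency>
    <groupId>org.apache.ignite</groupId>
    <artifactId>ignite-client</artifactId>
    <version>${ignite.version}</version> <!-- assumed property name -->
  </dependency>
  <dependency>
    <groupId>org.apache.ignite</groupId>
    <artifactId>ignite-api</artifactId>
    <version>${ignite.version}</version>
  </dependency>
</dependencies>

<!-- Proposed tika-pipes-config-store-ignite-server module (sketch):
     the only place the embedded server node is declared -->
<dependencies>
  <dependency>
    <groupId>org.apache.ignite</groupId>
    <artifactId>ignite-runner</artifactId>
    <version>${ignite.version}</version>
  </dependency>
</dependencies>
{code}

Running {{mvn dependency:tree -Dincludes=org.rocksdb:rocksdbjni}} on tika-grpc afterwards should show whether the server-side chain is really gone from the core.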

*Priority 2 — Fix lombok scope (~2 MB savings)*
Change {{<scope>compile</scope>}} to {{<scope>provided</scope>}} for lombok in 
{{tika-serialization/pom.xml}}. Lombok is an annotation processor and should 
never be a runtime dependency.
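
The change would look roughly like this. The version is omitted on the assumption that it is managed by a parent POM; if it is not, the existing {{<version>}} element stays as it is.

{code:xml}
<dependency>
  <groupId>org.projectlombok</groupId>
  <artifactId>lombok</artifactId>
  <!-- version assumed to be managed in a parent POM -->
  <!-- provided: on the compile classpath, not packaged into lib/ -->
  <scope>provided</scope>
</dependency>
{code}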

*Priority 3 — Make plugins optional in the distribution (~213 MB savings in 
full)*
Instead of bundling all 12 plugins in the ZIP unconditionally, consider:
* A slim ZIP with only file-system + http plugins included by default
* Offering individual plugin downloads
* Making the assembly configurable via a Maven profile (e.g., {{-Pslim}} vs 
{{-Pfull}})
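
One possible shape for the profile switch, as a sketch only. The {{assembly.descriptor}} property and the two descriptor file paths are hypothetical names, and which profile is the default is a project decision.

{code:xml}
<profiles>
  <profile>
    <id>slim</id>
    <!-- deactivated automatically when another profile is selected with -P -->
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <!-- hypothetical descriptor: file-system + http plugins only -->
      <assembly.descriptor>src/main/assembly/slim.xml</assembly.descriptor>
    </properties>
  </profile>
  <profile>
    <id>full</id>
    <properties>
      <!-- hypothetical descriptor: all 12 plugins -->
      <assembly.descriptor>src/main/assembly/full.xml</assembly.descriptor>
    </properties>
  </profile>
</profiles>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptors>
          <descriptor>${assembly.descriptor}</descriptor>
        </descriptors>
      </configuration>
    </plugin>
  </plugins>
</build>
{code}

With this layout, {{mvn clean package -Pfull}} would build the complete ZIP and a plain {{mvn clean package}} would build the slim one.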

*Priority 4 — Platform-specific rocksdbjni (if Ignite server must stay)*
If the embedded Ignite server cannot be split out immediately, use the 
OS-classified artifact ({{rocksdbjni:jar:linux64}}) to include only the native 
lib for the deployment platform, reducing rocksdbjni from 70 MB to ~15 MB.
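
A sketch of how the swap could be declared, assuming the {{org.apache.ignite}} groupId, an {{ignite.version}} property, and that a {{linux64}} classifier is published for rocksdbjni 10.2.1 (worth verifying on Maven Central first). The multi-platform JAR is excluded from the transitive chain and the classified JAR is declared directly.

{code:xml}
<dependency>
  <groupId>org.apache.ignite</groupId>
  <artifactId>ignite-runner</artifactId>
  <version>${ignite.version}</version> <!-- assumed property name -->
  <exclusions>
    <!-- drop the multi-platform JAR that arrives transitively -->
    <exclusion>
      <groupId>org.rocksdb</groupId>
      <artifactId>rocksdbjni</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- re-add rocksdbjni with natives for a single platform -->
<dependency>
  <groupId>org.rocksdb</groupId>
  <artifactId>rocksdbjni</artifactId>
  <version>10.2.1</version>
  <classifier>linux64</classifier>
</dependency>
{code}

The resulting bundle would run the embedded server only on the platform matching the classifier, which fits the Docker-only option raised in the issue description.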

*Priority 5 — Exclude micronaut-openapi from ignite-rest scope*
Add an {{<exclusion>}} for {{io.micronaut.openapi:micronaut-openapi}} in the 
{{ignite-runner}} transitive dep chain (~3 MB savings).
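
For illustration, the exclusion might be attached where {{ignite-runner}} is declared. The groupId and version property for Ignite are assumptions, and whether {{ignite-rest}} still starts without these classes at runtime would need to be tested.

{code:xml}
<dependency>
  <groupId>org.apache.ignite</groupId>
  <artifactId>ignite-runner</artifactId>
  <version>${ignite.version}</version> <!-- assumed property name -->
  <exclusions>
    <exclusion>
      <groupId>io.micronaut.openapi</groupId>
      <artifactId>micronaut-openapi</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}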

h3. Summary of estimated savings

|| Action || Estimated savings ||
| Split IgniteStoreServer into separate module | ~110 MB from lib/ |
| Fix lombok to provided scope | ~2 MB from lib/ |
| Slim default plugin set | 100–200 MB from plugins/ |
| Platform-specific rocksdbjni (fallback if Priority 1 is not done) | ~55 MB from lib/ |
| Exclude micronaut-openapi | ~3 MB from lib/ |

Priority 1 alone would remove ~110 MB from lib/ and bring tika-grpc closer to 
tika-server-standard in size. Closing the rest of the gap depends on slimming 
plugins/ (213 MB).


> Slim down grpc?
> ---------------
>
>                 Key: TIKA-4723
>                 URL: https://issues.apache.org/jira/browse/TIKA-4723
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> For 4.0.0-beta, we should figure out if we can slim down tika-grpc mostly 
> just for environmental reasons. It currently weighs in at 648MB.
> If we said we only support it in Docker, we could strip out some native libs.
> Other options? Claude, copilot and/or gemini, please help us save the 
> environment!


