(opennlp-sandbox) branch OPENNLP-1833-grpc-expansion updated: initial design doc and merged from main

kristian Sat, 06 Jun 2026 14:13:54 -0700

This is an automated email from the ASF dual-hosted git repository.

krickert pushed a commit to branch OPENNLP-1833-grpc-expansion
in repository https://gitbox.apache.org/repos/asf/opennlp-sandbox.git



The following commit(s) were added to refs/heads/OPENNLP-1833-grpc-expansion by 
this push:
     new ee9b8750 initial design doc and merged from main
ee9b8750 is described below

commit ee9b8750bb02597eccd707bb5ff481d71dac7b5f
Author: Kristian Rickert <[email protected]>
AuthorDate: Sat Jun 6 17:13:39 2026 -0400

    initial design doc and merged from main
---
 opennlp-grpc/docs/rfc/opennlp-grpc-design.md       | 993 +++++++++++++++++++++
 .../docs/rfc/opennlp-grpc-jira-proposal.md         | 233 +++++
 .../apache/opennlp/grpc/v1/opennlp_document.proto  | 165 ++++
 .../apache/opennlp/grpc/v1/opennlp_pipeline.proto  | 117 +++
 .../apache/opennlp/grpc/v1/opennlp_service.proto   |  92 ++
 .../tools/jsmlearning/TreeKernelRunner.java        |   2 +-
 .../src/test/resources/sentence_parseObject.csv    |   6 +-
 7 files changed, 1604 insertions(+), 4 deletions(-)

diff --git a/opennlp-grpc/docs/rfc/opennlp-grpc-design.md b/opennlp-grpc/docs/rfc/opennlp-grpc-design.md
new file mode 100644
index 00000000..c9a8d52c
--- /dev/null
+++ b/opennlp-grpc/docs/rfc/opennlp-grpc-design.md
@@ -0,0 +1,993 @@
+# OpenNLP gRPC API - Design Document (Phase 1)
+
+## Summary
+
+OpenNLP is a mature JVM library. Teams load models, run tokenizers and 
taggers, extract entities, and, more and more, generate embeddings, all 
in-process inside a Java application. That model still makes sense for many use 
cases. But a lot of modern stacks do not look like that: Python data pipelines, 
Go or Rust microservices, search platforms that want annotated text with chunks 
and vectors in one pass, and deployments where GPU-backed inference belongs on 
a shared service rather than in  [...]
+
+The sandbox gRPC proof of concept showed that exposing OpenNLP over the 
network works. This RFC is the next step: evolve that POC into a 
**document-centric, language-neutral** API. You send text (or a partially 
analyzed document) and get back a single, structured result: sentences and 
tokens, named entities, optional syntactic chunks, multiple segmentation 
strategies each with their own embeddings, and diagnostics when something 
optional did not quite land. The core library stays free of  [...]
+
+### Why we're doing this
+
+The legacy sandbox exposed separate RPCs per tool: tokenize here, tag there, 
find entities somewhere else. That is faithful to how the Java API is 
organized, but it pushes orchestration onto every client. Real workflows need 
the full picture: linguistic structure, retrieval-oriented chunks, and vectors, 
produced in the right order without the caller wiring six calls together.
+
+We also hear clearly from the community that **chunking and embeddings belong 
in v1**, not as a later add-on. Search, hybrid retrieval, and RAG-style 
indexing all want "give me this document, chunked and embedded, ready to 
index", often with more than one chunking strategy in the same run so you can 
compare sentence-level vs. fixed-window approaches without paying for 
tokenization and NER three times over.
+
+Finally, OpenNLP should not require every downstream system to host the JVM. A 
strongly typed binary protocol (protobuf over gRPC) is how many services already 
talk to each other. Meeting that expectation lowers the friction for polyglot 
teams and for platforms that already standardize on gRPC for internal APIs.
+
+### What it can unlock
+
+**gRPC-native integration.** Systems that already speak gRPC get a first-class 
way to call OpenNLP: discover what models and profiles the server offers, 
submit a document, receive a typed result. No one-off REST schemas, no ad-hoc 
JSON field naming, no JNI shim in every language binding.
+
+**Polyglot document enrichment.** A Python ingestion job, a Go API layer, and 
a Rust indexer can all send the same document shape and receive the same 
annotated structure back. That makes cross-language pipelines easier to build, 
test, and operate: you are not maintaining parallel "how we call OpenNLP" 
stories in every repo.
+
+**Streaming and incremental results.** Long documents and live text feeds 
should not block on one monolithic response at the end. The contract is shaped 
so analysis can stream partial results as they are ready (sentences as they are 
detected, chunks and embeddings as groups complete) rather than forcing the 
client to wait for the entire pipeline to finish.
+
+**Shared NLP infrastructure.** One well-provisioned OpenNLP server (with GPUs 
when embedding workloads warrant it) can serve many lightweight clients. Model 
loading, versioning, and heavy inference concentrate where the hardware is, 
instead of duplicating JVMs and model bundles across every service.
+
+**Search, RAG, and semantic indexing in one shot.** Multiple chunk-and-embed 
configurations in a single analysis run means a single ingestion path can feed 
a sentence-level index, a fixed-window RAG store, and an experimental strategy 
side by side. The linguistic backbone is computed once; each strategy gets its 
own group of chunks with embeddings carried inside them.
+
+**Two-way flexibility for JVM teams.** Non-JVM clients call the server over 
the network. Java applications can do the same when they want to offload heavy 
steps, or keep using a pure-Java processor in-process when that is simpler. Same 
conceptual document model either way, with a path later toward a small 
gRPC-free core type in opennlp-api.
+
+Phase 1 is agreement on this contract: the protos and the design captured here. 
Implementation in the sandbox and graduation toward an Apache OpenNLP release 
follow once the community is comfortable with the shape.
+
+## Design
+
+**Canonical location:** Living design doc for 
[OPENNLP-1833](https://issues.apache.org/jira/browse/OPENNLP-1833). Active work 
happens in **opennlp-sandbox** on branch `OPENNLP-1833-grpc-expansion`. Proto 
sources: `opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/`.
+
+
+| Field                | Value                                            |
+| -------------------- | ------------------------------------------------ |
+| **Status**           | Draft RFC                                        |
+| **Version**          | 0.5                                              |
+| **API version**      | `v1`                                             |
+| **OpenNLP baseline** | 3.0.0-SNAPSHOT (JDK 21+)                         |
+| **Companion**        | [JIRA proposal](./opennlp-grpc-jira-proposal.md) |
+
+
+---
+
+# Part I - Overview
+
+## Where we are
+
+This RFC evolves the sandbox gRPC POC into a **document-centric, 
language-neutral** contract: one `AnalyzeDocument` RPC that takes `raw_text` in 
and returns an enriched `OpenNlpDocument` - sentences, tokens, entities, 
optional chunk groups with embeddings, diagnostics, and more.
+
+**What exists today (on the sandbox branch):**
+
+
+| Artifact                                                                                | Status                          |
+| --------------------------------------------------------------------------------------- | ------------------------------- |
+| Legacy POC (`opennlp.proto`, 3 granular services, `OpenNLPServer`)                      | Committed, unchanged            |
+| v1 protos (`opennlp_document.proto`, `opennlp_pipeline.proto`, `opennlp_service.proto`) | Written, not yet wired to build |
+| This RFC + JIRA companion                                                               | Written                         |
+| v1 Java processor / server / codegen                                                    | Not started                     |
+| Core `opennlp-api` `Document` interface                                                 | Proposed for later              |
+
+
+**Phase 1 deliverable:** design consensus + stable wire contract. **Phase 2:** 
implementation in the sandbox, then graduation to `apache/opennlp` (target 
~3.1.x per community feedback).
+
+## Target architecture
+
+One call. Shared NLP backbone computed once. Multiple chunking strategies, 
each with explicitly named embedding models. Embeddings live **inside** each 
chunk in the reply.
+
+```mermaid
+flowchart TB
+  subgraph request [AnalyzeDocumentRequest]
+    DOC[OpenNlpDocument raw_text]
+    BASE[profile / profile_id - optional]
+    MULTI[chunk_embed_configs]
+    MULTI --> E1["strategy A + embedding_model_ids"]
+    MULTI --> E2["strategy B + embedding_model_ids"]
+  end
+
+  subgraph server [Server - Phase 2]
+    PROC[Pure-Java processor]
+    PROC --> SHARED["Shared NLP once: sentences, tokens, NER, ..."]
+    PROC --> G1["Group A: chunks with embeddings inside"]
+    PROC --> G2["Group B: chunks with embeddings inside"]
+  end
+
+  subgraph response [OpenNlpDocument reply]
+    SENT[AnnotatedSentence backbone]
+    GRP[chunk_embedding_groups]
+    GRP --> C1["Chunk: text_content + embeddings"]
+  end
+
+  request --> server
+  server --> response
+```
+
+
+
+**Request rule:** chunking strategy first. Per strategy, the caller names 
which `embedding_model_ids` to apply to that strategy's chunks. The server does 
**not** build an automatic N×M cartesian product of strategies and models; each 
strategy is embedded only with the models listed in its own entry.
+
+**Reply rule:** `ChunkEmbeddingGroup` holds the chunks for one strategy. Each 
`Chunk` repeats `text_content` for convenience and carries `repeated 
EmbeddingResult embeddings` inside it. The shared `AnnotatedSentence` list is 
the linguistic backbone computed once.
+
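The request and reply rules above can be sketched in a few lines of Python. This is a minimal illustration, not the planned Java processor: the `STRATEGIES` table, the `fake_embed` stand-in, the dict-shaped config entries, and the model IDs `model_x` / `model_y` are all hypothetical. It assumes the sentence spans (the shared backbone) were computed once and are passed to every strategy.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    start: int
    end: int
    text_content: str
    embeddings: dict = field(default_factory=dict)  # model_id -> vector


def sentence_chunks(text, sentences):
    # One chunk per sentence span from the shared backbone.
    return [Chunk(s, e, text[s:e]) for s, e in sentences]


def fixed_window_chunks(text, sentences, size=20):
    # Fixed-size character windows; ignores sentence boundaries.
    return [Chunk(i, min(i + size, len(text)), text[i:i + size])
            for i in range(0, len(text), size)]


# Hypothetical strategy registry (names are illustrative only).
STRATEGIES = {"sentence": sentence_chunks, "fixed_window": fixed_window_chunks}


def fake_embed(model_id, text):
    # Stand-in for a real embedding model: a deterministic 2-dim vector.
    return [float(len(text)), float(len(model_id))]


def build_groups(text, sentences, chunk_embed_configs):
    """Chunk per strategy, then embed only with the models that entry names."""
    groups = {}
    for cfg in chunk_embed_configs:
        chunks = STRATEGIES[cfg["strategy"]](text, sentences)
        for chunk in chunks:
            for model_id in cfg["embedding_model_ids"]:
                chunk.embeddings[model_id] = fake_embed(model_id, chunk.text_content)
        groups[cfg["config_id"]] = chunks
    return groups


text = "One two. Three four five."
sentences = [(0, 8), (9, 25)]  # shared backbone, computed once
configs = [
    {"config_id": "a", "strategy": "sentence", "embedding_model_ids": ["model_x"]},
    {"config_id": "b", "strategy": "fixed_window",
     "embedding_model_ids": ["model_x", "model_y"]},
]
groups = build_groups(text, sentences, configs)
for group_id, chunks in groups.items():
    print(group_id, [(c.text_content, sorted(c.embeddings)) for c in chunks])
```

Group `a` carries only `model_x` vectors while group `b` carries both, which is the point of naming models per strategy rather than crossing every strategy with every model.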
+## Key design decisions
+
+
+| Topic                    | Decision                                                                               |
+| ------------------------ | -------------------------------------------------------------------------------------- |
+| **Primary RPC**          | `OpenNlpAnalysisService.AnalyzeDocument`                                               |
+| **Package**              | `org.apache.opennlp.grpc.v1`                                                           |
+| **Legacy services**      | Retain under `org.apache.opennlp.grpc.legacy.v1` during transition                     |
+| **Core library**         | Stays gRPC-free; wire API in optional Maven modules                                    |
+| **Build**                | Maven + `protobuf-maven-plugin` only                                                   |
+| **CHUNK + EMBED**        | First-class v1 steps (`ChunkerME`, `SentenceVectorsDL`, segmentation chunking)         |
+| **GPU / providers**      | Hot-swappable via `InferenceBackend` + provider SPI; CUDA/OpenVINO as optional modules |
+| **Multi-group**          | `repeated ChunkEmbedConfigEntry` in request → `repeated ChunkEmbeddingGroup` in reply  |
+| **Embeddings placement** | Inside `Chunk`, not a separate flat list per model                                     |
+| **Partial failures**     | Required steps fail the RPC; optional steps return best-effort + diagnostics           |
+| **Stateless contract**   | One document per RPC; `clear_adaptive_data` controls NER adaptive state only           |
+
+
+## Community consensus (dev@, May–June 2026)
+
+Feedback from Martin Wiesner, Richard Zowalla, and others on OPENNLP-1833:
+
+- **Sandbox-first** - iterate here, graduate to main after review; no rush for 
3.0.0.
+- **Neutral core `Document` interface** - small gRPC-free type in 
`opennlp-api` later; `OpenNlpDocument` is the wire form.
+- **Embeddings and chunking in v1** - not deferred; GPU acceleration via 
optional provider modules.
+- **Discovery** - `ListModelBundles` + `GetServiceInfo` must expose enough 
metadata to choose bundles/profiles.
+- **Two-way usage** - other languages call the server; JVM code can call the 
server via stubs for heavy steps; a pure-Java processor underneath supports 
in-process use too.
+
+## What comes next
+
+1. Community review of this RFC + v1 protos.
+2. Wire `protobuf-maven-plugin`, generate stubs.
+3. Pure-Java processor: shared NLP once → per-strategy groups → embeddings 
inside chunks.
+4. Minimal `AnalyzeDocument` server implementation.
+5. Propose core `Document` / `AnalyzedDocument` API in `opennlp-api`.
+
+---
+
+# Part II - Specification
+
+## 1. Goals
+
+1. Define a **language-neutral, document-centric** gRPC contract for Apache 
OpenNLP inference.
+2. Enable **cross-platform clients** (Python, Go, Rust, etc.) without JNI or 
embedding the JVM in every service.
+3. Support a **single-call pipeline** (`AnalyzeDocument`) that replaces 
client-side chaining of granular RPCs.
+4. Preserve a clean separation: **core library stays gRPC-free**; wire API 
lives in optional Maven modules.
+5. Include **CHUNK and EMBED as first-class v1 steps** (using existing OpenNLP 
`ChunkerME` and `SentenceVectorsDL` from opennlp-dl for ONNX embeddings). 
Advanced GPU acceleration (CUDA via onnxruntime-gpu, OpenVINO for Intel) and 
hot-swappable provider implementations live behind a narrow middle interface / 
provider SPI; these can be delivered in separate optional modules/builds 
without changing the wire contract or core processor.
+
+## 2. Non-goals
+
+See the JIRA proposal. In addition, Phase 1 is design only: no server 
implementation, no deployment guide, no performance SLAs.
+
+## 3. Background
+
+### 3.1 Main repository
+
+- Maven multi-module library; public API in `opennlp-api`, engines in 
`opennlp-runtime`.
+- No `.proto` files or gRPC dependencies on `main`.
+- NLP tasks map to Java interfaces (`Tokenizer`, `SentenceDetector`, 
`POSTagger`, `TokenNameFinder`, etc.).
+
+### 3.2 Sandbox POC
+
+Location: 
[https://github.com/apache/opennlp-sandbox/tree/main/opennlp-grpc](https://github.com/apache/opennlp-sandbox/tree/main/opennlp-grpc)
+
+Modules:
+
+- `opennlp-grpc-api` - `opennlp.proto`, generated stubs
+- `opennlp-grpc-service` - server, per-tool services, directory/JAR model 
scanning
+- `examples` - Python client
+
+Limitations motivating this redesign:
+
+- Per-tool services and string payloads
+- No document-level result aggregation
+- No pipeline profile or step diagnostics
+- `package opennlp` / `java_outer_classname` bundling (discouraged for 
multi-file generation)
+
+### 3.3 UIMA reference pipeline
+
+`OpenNlpTextAnalyzer.xml` delegates: SentenceDetector → Tokenizer → 
NameFinders → PosTagger → Chunker → Parser.
+
+The gRPC server orchestrator should mirror this **order** when steps are 
enabled in `AnalysisProfile`.
+
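As a rough illustration of "mirror this order when steps are enabled", the sketch below sorts whatever steps a profile enables into the UIMA delegate order. The step names `NER` and `PARSE` are assumed spellings based on how the RFC text refers to them; the authoritative enum lives in `opennlp_pipeline.proto`, and `order_steps` is a hypothetical helper.

```python
# Delegate order from the UIMA reference pipeline:
# SentenceDetector -> Tokenizer -> NameFinders -> PosTagger -> Chunker -> Parser.
# Step names are assumptions; check them against the PipelineStep enum.
CANONICAL_ORDER = ["SENTENCE_DETECT", "TOKENIZE", "NER", "POS_TAG", "CHUNK", "PARSE"]


def order_steps(requested):
    """Return the requested steps in canonical order, dropping duplicates."""
    unknown = set(requested) - set(CANONICAL_ORDER)
    if unknown:
        raise ValueError(f"unsupported steps: {sorted(unknown)}")
    wanted = set(requested)
    return [step for step in CANONICAL_ORDER if step in wanted]


print(order_steps(["POS_TAG", "SENTENCE_DETECT", "TOKENIZE"]))
```

The caller's ordering is ignored on purpose, so a profile cannot accidentally run a tagger before the tokenizer it depends on.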
+### 3.4 Deep learning / GPU (v1 + provider evolution)
+
+- `opennlp-dl`: ONNX Runtime support including `SentenceVectorsDL` for 
embeddings, plus `NameFinderDL` and `DocumentCategorizerDL`. These are the 
foundation for the v1 `EMBED` step (and future DL-backed NER/categorization).
+- `opennlp-dl-gpu`: swaps the CPU onnxruntime for `onnxruntime_gpu` (CUDA on 
NVIDIA). This is one of the concrete provider implementations behind the 
hot-swap story.
+- A narrow provider SPI / middle interface (behind `InferenceBackend` and 
per-component selection in profiles/options) allows the pure-Java processor 
(and thus the gRPC server) to dispatch `EMBED` (and later other steps) to 
different backends. Concrete providers for CUDA, a future OpenVINO backend 
(Intel GPU/accelerators), DJL, or even remote endpoints (KServe v2 or another 
OpenNLP gRPC instance) can live in separate optional modules with their own 
build artifacts and dependencies. The b [...]
+- CHUNK and EMBED (with basic ONNX) are in-scope for the initial v1 contract 
and sandbox implementation. Advanced acceleration and additional providers are 
implementation work that does not require wire changes.
+
+The initiating email for OPENNLP-1833 emphasizes GPU embeddings (CUDA for 
NVIDIA, OpenVINO for Intel) with a hot-swappable middle interface whose 
implementations are their own builds. This design makes that explicit via the 
provider mechanism while keeping the `OpenNlpDocument` / `AnalyzeDocument` 
contract stable.
+
+---
+
+## 4. Architecture
+
+### 4.1 Module layout (implementation phases 2+)
+
+```
+apache/opennlp/
+├── opennlp-api/              # unchanged
+├── opennlp-runtime/          # unchanged
+├── opennlp-grpc-api/         # NEW: protos + generated code
+├── opennlp-grpc-server/      # NEW: Netty/shaded server, orchestrator
+└── opennlp-grpc-examples/    # NEW: optional samples
+```
+
+Dependency rule: `opennlp-grpc-server` → `opennlp-grpc-api`, 
`opennlp-runtime`, `opennlp-model-resolver`; optional `opennlp-dl-gpu`.
+
+### 4.2 Three-layer proto model
+
+```mermaid
+flowchart TB
+  subgraph L3 [Layer 3 - Service]
+    SVC[OpenNlpAnalysisService]
+  end
+  subgraph L2 [Layer 2 - Pipeline]
+    PROF[AnalysisProfile]
+    OPT[AnalysisOptions]
+    BUNDLE[ModelBundleRef]
+  end
+  subgraph L1 [Layer 1 - Domain]
+    DOC[OpenNlpDocument]
+  end
+  SVC --> PROF
+  SVC --> OPT
+  PROF --> BUNDLE
+  PROF --> DOC
+  SVC --> DOC
+```
+
+
+
+
+| Layer | File (proposed)          | Responsibility                                           |
+| ----- | ------------------------ | -------------------------------------------------------- |
+| 1     | `opennlp_document.proto` | Document, spans, tokens, entities, analytics, embeddings |
+| 2     | `opennlp_pipeline.proto` | Profiles, steps, model refs, options, backends           |
+| 3     | `opennlp_service.proto`  | gRPC services and request/response envelopes             |
+
+
+All files share `package org.apache.opennlp.grpc.v1`.
+
+### 4.3 Runtime flow
+
+```mermaid
+sequenceDiagram
+  participant C as Client
+  participant S as GrpcServer
+  participant M as ModelBundleCache
+  participant N as OpenNlpRuntime
+
+  C->>S: AnalyzeDocument(doc, profile, options)
+  S->>S: Validate doc_id, raw_text
+  S->>M: Resolve ModelBundleRef
+  alt LANGUAGE_DETECT in profile
+    S->>N: LanguageDetectorME
+  end
+  S->>N: SentenceDetectorME(raw_text)
+  loop Each sentence span
+    S->>N: TokenizerME
+    S->>N: POSTaggerME
+    opt NER in profile
+      S->>N: NameFinderME per model type
+    end
+    opt CHUNK in profile
+      S->>N: ChunkerME
+    end
+  end
+  S->>S: CharSpanMapper to OpenNlpDocument
+  S-->>C: AnalyzeDocumentResponse
+```
+
+
+
+---
+
+## 5. Offset and span contract
+
+OpenNLP Java APIs mix coordinate systems:
+
+
+| API                              | Span reference                    |
+| -------------------------------- | --------------------------------- |
+| `Tokenizer.tokenizePos`          | Character offsets in input string |
+| `SentenceDetector.sentPosDetect` | Character offsets in document     |
+| `TokenNameFinder.find(String[])` | **Token indices** in sentence     |
+| `DocumentNameFinder`             | Per-sentence token indices        |
+
+
+**Wire contract (mandatory for v1):**
+
+- Every `CharSpan` in `OpenNlpDocument` and in RPC responses MUST use 
`CoordinateSpace.CHAR_DOCUMENT` unless explicitly documented otherwise.
+- Offsets are **half-open** `[start, end)` into `raw_text`, matching 
`opennlp.tools.util.Span`.
+- The server is solely responsible for converting token-index spans from 
`NameFinderME` to character spans before returning.
+
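The conversion the server owes its clients can be shown with a small sketch. It assumes the tokenizer reported token character spans relative to the sentence and that the sentence's start offset in `raw_text` is known; the function name and argument shapes are hypothetical, and the real implementation is expected to be the Java `CharSpanMapper` mentioned elsewhere in this RFC.

```python
def token_span_to_char_span(sentence_start, token_char_spans, token_span):
    """Map a half-open token-index span to a half-open document char span.

    sentence_start:   char offset of the sentence within raw_text
    token_char_spans: [(start, end)] per token, relative to the sentence
    token_span:       (first_token, end_token), half-open token indices
    """
    first, end = token_span
    if not 0 <= first < end <= len(token_char_spans):
        raise ValueError(f"token span {token_span} out of range")
    start_char = sentence_start + token_char_spans[first][0]
    end_char = sentence_start + token_char_spans[end - 1][1]
    return start_char, end_char


raw_text = "Hello. John Smith works here."
sentence_start = 7  # "John Smith works here." begins at offset 7
tokens = [(0, 4), (5, 10), (11, 16), (17, 21), (21, 22)]

# A name finder reporting tokens [0, 2) of the second sentence as a person:
start, end = token_span_to_char_span(sentence_start, tokens, (0, 2))
print((start, end), raw_text[start:end])
```

Because both spans are half-open, the end offset comes from the last included token (`end - 1`), and slicing `raw_text[start:end]` recovers the entity text directly.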
+---
+
+## 6. Model lifecycle
+
+### 6.1 Server-side models
+
+- Classic models: Java-serialized `.bin` in ZIP/JAR (unchanged).
+- Models are **never** sent inline in `AnalyzeDocumentRequest`.
+- Server loads from configurable directory/classpath (port sandbox 
`model.location`, wildcards).
+
+### 6.2 ModelBundleRef and discovery
+
+`ModelBundleRef` is a compact logical handle used in requests:
+
+```protobuf
+message ModelBundleRef {
+  string bundle_id = 1;
+  map<string, string> component_keys = 2;
+}
+```
+
+Example `component_keys`: `tokenizer`, `sentence_detector`, `pos`, 
`ner_person`, `ner_org`, `embed_minilm`, `langdetect`.
+
+Server config (or a model resolver) maps `bundle_id` → concrete 
artifacts/paths. Clients can send only `bundle_id` when using server-defined 
profiles.
+
+**Discovery (addresses community feedback)**: A bare `bundle_id` is not 
sufficient for clients to explore what is available. The service therefore 
exposes:
+
+- `GetServiceInfo` → high-level `available_profile_ids` and `supported_steps`.
+- `ListModelBundles` → `ListModelBundlesResponse` containing `ModelBundleInfo` 
entries.
+
+`ModelBundleInfo` / `ModelDescriptor` (see full proto in 11.2–11.3) are 
intended to carry enough metadata for real client discovery:
+
+- `locale` / language.
+- Component types present (e.g. "sentence_detector", "embed").
+- Supported or typical `PipelineStep` values this bundle is intended to serve.
+- Optional free-form capabilities or tags.
+
+Implementations should populate these fields so that a client can list 
bundles, filter by language or capability (e.g. "has an embed component"), and 
then pick a `bundle_id` or `profile_id`. The exact richness of the descriptors 
can grow over time without breaking v1 clients (additive fields only).
+
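A client-side filter over discovery results might look like the following sketch. The descriptors are flattened to plain dicts, and the keys `locale` and `component_types` are assumed names standing in for whatever `ModelBundleInfo` ends up exposing; `en-embed` and `de-default` are made-up bundle IDs.

```python
def select_bundles(bundles, locale=None, component_type=None):
    """Return bundle IDs matching the given language and component type."""
    selected = []
    for bundle in bundles:
        if locale is not None and bundle.get("locale") != locale:
            continue
        if (component_type is not None
                and component_type not in bundle.get("component_types", [])):
            continue
        selected.append(bundle["bundle_id"])
    return selected


# Hypothetical ListModelBundles result, flattened to dicts for illustration.
bundles = [
    {"bundle_id": "en-default", "locale": "en",
     "component_types": ["sentence_detector", "tokenizer", "pos"]},
    {"bundle_id": "en-embed", "locale": "en",
     "component_types": ["sentence_detector", "embed"]},
    {"bundle_id": "de-default", "locale": "de",
     "component_types": ["sentence_detector", "tokenizer"]},
]
print(select_bundles(bundles, locale="en", component_type="embed"))
```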
+In the sandbox implementation we will start with what the existing 
`ModelFinderUtil` + directory scanning can provide and extend it for ONNX 
embedding artifacts (model + vocab pairs) as first-class bundle components.
+
+### 6.3 Profiles
+
+Predefined profiles in server config (e.g. `en-basic`, `en-ner`):
+
+```ini
+profile.en-basic.bundle_id=en-default
+profile.en-basic.steps=SENTENCE_DETECT,TOKENIZE,POS_TAG
+```
+
+`GetServiceInfo` returns available `profile_id` values.
+
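To make the key layout concrete, here is a minimal sketch of reading such `profile.<id>.<attribute>` lines into a lookup table. The parser is hypothetical (the server's actual configuration loader is not specified in this RFC) and assumes profile IDs contain no dots.

```python
def parse_profiles(config_text):
    """Parse 'profile.<id>.<attr>=<value>' lines into {id: {attr: value}}."""
    profiles = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        parts = key.strip().split(".")
        if len(parts) != 3 or parts[0] != "profile":
            continue  # not a profile key
        _, profile_id, attr = parts
        entry = profiles.setdefault(profile_id, {})
        if attr == "steps":
            entry["steps"] = [s.strip() for s in value.split(",") if s.strip()]
        else:
            entry[attr] = value.strip()
    return profiles


config = """
profile.en-basic.bundle_id=en-default
profile.en-basic.steps=SENTENCE_DETECT,TOKENIZE,POS_TAG
"""
profiles = parse_profiles(config)
print(sorted(profiles))        # the profile_id values GetServiceInfo would list
print(profiles["en-basic"])
```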
+### 6.4 Thread safety
+
+OpenNLP 3.0 documents thread-safe `*ME` classes. The server holds **one 
instance per loaded model** in `ModelBundleCache`, shared across gRPC executor 
threads.
+
+### 6.5 Stateful NER and adaptive data
+
+Certain OpenNLP components (notably `NameFinderME` / `TokenNameFinder`) 
maintain "adaptive data" that can improve consistency *within a single 
document* (e.g., once "John" is tagged as a person early in a long text, later 
mentions of "John" can benefit from that context). 
`TokenNameFinder.clearAdaptiveData()` resets this state.
+
+In the gRPC contract:
+
+- `AnalysisOptions.clear_adaptive_data` (default: `true`) controls whether the 
server calls `clearAdaptiveData()` on applicable components **after** 
processing the current `AnalyzeDocument` request.
+- `true` (the default) ensures that each RPC is independent with respect to 
adaptive state. This matches the common expectation of a stateless 
document-centric API.
+- `false` leaves the adaptive state in the cached `*ME` instance for the 
bundle. A *sequence* of calls from the same logical client/session that target 
the same bundle can therefore benefit from cross-document (but 
within-"session") adaptive hints. This is an advanced, opt-in behavior and is 
not the normal mode for the 1:1 document contract.
+
+### 6.6 Stateless RPC contract
+
+Each `AnalyzeDocument` call is a self-contained, stateless operation on the 
wire: one `raw_text` document in, one enriched `OpenNlpDocument` (plus 
diagnostics) out. There is no session, cursor, or cross-call mutable state in 
the public contract.
+
+Adaptive data (6.5) and any internal caches (model instances, bundle 
resolution) are implementation details of the server-side processor and the 
specific OpenNLP components. They are scoped to a loaded bundle inside the 
server process and do not leak into the protobuf messages or require clients to 
manage server-side sessions.
+
+If a deployment needs stateful document sequences (for example, a long-running 
"conversation" or a large report split across multiple calls that should share 
NER adaptive data), it can do so by:
+
+- Using the same `bundle_id` / profile.
+- Setting `clear_adaptive_data=false` for the duration of the sequence.
+- Managing its own correlation (e.g. via `doc_id` or metadata) and eventually 
calling with `clear_adaptive_data=true` (or a fresh bundle context) to reset.
+
+Cross-client or long-lived shared mutable state across unrelated documents is 
strongly discouraged and outside the intended use of the v1 contract.
+
+---
+
+## 7. Error handling and partial results policy
+
+**gRPC status codes** are used for fatal failures that prevent a useful 
response:
+
+- `INVALID_ARGUMENT`: bad request (missing `raw_text`, malformed or invalid 
options, etc.).
+- `NOT_FOUND`: unknown `profile_id` or `bundle_id` that cannot be resolved.
+- `INTERNAL`: unrecoverable model or orchestration error after a step has 
started.
+
+**Per-step diagnostics** are always populated in 
`AnalyzeDocumentResponse.diagnostics` (even on success paths) for observability:
+
+- `INFO`: step skipped because it was not requested or not applicable.
+- `WARNING`: non-fatal issue (e.g. optional NER type had no model in the 
bundle; a provider fell back; low-confidence result).
+- `ERROR`: a step failed but the server chose (or was configured) to continue 
with partial results.
+
+**Partial results policy (addresses community feedback)**:
+
+- The contract favors **useful partial results** for non-critical failures so 
that clients (especially cross-language ones doing RAG pipelines) can still 
make progress.
+- If a **required** step (as determined by the `AnalysisProfile.steps` and 
server policy for that profile - e.g. SENTENCE_DETECT or TOKENIZE when later 
steps depend on them) fails with an ERROR diagnostic, the RPC **fails** with an 
appropriate gRPC status and the diagnostics attached (best-effort document may 
still be returned in the response for debugging, but callers must check status).
+- If an **optional** or **best-effort** step fails (e.g. a particular NER 
entity type, CHUNK when the profile treats it as enrichment, or an EMBED 
provider that is temporarily unavailable), the server returns `OK` (or the 
normal response code) with:
+  - The document populated as far as successful steps reached.
+  - One or more `ProcessingDiagnostic` entries with severity `ERROR` or 
`WARNING` and `component_key` identifying the failing piece (e.g. "ner_person", 
"embed_minilm").
+- Profiles and future `AnalysisOptions` (e.g. a `strict` flag) can influence 
what is treated as required vs. optional. The default is pragmatic partial 
success for enrichment steps.
+- Clients should always inspect `diagnostics` rather than assuming a 
successful status code means every requested step produced perfect output.
+
+This policy is intentionally documented early so that Python/Go/Rust/etc. 
clients written against v1 have predictable behavior. The sandbox 
implementation will include tests that exercise both full-success and 
partial-failure paths.
+
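The required-versus-optional rule can be expressed as a small decision function. This is a sketch under assumptions: the tuple shape for step results, the dict shape for diagnostics, and the choice of `INTERNAL` as the failing status are illustrative, and which steps count as required is left to the profile and server policy as described above.

```python
def decide_outcome(step_results, required_steps):
    """Return (status, diagnostics) for one AnalyzeDocument run.

    step_results:   ordered list of (step, component_key, error_or_None)
    required_steps: set of step names whose failure must fail the RPC
    """
    diagnostics = []
    fatal = False
    for step, component_key, error in step_results:
        if error is None:
            continue  # step succeeded, nothing to report in this sketch
        diagnostics.append({
            "severity": "ERROR",
            "step": step,
            "component_key": component_key,
            "message": error,
        })
        if step in required_steps:
            fatal = True
    return ("INTERNAL" if fatal else "OK"), diagnostics


required = {"SENTENCE_DETECT", "TOKENIZE"}

# Optional embedding provider is down: the RPC still succeeds.
status, diags = decide_outcome(
    [("SENTENCE_DETECT", "sentence_detector", None),
     ("TOKENIZE", "tokenizer", None),
     ("EMBED", "embed_minilm", "provider unavailable")],
    required,
)
print(status, [d["component_key"] for d in diags])
```

A client following this policy would treat `OK` as "inspect diagnostics next" rather than as proof that every requested step produced output.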
+---
+
+## 8. Versioning and compatibility
+
+- Package path includes `v1`; breaking changes require `v2` package.
+- Use `reserved` for removed fields; never reuse field numbers.
+- `GetServiceInfoResponse.api_version` reports proto/API version string (e.g. 
`"1.0.0"`).
+- Sandbox `opennlp` package services: **not** wire-compatible; migration guide 
in Phase 2.
+
+---
+
+## 9. Security and operations (deployment)
+
+Out of scope for proto, noted for implementers:
+
+- TLS termination at load balancer or Netty server
+- Model path isolation and read-only mounts
+- Resource limits (max `raw_text` size, deadline per RPC)
+- gRPC reflection optional (sandbox supported via config)
+
+---
+
+## 10. Future phases (implementation - not Phase 1)
+
+
+| Phase | Deliverable |
+| ----- | ----------- |
+| 2a    | `opennlp-grpc-api` module, codegen, proto tests (including CHUNK/EMBED shapes) |
+| 2b    | Pure-Java processor/orchestrator + `CharSpanMapper` + basic CHUNK (sentence+overlap segmentation) + EMBED via `SentenceVectorsDL`; unit tests |
+| 2c    | gRPC server host (delegating to processor), config for bundles/profiles, model discovery for both classic + ONNX embedding artifacts, integration tests, sandbox port + updated Python example |
+| 2d    | Graduate modules to `apache/opennlp` (after community review); optional core `Document` interface alignment if not already landed in 3.0.0-M4 |
+| 3     | Provider SPI hardening; first CUDA (`opennlp-dl-gpu`) and OpenVINO provider modules as separate optional builds; hot-swap / priority selection examples |
+| 4     | Richer bundle discovery (languages, supported steps per bundle in `ModelBundleInfo`), streaming variants of AnalyzeDocument or dedicated chunk/embed streams, additional steps (PARSE, LEMMATIZE, SENTIMENT, etc.) if not already in 2.x |
+| Later | Advanced semantic chunking driven by embeddings, more inference backends (DJL direct, remote KServe via the same provider abstraction), Buf Schema Registry publication, official multi-language client examples beyond Python |
+
+
+---
+
+## 11. Full protobuf definitions (Phase 1 deliverable)
+
+**Note:** The authoritative .proto sources that will be used to start the 
sandbox implementation
+live under `opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/`. The 
blocks below
+are the canonical text form for the RFC and are kept in sync with those files.
+
+### 11.1 `opennlp_document.proto`
+
+```protobuf
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements.  See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+// Canonical 1:1 NLP document: text in, annotations out.
+// The base linguistic structure (sentences + tokens + entities + classic chunks etc.)
+// is the shared "backbone" - computed once when any CHUNK or EMBED step is active.
+// Multiple independent groups of (chunks + embeddings) can be produced in one
+// AnalyzeDocument run. Shared NLP analysis is computed once; each group is
+// traceable by its chunk/embedding config or profile fragment and attaches
+// vectors via source spans (a chunk or sentence region can have embeddings
+// from more than one model).
+message OpenNlpDocument {
+  string doc_id = 1;
+  string raw_text = 2;
+  optional string detected_language = 3;
+  optional float language_confidence = 4;
+  repeated AnnotatedSentence sentences = 5;
+  optional DocumentAnalytics analytics = 6;
+  map<string, string> metadata = 7;
+  repeated EmbeddingResult embeddings = 8;  // denormalized "all embeddings with spans" view (optional convenience)
+  optional DocumentClassification classification = 9;
+
+  // Multiple chunk + embedding result groups from this analysis (the primary
+  // way to carry more than one chunking strategy in one document).
+  // Shared linguistic backbone (sentences above) is computed once; each group
+  // applies its chunking strategy and named embedding models independently.
+  repeated ChunkEmbeddingGroup chunk_embedding_groups = 10;
+}
+
+// A named, traceable group of chunks produced by one chunking strategy,
+// with the requested embedding models attached *inside* each chunk.
+//
+// Chunking strategy lives at the group level. Embeddings live inside the chunks
+// (multiple models per chunk, as named in the corresponding ChunkEmbedConfigEntry
+// for this strategy). Chunking always precedes the embedding attachment in the model.
+//
+// Each chunk therefore repeats its own text and span and carries its embeddings
+// as facets inside it.
+message ChunkEmbeddingGroup {
+  // Stable identifier for this group (from the request's config_id).
+  string group_id = 1;
+
+  // The chunking configuration / strategy that produced these chunks.
+  optional string chunk_config_id = 2;
+
+  // The embedding model IDs that were explicitly requested for this chunking
+  // strategy (copied from the request entry for traceability). Each Chunk in
+  // this group will have exactly these models' EmbeddingResult entries inside it
+  // (or a subset if some failed with diagnostics).
+  repeated string embedding_model_ids = 3;
+
+  // Optional human name for the result set.
+  optional string result_set_name = 4;
+
+  // The chunks for this group. Each chunk carries its span (into raw_text),
+  // optional tag, the actual text content (repeated for client convenience),
+  // and the multiple embedding models attached directly to it.
+  repeated Chunk chunks = 5;
+
+  // Optional per-group metadata (timing, counts, provenance, etc.).
+  map<string, string> metadata = 6;
+
+  // Primary granularity for the chunks/vectors in this group (CHUNK for
+  // segmentation-style groups, SENTENCE, etc.).
+  optional EmbeddingGranularity granularity = 7;
+}
+
+// A chunk (segmentation or otherwise) with its embeddings attached inside.
+// This is the "chunk owns its embedding models" shape (chunking first).
+message Chunk {
+  CharSpan char_span = 1;
+  optional string chunk_tag = 2;
+
+  // The text content of the chunk (substring of the document's raw_text).
+  // Repeated for convenience so clients (especially non-Java) do not have to
+  // slice the original document text. The authoritative location is still
+  // given by char_span over the top-level raw_text.
+  optional string text_content = 3;
+
+  // The embeddings for this chunk from the models named for the containing group.
+  // Multiple models per chunk are supported and expected when the chunking
+  // strategy requested several embedding_model_ids.
+  repeated EmbeddingResult embeddings = 4;
+
+  // Optional: if this chunk overlaps or contains specific sentences, the
+  // indices (0-based into the document's sentences list) can be recorded here
+  // for easy navigation without re-computing overlaps from spans.
+  repeated int32 contained_sentence_indices = 5;
+}
+
+message AnnotatedSentence {
+  CharSpan sentence_span = 1;
+  repeated Token tokens = 2;
+  repeated NamedEntity entities = 3;
+  optional ChunkResult syntactic_chunks = 4;
+  optional ParseTree parse_tree = 5;
+  optional string sentiment_label = 6;
+  optional float sentiment_confidence = 7;
+}
+
+message Token {
+  string text = 1;
+  CharSpan char_span = 2;
+  optional string pos_tag = 3;
+  optional string lemma = 4;
+  optional float pos_probability = 5;
+}
+
+message NamedEntity {
+  CharSpan char_span = 1;
+  string entity_type = 2;
+  optional double probability = 3;
+}
+
+message CharSpan {
+  int32 start = 1;
+  int32 end = 2;
+  CoordinateSpace space = 3;
+  optional string type = 4;
+  optional double probability = 5;
+}
+
+enum CoordinateSpace {
+  COORDINATE_SPACE_UNSPECIFIED = 0;
+  CHAR_DOCUMENT = 1;
+  TOKEN_SENTENCE = 2;
+}
+
+message DocumentAnalytics {
+  int32 total_tokens = 1;
+  int32 total_sentences = 2;
+  float noun_density = 3;
+  float verb_density = 4;
+  float adjective_density = 5;
+  float adverb_density = 6;
+  float content_word_ratio = 7;
+  int32 unique_lemma_count = 8;
+  float lexical_density = 9;
+}
+
+// Lightweight syntactic chunks (e.g. from ChunkerME) attached per AnnotatedSentence.
+// Distinct from the configurable segmentation chunks in ChunkEmbeddingGroup.
+message ChunkResult {
+  repeated ChunkSpan chunks = 1;
+}
+
+message ChunkSpan {
+  CharSpan char_span = 1;
+  string chunk_tag = 2;
+}
+
+message ParseTree {
+  ParseNode root = 1;
+}
+
+message ParseNode {
+  string label = 1;
+  CharSpan span = 2;
+  repeated ParseNode children = 3;
+  optional double probability = 4;
+}
+
+message EmbeddingResult {
+  string model_id = 1;
+  repeated float vector = 2;
+  CharSpan source_span = 3;
+  EmbeddingGranularity granularity = 4;
+}
+
+enum EmbeddingGranularity {
+  EMBEDDING_GRANULARITY_UNSPECIFIED = 0;
+  DOCUMENT = 1;
+  SENTENCE = 2;
+  // Embeddings attached to (segmentation or syntactic) chunks produced by a CHUNK step
+  // or by a ChunkEmbeddingGroup. This enables the "one chunk, multiple embedding aspects"
+  // and multi-group use case (different chunker configs or embed models can each
+  // produce their own group with CHUNK-granularity vectors).
+  CHUNK = 3;
+  // Future: paragraph, section, or custom spans. Consumers should match on this
+  // enum (plus group metadata) rather than string parsing of config ids.
+  reserved 4 to 10;
+}
+
+message DocumentClassification {
+  string best_category = 1;
+  map<string, double> category_scores = 2;
+}
+```
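
The span contract behind `Chunk.text_content` and `CharSpan` can be sketched from the client side. The Python sketch below is illustrative only: it assumes spans are half-open `[start, end)` character offsets into `raw_text`, and it works on plain dictionaries that use the snake_case field names from the JSON examples in §12. The helper names `span_text` and `check_chunk` are hypothetical and not part of the proposed API. One caveat worth checking per client: Python indexes strings by code point, while Java `String` offsets count UTF-16 code units, so the two agree only for text without supplementary characters.

```python
def span_text(raw_text: str, span: dict) -> str:
    """Return the substring addressed by a half-open [start, end) span."""
    start, end = span["start"], span["end"]
    if not 0 <= start <= end <= len(raw_text):
        raise ValueError(f"span [{start}, {end}) is outside the text")
    return raw_text[start:end]


def check_chunk(raw_text: str, chunk: dict) -> bool:
    """True when the chunk's convenience copy matches its authoritative span."""
    expected = span_text(raw_text, chunk["char_span"])
    # text_content is optional; a missing copy is not treated as a mismatch.
    return chunk.get("text_content", expected) == expected


raw_text = "John works at OpenNLP in New York. The team builds NLP tools."
chunk = {
    "char_span": {"start": 0, "end": 34, "space": "CHAR_DOCUMENT"},
    "text_content": "John works at OpenNLP in New York.",
}
print(check_chunk(raw_text, chunk))
```

A check like this is cheap to run in client tests, since `char_span` is the authoritative location and `text_content` is only a convenience copy.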
+
+### 11.2 `opennlp_pipeline.proto`
+
+```protobuf
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+import "opennlp_document.proto";
+
+enum PipelineStep {
+  PIPELINE_STEP_UNSPECIFIED = 0;
+  LANGUAGE_DETECT = 1;
+  SENTENCE_DETECT = 2;
+  TOKENIZE = 3;
+  POS_TAG = 4;
+  NER = 5;
+  CHUNK = 6;
+  PARSE = 7;
+  LEMMATIZE = 8;
+  DOC_CATEGORIZE = 9;
+  SENTIMENT = 10;
+  EMBED = 11;
+}
+
+enum POSTagFormat {
+  POS_TAG_FORMAT_UNSPECIFIED = 0;
+  UD = 1;
+  PENN = 2;
+  CUSTOM = 3;
+}
+
+enum InferenceBackend {
+  INFERENCE_BACKEND_UNSPECIFIED = 0;
+  OPENNLP_ME = 1;
+  ONNX_RUNTIME = 2;
+  ONNX_RUNTIME_GPU = 3;
+  reserved 4 to 9;
+  reserved "OPENVINO", "DJL";
+}
+
+message AnalysisProfile {
+  string profile_id = 1;
+  repeated PipelineStep steps = 2;
+  ModelBundleRef model_bundle = 3;
+  POSTagFormat pos_tag_format = 4;
+  repeated string ner_entity_types = 5;
+}
+
+message ModelBundleRef {
+  string bundle_id = 1;
+  map<string, string> component_keys = 2;
+}
+
+message AnalysisOptions {
+  bool include_probabilities = 1;
+  bool clear_adaptive_data = 2;
+  InferenceBackend inference_backend = 3;
+  optional int32 max_text_length = 4;
+  optional string onnx_embedding_model_id = 5;
+}
+
+message ModelDescriptor {
+  string hash = 1;
+  string name = 2;
+  string locale = 3;
+  string component_type = 4;
+  // Discovery aids (additive; populated by server for ListModelBundles)
+  repeated string languages = 5;           // e.g. ["en", "eng"]
+  repeated PipelineStep supported_steps = 6;
+  map<string, string> attributes = 7;      // free-form (e.g. "dim":"384", "task":"embed")
+}
+
+message ModelBundleInfo {
+  string bundle_id = 1;
+  repeated ModelDescriptor models = 2;
+  // Optional aggregated view for convenience
+  repeated string supported_languages = 3;
+  repeated PipelineStep supported_steps = 4;
+}
+```
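
The "optional aggregated view" on `ModelBundleInfo` is not specified further in this section. One plausible reading, sketched below in Python over plain dictionaries, is a first-seen-order union of each model's `languages` and `supported_steps`. The union semantics, the helper name `aggregate_bundle`, and the sample model names are assumptions made for illustration; a server might instead choose an intersection or a hand-curated list.

```python
def aggregate_bundle(bundle_id: str, models: list[dict]) -> dict:
    """Build a ModelBundleInfo-shaped dict with de-duplicated aggregate fields."""
    languages: list[str] = []
    steps: list[str] = []
    for model in models:
        for lang in model.get("languages", []):
            if lang not in languages:
                languages.append(lang)
        for step in model.get("supported_steps", []):
            if step not in steps:
                steps.append(step)
    return {
        "bundle_id": bundle_id,
        "models": models,
        "supported_languages": languages,
        "supported_steps": steps,
    }


# Hypothetical descriptors, shaped like ModelDescriptor.
models = [
    {"name": "en-token", "languages": ["en", "eng"], "supported_steps": ["TOKENIZE"]},
    {"name": "en-pos", "languages": ["en"], "supported_steps": ["TOKENIZE", "POS_TAG"]},
]
info = aggregate_bundle("en-bundle", models)
print(info["supported_languages"], info["supported_steps"])
```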
+
+### 11.3 `opennlp_service.proto`
+
+```protobuf
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+import "opennlp_document.proto";
+import "opennlp_pipeline.proto";
+
+service OpenNlpAnalysisService {
+  rpc AnalyzeDocument(AnalyzeDocumentRequest) returns (AnalyzeDocumentResponse);
+  rpc GetServiceInfo(GetServiceInfoRequest) returns (GetServiceInfoResponse);
+  rpc ListModelBundles(ListModelBundlesRequest) returns (ListModelBundlesResponse);
+}
+
+message AnalyzeDocumentRequest {
+  OpenNlpDocument document = 1;
+  // Single profile (classic usage). When multi-config is used (see below),
+  // this may be omitted or treated as a default/base profile.
+  AnalysisProfile profile = 2;
+  AnalysisOptions options = 3;
+  optional string profile_id = 4;
+
+  // Multiple chunk/embed (or full profile) configurations for a single run.
+  //
+  // Chunking strategy always comes first. For each chunking strategy (config entry)
+  // the caller explicitly names the embedding models to apply to the chunks produced
+  // by that strategy. This is *not* an automatic full NxM cartesian product across
+  // all chunkers and all embedders unless the caller requests it.
+  //
+  // The server runs the common linguistic pipeline steps (SENTENCE_DETECT, TOKENIZE,
+  // POS_TAG, NER, etc.) only once (shared base structure in AnnotatedSentence), then
+  // for each requested chunking strategy produces a ChunkEmbeddingGroup containing
+  // the chunks (with the actual chunk text/span repeated for convenience) and the
+  // requested embeddings attached *inside* each chunk.
+  //
+  // See Chunk and ChunkEmbeddingGroup in opennlp_document.proto.
+  repeated ChunkEmbedConfigEntry chunk_embed_configs = 5;
+}
+
+// Entry for a single chunking strategy + the specific embeddings wanted for it.
+// Chunking first; embeddings are named per strategy.
+message ChunkEmbedConfigEntry {
+  // Stable id for the group that will be produced (becomes group_id).
+  // Example: "body-token-512" or "title-sentences".
+  string config_id = 1;
+
+  // Optional display name for the result set (e.g. "body_chunks_minilm").
+  optional string result_set_name = 2;
+
+  // Full profile (if you need complex step composition) OR the lightweight chunking spec.
+  optional AnalysisProfile profile = 3;
+  optional ChunkingSpec chunking = 4;
+
+  // Explicit list of embedding models to run for the chunks of *this* chunking strategy.
+  // The reply will attach exactly these models' vectors inside each Chunk in the group.
+  // Order here can be used to order the repeated embeddings on each chunk if desired.
+  repeated string embedding_model_ids = 5;
+}
+
+message AnalyzeDocumentResponse {
+  OpenNlpDocument document = 1;
+  repeated ProcessingDiagnostic diagnostics = 2;
+}
+
+message ProcessingDiagnostic {
+  PipelineStep step = 1;
+  string message = 2;
+  DiagnosticSeverity severity = 3;
+  optional string component_key = 4;
+}
+
+enum DiagnosticSeverity {
+  DIAGNOSTIC_SEVERITY_UNSPECIFIED = 0;
+  INFO = 1;
+  WARNING = 2;
+  ERROR = 3;
+}
+
+message GetServiceInfoRequest {}
+
+message GetServiceInfoResponse {
+  string opennlp_version = 1;
+  string api_version = 2;
+  repeated string available_profile_ids = 3;
+  repeated PipelineStep supported_steps = 4;
+}
+
+message ListModelBundlesRequest {}
+
+message ListModelBundlesResponse {
+  repeated ModelBundleInfo bundles = 1;
+}
+```
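
The fan-out described in the `chunk_embed_configs` comment (shared backbone once, then one group per chunking strategy with only its named embedding models) can be sketched as follows. This is a toy Python model with stub callables, not the server implementation: `analyze`, `stub_backbone`, the `WHOLE_DOCUMENT` strategy name, and the fake embedders are all hypothetical stand-ins.

```python
def analyze(raw_text, configs, backbone, chunkers, embedders):
    """Run the shared backbone once, then build one group per config entry."""
    sentences = backbone(raw_text)  # shared linguistic analysis, computed once
    groups = []
    for cfg in configs:
        chunker = chunkers[cfg["chunking"]["strategy"]]
        chunks = []
        for start, end in chunker(raw_text, sentences):
            text = raw_text[start:end]
            chunks.append({
                "char_span": {"start": start, "end": end, "space": "CHAR_DOCUMENT"},
                "text_content": text,
                # Only the models named for this strategy, in request order.
                "embeddings": [
                    {"model_id": m, "vector": embedders[m](text), "granularity": "CHUNK"}
                    for m in cfg["embedding_model_ids"]
                ],
            })
        groups.append({
            "group_id": cfg["config_id"],
            "embedding_model_ids": list(cfg["embedding_model_ids"]),
            "chunks": chunks,
        })
    return groups


# Stubs standing in for real models (all hypothetical).
backbone_calls = []


def stub_backbone(text):
    backbone_calls.append(text)
    return [(0, 34), (35, 61)]  # sentence spans for the sample text below


chunkers = {
    "SENTENCE": lambda text, sentences: sentences,
    "WHOLE_DOCUMENT": lambda text, sentences: [(0, len(text))],
}
embedders = {
    "minilm-l6-v2": lambda text: [float(len(text))],
    "e5-small": lambda text: [float(text.count(" "))],
}
configs = [
    {"config_id": "sentence-chunks", "chunking": {"strategy": "SENTENCE"},
     "embedding_model_ids": ["minilm-l6-v2"]},
    {"config_id": "whole-doc", "chunking": {"strategy": "WHOLE_DOCUMENT"},
     "embedding_model_ids": ["minilm-l6-v2", "e5-small"]},
]
text = "John works at OpenNLP in New York. The team builds NLP tools."
groups = analyze(text, configs, stub_backbone, chunkers, embedders)
print(len(backbone_calls), [len(g["chunks"]) for g in groups])
```

In this toy run the backbone is invoked once for two strategies, and each group carries only the models its own config entry named, which is the non-cartesian behaviour the comment describes.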
+
+### 11.4 Reserved legacy package (optional, Phase 2 discussion)
+
+If the PMC requires sandbox compatibility, move existing services to:
+
+`package org.apache.opennlp.grpc.legacy.v1;` - **unchanged wire format** from sandbox for one release, deprecated in favor of `OpenNlpAnalysisService`.
+
+---
+
+## 12. Example request/response
+
+### 12.1 Basic profile request (JSON representation for documentation)
+
+```json
+{
+  "document": {
+    "doc_id": "doc-001",
+    "raw_text": "John works at OpenNLP in New York.",
+    "metadata": { "source": "example" }
+  },
+  "profile_id": "en-basic",
+  "options": {
+    "include_probabilities": true,
+    "clear_adaptive_data": true
+  }
+}
+```
+
+### 12.2 Multi-group chunk + embed request
+
+Two chunking strategies, each with explicitly named embedding models (not an automatic cartesian product):
+
+```json
+{
+  "document": {
+    "doc_id": "doc-002",
+    "raw_text": "John works at OpenNLP in New York. The team builds NLP tools."
+  },
+  "profile_id": "en-basic",
+  "chunk_embed_configs": [
+    {
+      "config_id": "sentence-chunks",
+      "chunking": { "strategy": "SENTENCE" },
+      "embedding_model_ids": ["minilm-l6-v2"]
+    },
+    {
+      "config_id": "fixed-window",
+      "chunking": { "strategy": "FIXED_CHAR", "max_chars": 128, "overlap_chars": 16 },
+      "embedding_model_ids": ["minilm-l6-v2", "e5-small"]
+    }
+  ]
+}
+```
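
The `FIXED_CHAR` entry above names `max_chars` and `overlap_chars`, but this section does not define how the windows are laid out. The Python sketch below shows one common interpretation, stated here as an assumption: each window is at most `max_chars` long and starts `max_chars - overlap_chars` characters after the previous one. The function name `fixed_char_spans` is hypothetical.

```python
def fixed_char_spans(text_len: int, max_chars: int, overlap_chars: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) windows covering [0, text_len)."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be in [0, max_chars)")
    stride = max_chars - overlap_chars
    spans = []
    start = 0
    while start < text_len:
        end = min(start + max_chars, text_len)
        spans.append((start, end))
        if end == text_len:
            break  # the last window reached the end of the text
        start += stride
    return spans


text = "John works at OpenNLP in New York. The team builds NLP tools."
print(fixed_char_spans(len(text), max_chars=128, overlap_chars=16))
print(fixed_char_spans(300, max_chars=128, overlap_chars=16))
```

Under this reading, the 61-character sample document fits in a single window, so the `fixed-window` group would contain one chunk spanning the whole text.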
+
+### 12.3 Response (excerpt - multi-group)
+
+```json
+{
+  "document": {
+    "doc_id": "doc-002",
+    "raw_text": "John works at OpenNLP in New York. The team builds NLP tools.",
+    "detected_language": "eng",
+    "sentences": [
+      {
+        "sentence_span": { "start": 0, "end": 34, "space": "CHAR_DOCUMENT" },
+        "tokens": [
+          { "text": "John", "char_span": { "start": 0, "end": 4, "space": "CHAR_DOCUMENT" }, "pos_tag": "PROPN" }
+        ],
+        "entities": [
+          { "char_span": { "start": 0, "end": 4, "space": "CHAR_DOCUMENT" }, "entity_type": "person" }
+        ]
+      }
+    ],
+    "chunk_embedding_groups": [
+      {
+        "group_id": "sentence-chunks",
+        "chunk_config_id": "sentence-chunks",
+        "embedding_model_ids": ["minilm-l6-v2"],
+        "chunks": [
+          {
+            "char_span": { "start": 0, "end": 34, "space": "CHAR_DOCUMENT" },
+            "text_content": "John works at OpenNLP in New York.",
+            "embeddings": [
+              {
+                "model_id": "minilm-l6-v2",
+                "vector": [0.12, -0.04, 0.33],
+                "source_span": { "start": 0, "end": 34, "space": "CHAR_DOCUMENT" },
+                "granularity": "CHUNK"
+              }
+            ]
+          }
+        ]
+      },
+      {
+        "group_id": "fixed-window",
+        "chunk_config_id": "fixed-window",
+        "embedding_model_ids": ["minilm-l6-v2", "e5-small"],
+        "chunks": [
+          {
+            "char_span": { "start": 0, "end": 61, "space": "CHAR_DOCUMENT" },
+            "text_content": "John works at OpenNLP in New York. The team builds NLP tools.",
+            "embeddings": [
+              { "model_id": "minilm-l6-v2", "vector": [0.11, -0.03, 0.31], "granularity": "CHUNK" },
+              { "model_id": "e5-small", "vector": [0.09, 0.02, 0.28], "granularity": "CHUNK" }
+            ]
+          }
+        ]
+      }
+    ]
+  },
+  "diagnostics": []
+}
+```
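
Because a group may carry "a subset" of the requested models when some fail, a client will likely want to detect missing vectors per chunk. The Python sketch below does that over a response shaped like the excerpt above; `missing_models` is a hypothetical helper, and how failures are surfaced in `diagnostics` is outside what this sketch assumes.

```python
def missing_models(group: dict) -> dict[int, list[str]]:
    """Map chunk index -> requested model ids that have no embedding on that chunk."""
    requested = group.get("embedding_model_ids", [])
    missing: dict[int, list[str]] = {}
    for index, chunk in enumerate(group.get("chunks", [])):
        present = {e["model_id"] for e in chunk.get("embeddings", [])}
        absent = [m for m in requested if m not in present]
        if absent:
            missing[index] = absent
    return missing


complete_group = {
    "group_id": "fixed-window",
    "embedding_model_ids": ["minilm-l6-v2", "e5-small"],
    "chunks": [
        {"embeddings": [
            {"model_id": "minilm-l6-v2", "vector": [0.11, -0.03, 0.31]},
            {"model_id": "e5-small", "vector": [0.09, 0.02, 0.28]},
        ]},
    ],
}
partial_group = {
    "group_id": "fixed-window",
    "embedding_model_ids": ["minilm-l6-v2", "e5-small"],
    "chunks": [
        {"embeddings": [{"model_id": "minilm-l6-v2", "vector": [0.11, -0.03, 0.31]}]},
    ],
}
print(missing_models(complete_group))
print(missing_models(partial_group))
```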
+
+---
+
+## 13. Mapping to Java API (implementation reference)
+
+
+| PipelineStep    | Java type                                         |
+| --------------- | ------------------------------------------------- |
+| LANGUAGE_DETECT | `LanguageDetectorME`                              |
+| SENTENCE_DETECT | `SentenceDetectorME`                              |
+| TOKENIZE        | `TokenizerME`                                     |
+| POS_TAG         | `POSTaggerME`                                     |
+| NER             | `NameFinderME` (per type)                         |
+| CHUNK           | `ChunkerME`                                       |
+| PARSE           | `Parser`                                          |
+| LEMMATIZE       | `LemmatizerME`                                    |
+| DOC_CATEGORIZE  | `DocumentCategorizerME` / `DocumentCategorizerDL` |
+| SENTIMENT       | `SentimentME`                                     |
+| EMBED           | `SentenceVectorsDL`                               |
+
+
+---
+
+## 14. Open questions
+
+1. Maximum `raw_text` size - fixed limit vs streaming (streaming deferred).
+2. `profile_id` vs inline `AnalysisProfile` - both supported; precedence rule: does the inline profile override the server profile when `profile_id` is also set?
+3. Batch RPC `AnalyzeDocuments` for throughput - v1 or v2?
+4. Publish protos to Buf BSR under `buf.build/apache/opennlp`?
+
+---
+
+## 15. Changelog
+
+
+| Version | Date       | Changes |
+| ------- | ---------- | ------- |
+| 0.5     | 2026-06-06 | Expand conversational Summary: motivation, what the document-centric gRPC API unlocks (polyglot integration, streaming, shared infrastructure, search/RAG). |
+| 0.4     | 2026-06-06 | Restructure into Part I (overview + target architecture diagram) and Part II (specs). Remove external platform references. Clarify chunk-first / embeddings-inside-chunk model. Fix duplicate proto appendix messages. Add multi-group §12 examples. |
+| 0.3     | 2026-06-06 | Canonical sandbox doc; multi-group chunk+embed (`ChunkEmbeddingGroup`, `ChunkEmbedConfigEntry`, `EmbeddingGranularity.CHUNK`). |
+| 0.2     | 2026-06-06 | Incorporate initial dev@ feedback (Martin Wiesner, Richard Zowalla): neutral core `Document` interface proposal; sandbox-first + Maven only; retain legacy granular services; target 3.1.x more likely; make CHUNK + EMBED explicit v1 with GPU hot-swap provider story; expand ModelBundle discovery; define partial-results policy; clarify stateless contract vs. adaptive data; update goals, background, phases, and add Community RFC feedback section. |
+| 0.1     | 2026-05-21 | Initial Phase 1 design + full protos |
+
+
diff --git a/opennlp-grpc/docs/rfc/opennlp-grpc-jira-proposal.md b/opennlp-grpc/docs/rfc/opennlp-grpc-jira-proposal.md
new file mode 100644
index 00000000..0e27028b
--- /dev/null
+++ b/opennlp-grpc/docs/rfc/opennlp-grpc-jira-proposal.md
@@ -0,0 +1,233 @@
+# JIRA Proposal: Document-Centric gRPC API for Apache OpenNLP 3.x
+
+> **Copy-paste guide:** Use the sections below when filing an issue at https://issues.apache.org/jira/projects/OPENNLP  
+> **Issue type:** Improvement / New Feature  
+> **Component:** (suggest) `grpc` or `server` if available; otherwise leave default  
+> **Affects Version:** 3.0.0-SNAPSHOT  
+> **Labels:** `grpc`, `rfc`, `api-design`
+
+---
+
+## Summary (JIRA title field)
+
+**Add document-centric gRPC API - evolve opennlp-sandbox POC with canonical OpenNlpDocument and AnalyzeDocument RPC**
+
+---
+
+## Description (paste into JIRA description)
+
+### Problem
+
+Apache OpenNLP is primarily an **in-process Java library** (API, CLI, UIMA). The README notes embedding in distributed pipelines (Flink, NiFi, Spark), but there is **no standard wire contract** for cross-language clients or remote inference.
+
+A proof-of-concept exists in the sandbox:
+
+- **Repository:** https://github.com/apache/opennlp-sandbox/tree/main/opennlp-grpc
+- **Current scope:** Three separate gRPC services (`SentenceDetectorService`, `TokenizerTaggerService`, `PosTaggerService`) with string-based requests and `model_hash` per call
+- **Gap:** No unified **document** message and no pipeline orchestration, so clients must chain multiple RPCs. The proposal brings NER, chunking (configurable segmentation + classic ChunkerME), and embeddings (via SentenceVectorsDL + pluggable GPU providers) into the single document-centric contract as first-class steps.
+
+Main OpenNLP (`apache/opennlp`) has **no gRPC modules** on `main`. OpenNLP 3.0 brings thread-safe `*ME` classes (JDK 21+), which makes a long-lived gRPC server practical. The `opennlp-dl` / `opennlp-dl-gpu` modules already support ONNX inference (including sentence embeddings via `SentenceVectorsDL`).
+
+### Proposal
+
+Evolve the sandbox POC into ASF-native modules (target: main repo after consensus):
+
+| Module | Purpose |
+|--------|---------|
+| `opennlp-grpc-api` | Protocol Buffers + generated stubs (Java first; descriptors for other languages) |
+| `opennlp-grpc-server` | gRPC server, model bundle registry, pipeline orchestration |
+| `opennlp-grpc-examples` | Sample clients (e.g. Python) |
+
+**Core API change:** Introduce a canonical **`OpenNlpDocument`** message (1:1 text document in, enriched document out) and a primary **`AnalyzeDocument`** RPC that runs a configurable NLP pipeline server-side, similar in spirit to the existing UIMA `OpenNlpTextAnalyzer` composite, but as a language-neutral contract.
+
+**Package naming (proposed):** `org.apache.opennlp.grpc.v1`
+
+### Non-goals (v1 RFC)
+
+- Binary/PDF document parsing (Tika, etc.) - callers supply `raw_text`
+- Training, evaluation, or model-update RPCs
+- Embedding `.bin` model bytes in request messages (models remain server-side)
+- Authentication / multi-tenancy in the core API (deployment concern: mTLS, reverse proxy)
+- Coreference (documented in manual but not implemented in current codebase)
+
+### Compatibility
+
+- **Additive** Maven modules; no breaking changes to `opennlp-api` / `opennlp-runtime`
+- Sandbox granular services may be deprecated or moved to `org.apache.opennlp.grpc.legacy.v1` after migration
+
+### Phased delivery (high level)
+
+| Phase | Scope |
+|-------|--------|
+| **0** | This JIRA + community RFC (this ticket) |
+| **1** | Design document + full `.proto` definitions (no server code required for consensus) |
+| **2+** | Implementation: orchestrator, server, tests, graduation from sandbox to main repo |
+| **Later** | Advanced GPU provider modules (CUDA via onnxruntime-gpu, OpenVINO), richer discovery, streaming, additional steps; core `Document` interface graduation if not in 3.0.0-M4 |
+
+### Design highlights
+
+1. **Three proto layers (NLP-only):** domain types (`OpenNlpDocument`), pipeline config (`AnalysisProfile`), service (`OpenNlpAnalysisService`)
+2. **Offset contract:** All exported spans use **character offsets in the original `raw_text`** (`CHAR_DOCUMENT`), half-open `[start, end)` matching `opennlp.tools.util.Span`
+3. **Model bundles:** Replace per-RPC `model_hash` with `ModelBundleRef` + server-defined profiles (reuse sandbox model discovery patterns)
+4. **Thread safety:** Leverage OpenNLP 3.0 thread-safe `*ME` instances cached per model bundle
+
+### Sample protobuf (illustrative - full spec in design doc)
+
+The following is a **short sketch** for discussion; field numbers and optional messages may change during the RFC.
+
+**Important (per community feedback on OPENNLP-1833):** Chunking and embeddings are **in scope for v1**, not deferred. The full protobuf definitions (including `PipelineStep.CHUNK` and `EMBED`, `ChunkResult`/`ChunkSpan`, `EmbeddingResult`, `InferenceBackend`, richer `ModelBundleInfo` for discovery, etc.) live in the companion design document `docs/rfc/opennlp-grpc-design.md`. The short sketch below is intentionally minimal. GPU hot-swap (CUDA, OpenVINO) is achieved via a provider SPI beh [...]
+
+```protobuf
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+// --- Layer 1: Document ---
+
+message OpenNlpDocument {
+  string doc_id = 1;
+  string raw_text = 2;
+  optional string detected_language = 3;
+  optional float language_confidence = 4;
+  repeated AnnotatedSentence sentences = 5;
+  map<string, string> metadata = 6;
+}
+
+message AnnotatedSentence {
+  CharSpan sentence_span = 1;
+  repeated Token tokens = 2;
+  repeated NamedEntity entities = 3;
+}
+
+message Token {
+  string text = 1;
+  CharSpan char_span = 2;
+  optional string pos_tag = 3;
+}
+
+message NamedEntity {
+  CharSpan char_span = 1;
+  string entity_type = 2;
+  optional double prob = 3;
+}
+
+message CharSpan {
+  int32 start = 1;
+  int32 end = 2;
+  CoordinateSpace space = 3;
+  optional string type = 4;
+  optional double prob = 5;
+}
+
+enum CoordinateSpace {
+  COORDINATE_SPACE_UNSPECIFIED = 0;
+  CHAR_DOCUMENT = 1;
+}
+
+// --- Layer 2: Pipeline ---
+
+enum PipelineStep {
+  PIPELINE_STEP_UNSPECIFIED = 0;
+  LANGUAGE_DETECT = 1;
+  SENTENCE_DETECT = 2;
+  TOKENIZE = 3;
+  POS_TAG = 4;
+  NER = 5;
+}
+
+message AnalysisProfile {
+  string profile_id = 1;
+  repeated PipelineStep steps = 2;
+  ModelBundleRef model_bundle = 3;
+}
+
+message ModelBundleRef {
+  string bundle_id = 1;
+}
+
+message AnalysisOptions {
+  bool include_probabilities = 1;
+  bool clear_adaptive_data = 2;
+}
+
+// --- Layer 3: Service ---
+
+service OpenNlpAnalysisService {
+  rpc AnalyzeDocument(AnalyzeDocumentRequest) returns (AnalyzeDocumentResponse);
+  rpc GetServiceInfo(GetServiceInfoRequest) returns (GetServiceInfoResponse);
+}
+
+message AnalyzeDocumentRequest {
+  OpenNlpDocument document = 1;
+  AnalysisProfile profile = 2;
+  AnalysisOptions options = 3;
+}
+
+message AnalyzeDocumentResponse {
+  OpenNlpDocument document = 1;
+  repeated ProcessingDiagnostic diagnostics = 2;
+}
+
+message ProcessingDiagnostic {
+  PipelineStep step = 1;
+  string message = 2;
+  DiagnosticSeverity severity = 3;
+}
+
+enum DiagnosticSeverity {
+  DIAGNOSTIC_SEVERITY_UNSPECIFIED = 0;
+  INFO = 1;
+  WARNING = 2;
+  ERROR = 3;
+}
+
+message GetServiceInfoRequest {}
+message GetServiceInfoResponse {
+  string opennlp_version = 1;
+  string api_version = 2;
+  repeated string available_profile_ids = 3;
+}
+```
+
+### Comparison: sandbox vs proposed
+
+| Aspect | Sandbox POC | Proposed |
+|--------|---------------|----------|
+| Services | 3 (sent / token / POS) | 1 primary (`OpenNlpAnalysisService`) |
+| I/O | Strings + `StringList` | `OpenNlpDocument` |
+| Models | `model_hash` per RPC | `ModelBundleRef` + profiles |
+| Pipeline | Client-side chaining | Server-side `AnalysisProfile` |
+| Package | `package opennlp` | `org.apache.opennlp.grpc.v1` |
+
+### References
+
+- Sandbox POC: https://github.com/apache/opennlp-sandbox/tree/main/opennlp-grpc
+- Current sandbox proto: https://github.com/apache/opennlp-sandbox/blob/main/opennlp-grpc/opennlp-grpc-api/opennlp.proto
+- UIMA composite pipeline: `opennlp-extensions/opennlp-uima/descriptors/OpenNlpTextAnalyzer.xml`
+- ONNX / GPU: `opennlp-dl`, `opennlp-dl-gpu`, `SentenceVectorsDL`
+- Full design document (companion): `docs/rfc/opennlp-grpc-design.md` in contributor branch or attachment
+
+### Questions for the community (with initial feedback summary)
+
+1. Should v1 expose **only** `AnalyzeDocument`, or retain sandbox granular RPCs under a legacy package?
+   - **Community preference (Martin + consensus direction):** Retain the existing granular services under a legacy package (`org.apache.opennlp.grpc.legacy.v1` or similar) for a transition period. New development and clients should use the primary document-centric `OpenNlpAnalysisService`.
+
+2. Target release: **3.0.x** (additive) vs **3.1**?
+   - **Community view (Martin):** More likely **3.1.x**. 3.0.0 is approaching a release (target end of June / early July 2026 or shortly thereafter). The gRPC work is substantial and additive; landing it after the 3.0 cut reduces risk.
+
+3. Preferred home: graduate into **apache/opennlp** vs remain in **opennlp-sandbox** until stable?
+   - **Community direction (Martin):** Start and iterate in the **opennlp-sandbox** (as is already underway on the feature branch). Graduate stable modules into `apache/opennlp` in future cycles once the design has had review and the implementation has proven itself. A neutral core `Document` interface (if adopted) could land earlier in 3.0.0-M4 as a small additive API change.
+
+4. Proto tooling: Maven `protobuf-maven-plugin` only, or also publish to Buf Schema Registry?
+   - **Strong community preference (Martin, Richard):** Stay with **Maven + protobuf-maven-plugin** only for consistency with the rest of the OpenNLP project. No Gradle. Buf publication can be considered later as a non-blocking enhancement.
+
+---
+
+## Reporter notes (do not paste)
+
+- Attach or link `docs/rfc/opennlp-grpc-design.md` when available
+- Discuss on [email protected] after filing
+- Link this JIRA from any sandbox PR that implements the new protos
diff --git a/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_document.proto b/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_document.proto
new file mode 100644
index 00000000..c6e9e2f6
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_document.proto
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+// Canonical 1:1 NLP document: text in, annotations out.
+// The sentences list provides the shared base linguistic analysis
+// (computed once even when multiple chunk+embed groups are requested).
+message OpenNlpDocument {
+  string doc_id = 1;
+  string raw_text = 2;
+  optional string detected_language = 3;
+  optional float language_confidence = 4;
+  repeated AnnotatedSentence sentences = 5;
+  optional DocumentAnalytics analytics = 6;
+  map<string, string> metadata = 7;
+  repeated EmbeddingResult embeddings = 8;  // denormalized convenience "all vectors + spans"
+  optional DocumentClassification classification = 9;
+
+  // Primary way to carry multiple independent chunk+embedding result groups
+  // from one analysis. Each group corresponds to one chunking strategy with
+  // its explicitly requested embedding models attached inside the chunks.
+  repeated ChunkEmbeddingGroup chunk_embedding_groups = 10;
+}
+
+message AnnotatedSentence {
+  CharSpan sentence_span = 1;
+  repeated Token tokens = 2;
+  repeated NamedEntity entities = 3;
+  // Classic syntactic chunks (from ChunkerME or similar), per sentence.
+  optional ChunkResult syntactic_chunks = 4;
+  optional ParseTree parse_tree = 5;
+  optional string sentiment_label = 6;
+  optional float sentiment_confidence = 7;
+}
+
+message Token {
+  string text = 1;
+  CharSpan char_span = 2;
+  optional string pos_tag = 3;
+  optional string lemma = 4;
+  optional float pos_probability = 5;
+}
+
+message NamedEntity {
+  CharSpan char_span = 1;
+  string entity_type = 2;
+  optional double probability = 3;
+}
+
+message CharSpan {
+  int32 start = 1;
+  int32 end = 2;
+  CoordinateSpace space = 3;
+  optional string type = 4;
+  optional double probability = 5;
+}
+
+enum CoordinateSpace {
+  COORDINATE_SPACE_UNSPECIFIED = 0;
+  CHAR_DOCUMENT = 1;
+  TOKEN_SENTENCE = 2;
+}
+
+message DocumentAnalytics {
+  int32 total_tokens = 1;
+  int32 total_sentences = 2;
+  float noun_density = 3;
+  float verb_density = 4;
+  float adjective_density = 5;
+  float adverb_density = 6;
+  float content_word_ratio = 7;
+  int32 unique_lemma_count = 8;
+  float lexical_density = 9;
+}
+
+// Lightweight result for classic syntactic chunking (ChunkerME style)
+// attached to sentences. Distinct from the strategy-driven chunks below.
+message ChunkResult {
+  repeated ChunkSpan chunks = 1;
+}
+
+message ChunkSpan {
+  CharSpan char_span = 1;
+  string chunk_tag = 2;
+}
+
+// A chunk produced by a chunking strategy. The strategy and the list of
+// embedding models are declared on the containing ChunkEmbeddingGroup.
+// Embeddings for the requested models are carried inside this chunk.
+message Chunk {
+  CharSpan char_span = 1;
+  optional string chunk_tag = 2;
+  // The text of the chunk (for client convenience; authoritative bounds
+  // are given by char_span over the document raw_text).
+  optional string text_content = 3;
+  // Multiple embedding models per chunk, as requested for the group/strategy.
+  repeated EmbeddingResult embeddings = 4;
+  // Optional navigation aid: indices into the document's sentences list.
+  repeated int32 contained_sentence_indices = 5;
+}
+
+// One chunking strategy's output: the chunks (with their embeddings inside)
+// plus traceability for the strategy and the exact embedding models that
+// were asked for this strategy in the request.
+message ChunkEmbeddingGroup {
+  string group_id = 1;
+  optional string chunk_config_id = 2;
+  repeated string embedding_model_ids = 3;  // exactly as named for this strategy
+  optional string result_set_name = 4;
+  repeated Chunk chunks = 5;
+  map<string, string> metadata = 6;
+  optional EmbeddingGranularity granularity = 7;
+}
+
+message ParseTree {
+  ParseNode root = 1;
+}
+
+message ParseNode {
+  string label = 1;
+  CharSpan span = 2;
+  repeated ParseNode children = 3;
+  optional double probability = 4;
+}
+
+message EmbeddingResult {
+  string model_id = 1;
+  repeated float vector = 2;
+  CharSpan source_span = 3;
+  EmbeddingGranularity granularity = 4;
+}
+
+enum EmbeddingGranularity {
+  EMBEDDING_GRANULARITY_UNSPECIFIED = 0;
+  DOCUMENT = 1;
+  SENTENCE = 2;
+  CHUNK = 3;  // embeddings attached to chunks of a strategy/group
+  reserved 4 to 10;
+}
+
+message DocumentClassification {
+  string best_category = 1;
+  map<string, double> category_scores = 2;
+}
diff --git a/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_pipeline.proto b/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_pipeline.proto
new file mode 100644
index 00000000..39a7fef3
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_pipeline.proto
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+import "org/apache/opennlp/grpc/v1/opennlp_document.proto";
+
+// Pipeline steps supported by AnalyzeDocument (and future streaming variants).
+enum PipelineStep {
+  PIPELINE_STEP_UNSPECIFIED = 0;
+  LANGUAGE_DETECT = 1;
+  SENTENCE_DETECT = 2;
+  TOKENIZE = 3;
+  POS_TAG = 4;
+  NER = 5;
+  CHUNK = 6;     // segmentation-style or classic syntactic chunking
+  PARSE = 7;
+  LEMMATIZE = 8;
+  DOC_CATEGORIZE = 9;
+  SENTIMENT = 10;
+  EMBED = 11;
+}
+
+// Configuration for a chunking strategy (used when the caller does not
+// supply a full AnalysisProfile for the entry).
+message ChunkingSpec {
+  // Algorithm: token, sentence, character, semantic (future), etc.
+  string algorithm = 1;           // e.g. "token", "sentence"
+  int32 chunk_size = 2;
+  int32 chunk_overlap = 3;
+  bool clean_text = 4;
+  bool preserve_urls = 5;
+  // For semantic chunking (topic boundaries via embeddings).
+  optional SemanticChunkingConfig semantic_config = 6;
+}
+
+message SemanticChunkingConfig {
+  float similarity_threshold = 1;
+  int32 percentile_threshold = 2;
+  int32 min_chunk_sentences = 3;
+  int32 max_chunk_sentences = 4;
+}
+
+enum POSTagFormat {
+  POS_TAG_FORMAT_UNSPECIFIED = 0;
+  UD = 1;
+  PENN = 2;
+  CUSTOM = 3;
+}
+
+enum InferenceBackend {
+  INFERENCE_BACKEND_UNSPECIFIED = 0;
+  OPENNLP_ME = 1;          // classic *ME
+  ONNX_RUNTIME = 2;
+  ONNX_RUNTIME_GPU = 3;    // CUDA etc. via onnxruntime-gpu (opennlp-dl-gpu)
+  // OpenVINO / DJL / other providers are reserved for separate optional modules.
+  reserved 4 to 9;
+  reserved "OPENVINO", "DJL";
+}
+
+message AnalysisProfile {
+  string profile_id = 1;
+  repeated PipelineStep steps = 2;
+  ModelBundleRef model_bundle = 3;
+  POSTagFormat pos_tag_format = 4;
+  repeated string ner_entity_types = 5;
+}
+
+message ModelBundleRef {
+  string bundle_id = 1;
+  map<string, string> component_keys = 2;
+}
+
+message AnalysisOptions {
+  bool include_probabilities = 1;
+  bool clear_adaptive_data = 2;
+  InferenceBackend inference_backend = 3;
+  optional int32 max_text_length = 4;
+  optional string onnx_embedding_model_id = 5;
+}
+
+message ModelDescriptor {
+  string hash = 1;
+  string name = 2;
+  string locale = 3;
+  string component_type = 4;
+  repeated string languages = 5;
+  repeated PipelineStep supported_steps = 6;
+  map<string, string> attributes = 7;
+}
+
+message ModelBundleInfo {
+  string bundle_id = 1;
+  repeated ModelDescriptor models = 2;
+  repeated string supported_languages = 3;
+  repeated PipelineStep supported_steps = 4;
+}
diff --git a/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_service.proto b/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_service.proto
new file mode 100644
index 00000000..ddf3a066
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_service.proto
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+import "org/apache/opennlp/grpc/v1/opennlp_document.proto";
+import "org/apache/opennlp/grpc/v1/opennlp_pipeline.proto";
+
+service OpenNlpAnalysisService {
+  rpc AnalyzeDocument(AnalyzeDocumentRequest) returns (AnalyzeDocumentResponse);
+  rpc GetServiceInfo(GetServiceInfoRequest) returns (GetServiceInfoResponse);
+  rpc ListModelBundles(ListModelBundlesRequest) returns (ListModelBundlesResponse);
+}
+
+message AnalyzeDocumentRequest {
+  OpenNlpDocument document = 1;
+  // Classic single-profile usage.
+  AnalysisProfile profile = 2;
+  AnalysisOptions options = 3;
+  optional string profile_id = 4;
+
+  // Per-chunking-strategy multi-config. Chunking first; for each strategy
+  // the caller names exactly which embedding_model_ids to attach to its chunks.
+  // The server shares the base NLP analysis (sentences etc.) across all entries.
+  repeated ChunkEmbedConfigEntry chunk_embed_configs = 5;
+}
+
+// One chunking strategy + the concrete embedding models to use for the chunks
+// it produces. Corresponds 1:1 to one ChunkEmbeddingGroup in the reply.
+message ChunkEmbedConfigEntry {
+  string config_id = 1;                 // becomes the group's group_id
+  optional string result_set_name = 2;
+  optional AnalysisProfile profile = 3;
+  optional ChunkingSpec chunking = 4;
+  // The embeddings wanted for *this* chunking strategy's chunks.
+  // Not a blind NxM; explicit per-strategy list.
+  repeated string embedding_model_ids = 5;
+}
+
+message AnalyzeDocumentResponse {
+  OpenNlpDocument document = 1;
+  repeated ProcessingDiagnostic diagnostics = 2;
+}
+
+message ProcessingDiagnostic {
+  PipelineStep step = 1;
+  string message = 2;
+  DiagnosticSeverity severity = 3;
+  optional string component_key = 4;
+}
+
+enum DiagnosticSeverity {
+  DIAGNOSTIC_SEVERITY_UNSPECIFIED = 0;
+  INFO = 1;
+  WARNING = 2;
+  ERROR = 3;
+}
+
+message GetServiceInfoRequest {}
+
+message GetServiceInfoResponse {
+  string opennlp_version = 1;
+  string api_version = 2;
+  repeated string available_profile_ids = 3;
+  repeated PipelineStep supported_steps = 4;
+}
+
+message ListModelBundlesRequest {}
+
+message ListModelBundlesResponse {
+  repeated ModelBundleInfo bundles = 1;
+}
diff --git a/opennlp-similarity/src/main/java/opennlp/tools/jsmlearning/TreeKernelRunner.java b/opennlp-similarity/src/main/java/opennlp/tools/jsmlearning/TreeKernelRunner.java
index 70ec586d..caf817bc 100644
--- a/opennlp-similarity/src/main/java/opennlp/tools/jsmlearning/TreeKernelRunner.java
+++ b/opennlp-similarity/src/main/java/opennlp/tools/jsmlearning/TreeKernelRunner.java
@@ -122,7 +122,7 @@ svm_learn -t 5 -D 0 learning_file model_file - Ð´Ñ€ÑƒÐ³Ð¾Ð¹ 
Ð²Ð
 
 2. svm_classify.exe Ð±ÐµÑ€ÐµÑ‚ Ñ„Ð°Ð¹Ð» Ñ� Ñ‚ÐµÑ�Ñ‚Ð¾Ð²Ñ‹Ð¼Ð¸ 
Ð¿Ñ€Ð¸Ð¼ÐµÑ€Ð°Ð¼Ð¸, Ñ„Ð°Ð¹Ð» Ñ� Ð¼Ð¾Ð´ÐµÐ»ÑŒÑŽ, Ð¿Ð¾Ñ�Ñ‚Ñ€Ð¾ÐµÐ½Ð½Ñ‹Ð¹ 
svm_learn, Ð¸ Ð·Ð°Ð¿Ð¸Ñ�Ñ‹Ð²Ð°ÐµÑ‚ Ñ€ÐµÐ·ÑƒÐ»ÑŒÑ‚Ð°Ñ‚Ñ‹ Ð¾Ð±ÑƒÑ‡ÐµÐ½Ð¸Ñ� Ð² 
Ñ„Ð°Ð¹Ð» predictions_file.
 
-Ð—Ð°Ð¿ÑƒÑ�Ðº:     svm_classify example_file model_file predictions_file
+Ð-Ð°Ð¿ÑƒÑ�Ðº:     svm_classify example_file model_file predictions_file
 
 Ð¤Ð°Ð¹Ð» Ð¸Ð¼ÐµÐµÑ‚ Ñ‚Ð¾Ñ‚ Ð¶Ðµ Ñ„Ð¾Ñ€Ð¼Ð°Ñ‚, Ñ‡Ñ‚Ð¾ Ð¸ Ð²Ñ…Ð¾Ð´Ð½Ñ‹Ðµ 
Ð¿Ñ€Ð¸Ð¼ÐµÑ€Ñ‹. ÐžÐ±Ñ€Ð°Ð·ÐµÑ† Ð»ÐµÐ¶Ð¸Ñ‚ Ð² Ð°Ñ€Ñ…Ð¸Ð²Ðµ Ð½Ð° 
Ñ�Ñ‚Ñ€Ð°Ð½Ð¸Ñ‡ÐºÐµ ÐœÐ¾Ñ�ÐºÐ¸Ñ‚Ñ‚Ð¸. 
 ÐœÐ¾Ð¶Ð½Ð¾ Ñ�Ñ€Ð°Ð·Ñƒ Ð¶Ðµ ÑƒÐºÐ°Ð·Ñ‹Ð²Ð°Ñ‚ÑŒ, Ðº ÐºÐ°ÐºÐ¾Ð¼Ñƒ ÐºÐ»Ð°Ñ�Ñ�Ñƒ 
Ð¾Ñ‚Ð½Ð¾Ñ�Ð¸Ñ‚Ñ�Ñ� Ð¿Ñ€Ð¸Ð¼ÐµÑ€ (1 Ð¸Ð»Ð¸ -1 Ð² Ð½Ð°Ñ‡Ð°Ð»Ðµ Ñ�Ñ‚Ñ€Ð¾ÐºÐ¸). Ð’ 
Ñ�Ñ‚Ð¾Ð¼ Ñ�Ð»ÑƒÑ‡Ð°Ðµ Ñ‚Ð¾Ñ‡Ð½Ð¾Ñ�Ñ‚ÑŒ Ð¸ Ð¿Ð¾Ð»Ð½Ð¾Ñ‚Ð° Ð¾Ñ†ÐµÐ½Ð¸Ð²Ð°ÑŽÑ‚Ñ�Ñ� 
Ð°Ð²Ñ‚Ð¾Ð¼Ð°Ñ‚Ð¸Ñ‡ÐµÑ�ÐºÐ¸. Ð˜Ð»Ð¸ Ñ�Ñ‚Ð°Ð²Ð¸Ñ‚ÑŒ Ñ‚Ð°Ð¼ 0.
diff --git a/opennlp-similarity/src/test/resources/sentence_parseObject.csv b/opennlp-similarity/src/test/resources/sentence_parseObject.csv
index c11ec1d1..e1f4622e 100644
--- a/opennlp-similarity/src/test/resources/sentence_parseObject.csv
+++ b/opennlp-similarity/src/test/resources/sentence_parseObject.csv
@@ -254,7 +254,7 @@
 "B-NP","B-VP","I-VP","O"
 "NNP","VBD","VBG","CC"
 "Albert","began","reading","and"
-"The Patriot Post — IRS Target of .  the day before the first Sensitive Case Reports on conservative groups were .  Obama said that if not for .  ."
+"The Patriot Post - IRS Target of .  the day before the first Sensitive Case Reports on conservative groups were .  Obama said that if not for .  ."
 "B-NP","I-NP","I-NP","I-NP","I-NP","B-PP","B-NP","I-NP","B-PP","B-NP","I-NP","I-NP","I-NP","I-NP","B-PP","B-NP","I-NP","B-VP","B-NP","B-VP","B-SBAR","B-PP","B-NP","I-NP"
 "DT","NNP","NNP","NNP","NNP","IN","DT","NN","IN","DT","JJ","NN","NN","NNS","IN","JJ","NNS","VBD","NNP","VBD","IN","IN","RB","IN"
 "The","Patriot","Post","IRS","Target","of","the","day","before","the","first","Sensitive","Case","Reports","on","conservative","groups","were","Obama","said","that","if","not","for"
@@ -630,7 +630,7 @@
 "B-NP","I-NP","B-VP","B-NP","B-VP","I-VP","B-PRT","B-PP","B-NP","I-NP","I-NP","B-NP","I-NP","B-NP","I-NP","B-PP","B-NP","I-NP","I-NP","B-VP","I-VP","I-VP","I-VP","B-NP","I-NP","B-PP","B-NP","I-NP","B-VP","B-PP","B-NP","B-PP","B-NP","I-NP","I-NP"
 "NNP","NNS","VBP","PRP","VBP","VBN","IN","IN","DT","JJ","NN","DT","NN","DT","NN","IN","DT","JJ","NNS","VBN","TO","VB","VB","JJ","NNS","IN","DT","NN","VBZ","IN","NN","IN","VBN","JJ","NN"
 "WASHINGTON","Lawmakers","say","they","re","outraged","that","for","the","second","time","this","month","a","member","of","the","armed","forces","assigned","to","help","prevent","sexual","assaults","in","the","military","is","under","investigation","for","alleged","sexual","misconduct"
-"Albert Einstein, (born March 14, 1879, Ulm, Württemberg, Germany—died April 18, 1955, Princeton, New Jersey, U.S "
+"Albert Einstein, (born March 14, 1879, Ulm, Württemberg, Germany-died April 18, 1955, Princeton, New Jersey, U.S "
 "B-NP","I-NP","B-VP","B-NP","I-NP","I-NP","I-NP","I-NP","I-NP","B-VP","B-NP","I-NP","I-NP","I-NP","I-NP","I-NP","I-NP","I-NP"
 "NNP","NNP","VBN","NNP","CD","CD","NNP","NNP","NNP","VBD","NNP","CD","CD","NNP","NNP","NNP","NNP","NNP"
 "Albert","Einstein","born","March","14","1879","Ulm","Württemberg","Germany","died","April","18","1955","Princeton","New","Jersey","U","S"
@@ -858,7 +858,7 @@
 "B-NP","B-PP","I-NP","I-NP","I-NP","B-PP","B-NP","B-VP","B-NP","B-ADJP","B-NP","I-NP","B-NP","I-NP","B-PP","B-NP","B-VP","B-NP","B-VP","B-NP","B-PP","B-NP","I-NP","I-NP","B-PP","B-NP","I-NP","I-NP","I-NP","I-NP","I-NP","I-NP","I-NP","I-NP","B-VP","B-NP","I-NP","I-NP","I-NP","I-NP","I-NP","I-NP","B-SBAR","B-NP","B-VP","B-NP","I-NP","I-NP","I-NP","O","B-NP","B-VP","B-NP","I-NP","I-NP","I-NP","I-NP","I-NP","O"
 "PRP","CD","CD","NNP","NN","IN","NNS","NNS","NNS","JJ","CD","NNS","CD","NNS","IN","EX","VBP","JJ","VBP","CD","IN","CD","JJ","NNS","IN","NN","NNS","NNS","NN","VBG","NNS","FW","NNP","NNP","VBD","DT","JJ","NN","NN","NNP","NNP","NNP","IN","PRP","VBP","JJ","NNP","NN","NNS","CC","NNP","VBD","NNS","NNP","NNP","NNP","NNP","NNP","CD"
 "Item","361","380","Profile","picture","of","djones","djones","djones","active","6","months","3","weeks","ago","There","are","many","buy","one","get","one","free","offers","for","area","restaurants","museums","zoo","sporting","events","etc","Vicki","Todd","wrote","a","new","blog","post","PLEASE","REMEMBER","NOTE","although","we","collect","Swiss","Valley","milk","caps","and","Campbell","s","djones","Rock","Island","Milan","School","District","41"
-"Mark Alexander: Obama's 'IRS Enemies List' — Updated ..."
+"Mark Alexander: Obama's 'IRS Enemies List' - Updated ..."
 "B-NP","I-NP","I-NP","B-VP","B-NP","I-NP","B-VP","I-VP"
 "NN","NN","NN","VBZ","JJ","NNS","NN","VBN"
 "Mark","Alexander","Obama","s","IRS","Enemies","List","Updated"

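For orientation, an AnalyzeDocumentRequest built from the messages added in opennlp_service.proto might look like the sketch below, written in protobuf text format. It is illustrative only: the field names come from the .proto definitions in this diff, but every id, model name, and numeric value is hypothetical, and whether a server accepts these particular algorithm strings or step combinations is an assumption.

```textproto
# Hypothetical AnalyzeDocumentRequest; all ids and values are made up.
document {
  doc_id: "doc-1"
  raw_text: "Albert began reading."
}
profile {
  profile_id: "example-profile"   # hypothetical profile name
  steps: SENTENCE_DETECT
  steps: TOKENIZE
}
options {
  include_probabilities: true
}
# First chunking strategy: sentence-based chunks, two embedding models.
chunk_embed_configs {
  config_id: "sentence-chunks"
  chunking {
    algorithm: "sentence"
    chunk_size: 3
    chunk_overlap: 1
  }
  embedding_model_ids: "example-model-a"   # hypothetical model id
  embedding_model_ids: "example-model-b"   # hypothetical model id
}
# Second chunking strategy: token-based chunks, one embedding model.
chunk_embed_configs {
  config_id: "token-chunks"
  chunking {
    algorithm: "token"
    chunk_size: 128
    chunk_overlap: 16
  }
  embedding_model_ids: "example-model-a"
}
```

Going by the comments on ChunkEmbedConfigEntry (one entry corresponds 1:1 to one ChunkEmbeddingGroup, and config_id becomes the group's group_id), a reply to this request would be expected to carry two chunk_embedding_groups, with group_id values "sentence-chunks" and "token-chunks".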