This is an automated email from the ASF dual-hosted git repository.
krickert pushed a commit to branch OPENNLP-1833-grpc-expansion
in repository https://gitbox.apache.org/repos/asf/opennlp-sandbox.git
The following commit(s) were added to refs/heads/OPENNLP-1833-grpc-expansion by
this push:
new ee9b8750 initial design doc and merged from main
ee9b8750 is described below
commit ee9b8750bb02597eccd707bb5ff481d71dac7b5f
Author: Kristian Rickert <[email protected]>
AuthorDate: Sat Jun 6 17:13:39 2026 -0400
initial design doc and merged from main
---
opennlp-grpc/docs/rfc/opennlp-grpc-design.md | 993 +++++++++++++++++++++
.../docs/rfc/opennlp-grpc-jira-proposal.md | 233 +++++
.../apache/opennlp/grpc/v1/opennlp_document.proto | 165 ++++
.../apache/opennlp/grpc/v1/opennlp_pipeline.proto | 117 +++
.../apache/opennlp/grpc/v1/opennlp_service.proto | 92 ++
.../tools/jsmlearning/TreeKernelRunner.java | 2 +-
.../src/test/resources/sentence_parseObject.csv | 6 +-
7 files changed, 1604 insertions(+), 4 deletions(-)
diff --git a/opennlp-grpc/docs/rfc/opennlp-grpc-design.md b/opennlp-grpc/docs/rfc/opennlp-grpc-design.md
new file mode 100644
index 00000000..c9a8d52c
--- /dev/null
+++ b/opennlp-grpc/docs/rfc/opennlp-grpc-design.md
@@ -0,0 +1,993 @@
+# OpenNLP gRPC API - Design Document (Phase 1)
+
+## Summary
+
+OpenNLP is a mature JVM library. Teams load models, run tokenizers and taggers, extract entities, and, increasingly, generate embeddings, all in-process inside a Java application. That model still makes sense for many use cases. But a lot of modern stacks do not look like that: Python data pipelines, Go or Rust microservices, search platforms that want annotated text with chunks and vectors in one pass, and deployments where GPU-backed inference belongs on a shared service rather than in [...]
+
+The sandbox gRPC proof of concept showed that exposing OpenNLP over the network works. This RFC is the next step: evolve that POC into a **document-centric, language-neutral** API. You send text (or a partially analyzed document) and get back a single, structured result: sentences and tokens, named entities, optional syntactic chunks, multiple segmentation strategies each with their own embeddings, and diagnostics when something optional did not quite land. The core library stays free of [...]
+
+### Why we're doing this
+
+The legacy sandbox exposed separate RPCs per tool: tokenize here, tag there, find entities somewhere else. That is faithful to how the Java API is organized, but it pushes orchestration onto every client. Real workflows need the full picture: linguistic structure, retrieval-oriented chunks, and vectors, produced in the right order without the caller wiring six calls together.
+
+We also hear clearly from the community that **chunking and embeddings belong in v1**, not as a later add-on. Search, hybrid retrieval, and RAG-style indexing all want "give me this document, chunked and embedded, ready to index", often with more than one chunking strategy in the same run so you can compare sentence-level vs. fixed-window approaches without paying for tokenization and NER three times over.
+
+Finally, OpenNLP should not require every downstream system to host the JVM. A strongly typed binary protocol (protobuf over gRPC) is how many services already talk to each other. Meeting that expectation lowers the friction for polyglot teams and for platforms that already standardize on gRPC for internal APIs.
+
+### What it can unlock
+
+**gRPC-native integration.** Systems that already speak gRPC get a first-class
way to call OpenNLP: discover what models and profiles the server offers,
submit a document, receive a typed result. No one-off REST schemas, no ad-hoc
JSON field naming, no JNI shim in every language binding.
+
+**Polyglot document enrichment.** A Python ingestion job, a Go API layer, and a Rust indexer can all send the same document shape and receive the same annotated structure back. That makes cross-language pipelines easier to build, test, and operate: you are not maintaining parallel "how we call OpenNLP" stories in every repo.
+
+**Streaming and incremental results.** Long documents and live text feeds should not block on one monolithic response at the end. The contract is shaped so analysis can stream partial results as they are ready (sentences as they are detected, chunks and embeddings as groups complete) rather than forcing the client to wait for the entire pipeline to finish.
+
+**Shared NLP infrastructure.** One well-provisioned OpenNLP server, with GPUs when embedding workloads warrant it, can serve many lightweight clients. Model loading, versioning, and heavy inference concentrate where the hardware is, instead of duplicating JVMs and model bundles across every service.
+
+**Search, RAG, and semantic indexing in one shot.** Multiple chunk-and-embed
configurations in a single analysis run means a single ingestion path can feed
a sentence-level index, a fixed-window RAG store, and an experimental strategy
side by side. The linguistic backbone is computed once; each strategy gets its
own group of chunks with embeddings carried inside them.
+
+**Two-way flexibility for JVM teams.** Non-JVM clients call the server over the network. Java applications can do the same when they want to offload heavy steps, or keep using a pure-Java processor in-process when that is simpler. Same conceptual document model either way, with a path later toward a small gRPC-free core type in opennlp-api.
+
+Phase 1 is agreement on this contract: the protos and the design captured here. Implementation in the sandbox and graduation toward an Apache OpenNLP release follow once the community is comfortable with the shape.
+
+## Design
+
+**Canonical location:** Living design doc for
[OPENNLP-1833](https://issues.apache.org/jira/browse/OPENNLP-1833). Active work
happens in **opennlp-sandbox** on branch `OPENNLP-1833-grpc-expansion`. Proto
sources: `opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/`.
+
+
+| Field | Value |
+| -------------------- | ------------------------------------------------ |
+| **Status** | Draft RFC |
+| **Version** | 0.5 |
+| **API version** | `v1` |
+| **OpenNLP baseline** | 3.0.0-SNAPSHOT (JDK 21+) |
+| **Companion** | [JIRA proposal](./opennlp-grpc-jira-proposal.md) |
+
+
+---
+
+# Part I - Overview
+
+## Where we are
+
+This RFC evolves the sandbox gRPC POC into a **document-centric,
language-neutral** contract: one `AnalyzeDocument` RPC that takes `raw_text` in
and returns an enriched `OpenNlpDocument` - sentences, tokens, entities,
optional chunk groups with embeddings, diagnostics, and more.
+
+**What exists today (on the sandbox branch):**
+
+
+| Artifact | Status |
+| --- | --- |
+| Legacy POC (`opennlp.proto`, 3 granular services, `OpenNLPServer`) | Committed, unchanged |
+| v1 protos (`opennlp_document.proto`, `opennlp_pipeline.proto`, `opennlp_service.proto`) | Written, not yet wired to build |
+| This RFC + JIRA companion | Written |
+| v1 Java processor / server / codegen | Not started |
+| Core `opennlp-api` `Document` interface | Proposed for later |
+
+
+**Phase 1 deliverable:** design consensus + stable wire contract. **Phase 2:**
implementation in the sandbox, then graduation to `apache/opennlp` (target
~3.1.x per community feedback).
+
+## Target architecture
+
+One call. Shared NLP backbone computed once. Multiple chunking strategies,
each with explicitly named embedding models. Embeddings live **inside** each
chunk in the reply.
+
+```mermaid
+flowchart TB
+ subgraph request [AnalyzeDocumentRequest]
+ DOC[OpenNlpDocument raw_text]
+ BASE[profile / profile_id - optional]
+ MULTI[chunk_embed_configs]
+ MULTI --> E1["strategy A + embedding_model_ids"]
+ MULTI --> E2["strategy B + embedding_model_ids"]
+ end
+
+ subgraph server [Server - Phase 2]
+ PROC[Pure-Java processor]
+ PROC --> SHARED["Shared NLP once: sentences, tokens, NER, ..."]
+ PROC --> G1["Group A: chunks with embeddings inside"]
+ PROC --> G2["Group B: chunks with embeddings inside"]
+ end
+
+ subgraph response [OpenNlpDocument reply]
+ SENT[AnnotatedSentence backbone]
+ GRP[chunk_embedding_groups]
+ GRP --> C1["Chunk: text_content + embeddings"]
+ end
+
+ request --> server
+ server --> response
+```
+
+
+
+**Request rule:** chunking strategy first. Per strategy, the caller names which `embedding_model_ids` to apply to that strategy's chunks. The server does **not** expand strategies and models into an automatic N×M cartesian product: a model runs against a strategy's chunks only when the caller lists it in that strategy's entry.
+
+**Reply rule:** `ChunkEmbeddingGroup` holds the chunks for one strategy. Each
`Chunk` repeats `text_content` for convenience and carries `repeated
EmbeddingResult embeddings` inside it. The shared `AnnotatedSentence` list is
the linguistic backbone computed once.
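+
The request and reply rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ConfigEntry` stands in for `ChunkEmbedConfigEntry`, the fixed-width `window` strategy and the `fake_embed` function are hypothetical placeholders, and nothing here reflects the eventual server implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ConfigEntry:
    """Hypothetical stand-in for one ChunkEmbedConfigEntry in the request."""
    config_id: str
    window: int                  # assumed strategy: fixed-width character windows
    embedding_model_ids: list    # models named for THIS strategy only

@dataclass
class Chunk:
    start: int                   # half-open [start, end) into raw_text
    end: int
    text_content: str
    embeddings: dict = field(default_factory=dict)  # model_id -> vector

def fake_embed(model_id, text):
    # Placeholder vector, not a real model: two deterministic numbers.
    return [float(len(model_id)), float(len(text))]

def analyze(raw_text, configs):
    """Build one group of chunks per strategy; embed only with the models it names."""
    groups = {}
    for cfg in configs:
        chunks = []
        for start in range(0, len(raw_text), cfg.window):
            end = min(start + cfg.window, len(raw_text))
            chunk = Chunk(start, end, raw_text[start:end])
            for model_id in cfg.embedding_model_ids:
                chunk.embeddings[model_id] = fake_embed(model_id, chunk.text_content)
            chunks.append(chunk)
        groups[cfg.config_id] = chunks
    return groups

text = "OpenNLP over gRPC."
configs = [
    ConfigEntry("small", 6, ["minilm"]),
    ConfigEntry("large", 9, ["minilm", "e5"]),
]
groups = analyze(text, configs)
```

In this sketch the `small` group is embedded with one model and the `large` group with two, because each entry lists its own models; embeddings sit inside each chunk rather than in a flat per-model list.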
+
+## Key design decisions
+
+
+| Topic | Decision |
+| ------------------------ | --- |
+| **Primary RPC** | `OpenNlpAnalysisService.AnalyzeDocument` |
+| **Package** | `org.apache.opennlp.grpc.v1` |
+| **Legacy services** | Retain under `org.apache.opennlp.grpc.legacy.v1` during transition |
+| **Core library** | Stays gRPC-free; wire API in optional Maven modules |
+| **Build** | Maven + `protobuf-maven-plugin` only |
+| **CHUNK + EMBED** | First-class v1 steps (`ChunkerME`, `SentenceVectorsDL`, segmentation chunking) |
+| **GPU / providers** | Hot-swappable via `InferenceBackend` + provider SPI; CUDA/OpenVINO as optional modules |
+| **Multi-group** | `repeated ChunkEmbedConfigEntry` in request → `repeated ChunkEmbeddingGroup` in reply |
+| **Embeddings placement** | Inside `Chunk`, not a separate flat list per model |
+| **Partial failures** | Required steps fail the RPC; optional steps return best-effort + diagnostics |
+| **Stateless contract** | One document per RPC; `clear_adaptive_data` controls NER adaptive state only |
+
+
+## Community consensus (dev@, May–June 2026)
+
+Feedback from Martin Wiesner, Richard Zowalla, and others on OPENNLP-1833:
+
+- **Sandbox-first** - iterate here, graduate to main after review; no rush for
3.0.0.
+- **Neutral core `Document` interface** - small gRPC-free type in
`opennlp-api` later; `OpenNlpDocument` is the wire form.
+- **Embeddings and chunking in v1** - not deferred; GPU acceleration via
optional provider modules.
+- **Discovery** - `ListModelBundles` + `GetServiceInfo` must expose enough
metadata to choose bundles/profiles.
+- **Two-way usage** - other languages call the server; JVM code can call the
server via stubs for heavy steps; a pure-Java processor underneath supports
in-process use too.
+
+## What comes next
+
+1. Community review of this RFC + v1 protos.
+2. Wire `protobuf-maven-plugin`, generate stubs.
+3. Pure-Java processor: shared NLP once → per-strategy groups → embeddings
inside chunks.
+4. Minimal `AnalyzeDocument` server implementation.
+5. Propose core `Document` / `AnalyzedDocument` API in `opennlp-api`.
+
+---
+
+# Part II - Specification
+
+## 1. Goals
+
+1. Define a **language-neutral, document-centric** gRPC contract for Apache
OpenNLP inference.
+2. Enable **cross-platform clients** (Python, Go, Rust, etc.) without JNI or
embedding the JVM in every service.
+3. Support a **single-call pipeline** (`AnalyzeDocument`) that replaces
client-side chaining of granular RPCs.
+4. Preserve a clean separation: **core library stays gRPC-free**; wire API
lives in optional Maven modules.
+5. Include **CHUNK and EMBED as first-class v1 steps** (using existing OpenNLP
`ChunkerME` and `SentenceVectorsDL` from opennlp-dl for ONNX embeddings).
Advanced GPU acceleration (CUDA via onnxruntime-gpu, OpenVINO for Intel) and
hot-swappable provider implementations live behind a narrow middle interface /
provider SPI; these can be delivered in separate optional modules/builds
without changing the wire contract or core processor.
+
+## 2. Non-goals
+
+See JIRA proposal. Additionally for Phase 1 design only: no server
implementation, no deployment guide, no performance SLAs.
+
+## 3. Background
+
+### 3.1 Main repository
+
+- Maven multi-module library; public API in `opennlp-api`, engines in
`opennlp-runtime`.
+- No `.proto` files or gRPC dependencies on `main`.
+- NLP tasks map to Java interfaces (`Tokenizer`, `SentenceDetector`,
`POSTagger`, `TokenNameFinder`, etc.).
+
+### 3.2 Sandbox POC
+
+Location: [https://github.com/apache/opennlp-sandbox/tree/main/opennlp-grpc](https://github.com/apache/opennlp-sandbox/tree/main/opennlp-grpc)
+
+Modules:
+
+- `opennlp-grpc-api` - `opennlp.proto`, generated stubs
+- `opennlp-grpc-service` - server, per-tool services, directory/JAR model
scanning
+- `examples` - Python client
+
+Limitations motivating this redesign:
+
+- Per-tool services and string payloads
+- No document-level result aggregation
+- No pipeline profile or step diagnostics
+- `package opennlp` / `java_outer_classname` bundling (discouraged for
multi-file generation)
+
+### 3.3 UIMA reference pipeline
+
+`OpenNlpTextAnalyzer.xml` delegates: SentenceDetector → Tokenizer →
NameFinders → PosTagger → Chunker → Parser.
+
+The gRPC server orchestrator should mirror this **order** when steps are
enabled in `AnalysisProfile`.
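+
A minimal sketch of how an orchestrator could impose that order on whatever steps a profile enables. The step names are assumptions modeled on the ones used elsewhere in this document (`SENTENCE_DETECT`, `TOKENIZE`, `POS_TAG`, and so on); the list and the `order_steps` helper are hypothetical, not part of the proposed API.

```python
# Canonical order taken from the UIMA delegate chain described above:
# SentenceDetector -> Tokenizer -> NameFinders -> PosTagger -> Chunker -> Parser.
CANONICAL_ORDER = ["SENTENCE_DETECT", "TOKENIZE", "NER", "POS_TAG", "CHUNK", "PARSE"]

def order_steps(enabled):
    """Return the enabled steps sorted into the canonical pipeline order."""
    enabled = set(enabled)
    unknown = enabled - set(CANONICAL_ORDER)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    return [step for step in CANONICAL_ORDER if step in enabled]

ordered = order_steps(["POS_TAG", "SENTENCE_DETECT", "TOKENIZE"])
```

The caller can list steps in any order; only the enabled subset is run, always in the same sequence.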
+
+### 3.4 Deep learning / GPU (v1 + provider evolution)
+
+- `opennlp-dl`: ONNX Runtime support including `SentenceVectorsDL` for
embeddings, plus `NameFinderDL` and `DocumentCategorizerDL`. These are the
foundation for the v1 `EMBED` step (and future DL-backed NER/categorization).
+- `opennlp-dl-gpu`: swaps the CPU onnxruntime for `onnxruntime_gpu` (CUDA on
NVIDIA). This is one of the concrete provider implementations behind the
hot-swap story.
+- A narrow provider SPI / middle interface (behind `InferenceBackend` and
per-component selection in profiles/options) allows the pure-Java processor
(and thus the gRPC server) to dispatch `EMBED` (and later other steps) to
different backends. Concrete providers for CUDA, a future OpenVINO backend
(Intel GPU/accelerators), DJL, or even remote endpoints (KServe v2 or another
OpenNLP gRPC instance) can live in separate optional modules with their own
build artifacts and dependencies. The b [...]
+- CHUNK and EMBED (with basic ONNX) are in-scope for the initial v1 contract
and sandbox implementation. Advanced acceleration and additional providers are
implementation work that does not require wire changes.
+
+The initiating email for OPENNLP-1833 emphasizes GPU embeddings (CUDA for
NVIDIA, OpenVINO for Intel) with a hot-swappable middle interface whose
implementations are their own builds. This design makes that explicit via the
provider mechanism while keeping the `OpenNlpDocument` / `AnalyzeDocument`
contract stable.
+
+---
+
+## 4. Architecture
+
+### 4.1 Module layout (implementation phases 2+)
+
+```
+apache/opennlp/
+├── opennlp-api/ # unchanged
+├── opennlp-runtime/ # unchanged
+├── opennlp-grpc-api/ # NEW: protos + generated code
+├── opennlp-grpc-server/ # NEW: Netty/shaded server, orchestrator
+└── opennlp-grpc-examples/ # NEW: optional samples
+```
+
+Dependency rule: `opennlp-grpc-server` → `opennlp-grpc-api`,
`opennlp-runtime`, `opennlp-model-resolver`; optional `opennlp-dl-gpu`.
+
+### 4.2 Three-layer proto model
+
+```mermaid
+flowchart TB
+ subgraph L3 [Layer 3 - Service]
+ SVC[OpenNlpAnalysisService]
+ end
+ subgraph L2 [Layer 2 - Pipeline]
+ PROF[AnalysisProfile]
+ OPT[AnalysisOptions]
+ BUNDLE[ModelBundleRef]
+ end
+ subgraph L1 [Layer 1 - Domain]
+ DOC[OpenNlpDocument]
+ end
+ SVC --> PROF
+ SVC --> OPT
+ PROF --> BUNDLE
+ PROF --> DOC
+ SVC --> DOC
+```
+
+
+
+
+| Layer | File (proposed) | Responsibility |
+| ----- | ------------------------ | -------------------------------------------------------- |
+| 1 | `opennlp_document.proto` | Document, spans, tokens, entities, analytics, embeddings |
+| 2 | `opennlp_pipeline.proto` | Profiles, steps, model refs, options, backends |
+| 3 | `opennlp_service.proto` | gRPC services and request/response envelopes |
+
+
+All files share `package org.apache.opennlp.grpc.v1`.
+
+### 4.3 Runtime flow
+
+```mermaid
+sequenceDiagram
+ participant C as Client
+ participant S as GrpcServer
+ participant M as ModelBundleCache
+ participant N as OpenNlpRuntime
+
+ C->>S: AnalyzeDocument(doc, profile, options)
+ S->>S: Validate doc_id, raw_text
+ S->>M: Resolve ModelBundleRef
+ alt LANGUAGE_DETECT in profile
+ S->>N: LanguageDetectorME
+ end
+ S->>N: SentenceDetectorME(raw_text)
+ loop Each sentence span
+ S->>N: TokenizerME
+ S->>N: POSTaggerME
+ opt NER in profile
+ S->>N: NameFinderME per model type
+ end
+ opt CHUNK in profile
+ S->>N: ChunkerME
+ end
+ end
+ S->>S: CharSpanMapper to OpenNlpDocument
+ S-->>C: AnalyzeDocumentResponse
+```
+
+
+
+---
+
+## 5. Offset and span contract
+
+OpenNLP Java APIs mix coordinate systems:
+
+
+| API | Span reference |
+| -------------------------------- | --------------------------------- |
+| `Tokenizer.tokenizePos` | Character offsets in input string |
+| `SentenceDetector.sentPosDetect` | Character offsets in document |
+| `TokenNameFinder.find(String[])` | **Token indices** in sentence |
+| `DocumentNameFinder` | Per-sentence token indices |
+
+
+**Wire contract (mandatory for v1):**
+
+- Every `CharSpan` in `OpenNlpDocument` and in RPC responses MUST use
`CoordinateSpace.CHAR_DOCUMENT` unless explicitly documented otherwise.
+- Offsets are **half-open** `[start, end)` into `raw_text`, matching
`opennlp.tools.util.Span`.
+- The server is solely responsible for converting token-index spans from
`NameFinderME` to character spans before returning.
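+
The conversion the server owns can be illustrated with a short sketch. It assumes token character offsets are relative to the sentence string (as when a tokenizer is run per sentence) and that the name finder returns a half-open token-index span; `token_span_to_char_span` is a hypothetical helper name, not an OpenNLP API.

```python
def token_span_to_char_span(sentence_start, token_offsets, tok_start, tok_end):
    """Map a half-open token-index span to a half-open document character span.

    sentence_start: character offset of the sentence inside raw_text.
    token_offsets:  list of (start, end) character offsets relative to the sentence.
    tok_start, tok_end: half-open token indices [tok_start, tok_end).
    """
    if not 0 <= tok_start < tok_end <= len(token_offsets):
        raise ValueError("token span out of range")
    first_start = token_offsets[tok_start][0]
    last_end = token_offsets[tok_end - 1][1]
    return (sentence_start + first_start, sentence_start + last_end)

raw_text = "Hi. John Smith lives in Boston."
sentence_start = 4  # the second sentence begins at character 4
# Offsets of: John, Smith, lives, in, Boston, "."
token_offsets = [(0, 4), (5, 10), (11, 16), (17, 19), (20, 26), (26, 27)]

# A name finder reporting tokens [0, 2) as a person maps to characters [4, 14).
span = token_span_to_char_span(sentence_start, token_offsets, 0, 2)
```

Slicing `raw_text[span[0]:span[1]]` yields `John Smith`. One caveat for non-JVM clients: Java `String` offsets count UTF-16 code units while Python indexes code points, so the two agree only for text inside the Basic Multilingual Plane.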
+
+---
+
+## 6. Model lifecycle
+
+### 6.1 Server-side models
+
+- Classic models: Java-serialized `.bin` in ZIP/JAR (unchanged).
+- Models are **never** sent inline in `AnalyzeDocumentRequest`.
+- Server loads from configurable directory/classpath (port sandbox
`model.location`, wildcards).
+
+### 6.2 ModelBundleRef and discovery
+
+`ModelBundleRef` is a compact logical handle used in requests:
+
+```protobuf
+message ModelBundleRef {
+ string bundle_id = 1;
+ map<string, string> component_keys = 2;
+}
+```
+
+Example `component_keys`: `tokenizer`, `sentence_detector`, `pos`,
`ner_person`, `ner_org`, `embed_minilm`, `langdetect`.
+
+Server config (or a model resolver) maps `bundle_id` → concrete
artifacts/paths. Clients can send only `bundle_id` when using server-defined
profiles.
+
+**Discovery (addresses community feedback)**: A bare `bundle_id` is not
sufficient for clients to explore what is available. The service therefore
exposes:
+
+- `GetServiceInfo` → high-level `available_profile_ids` and `supported_steps`.
+- `ListModelBundles` → `ListModelBundlesResponse` containing `ModelBundleInfo`
entries.
+
+`ModelBundleInfo` / `ModelDescriptor` (see full proto in 11.2–11.3) are
intended to carry enough metadata for real client discovery:
+
+- `locale` / language.
+- Component types present (e.g. "sentence_detector", "embed").
+- Supported or typical `PipelineStep` values this bundle is intended to serve.
+- Optional free-form capabilities or tags.
+
+Implementations should populate these fields so that a client can list
bundles, filter by language or capability (e.g. "has an embed component"), and
then pick a `bundle_id` or `profile_id`. The exact richness of the descriptors
can grow over time without breaking v1 clients (additive fields only).
+
+In the sandbox implementation we will start with what the existing
`ModelFinderUtil` + directory scanning can provide and extend it for ONNX
embedding artifacts (model + vocab pairs) as first-class bundle components.
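+
A client-side sketch of the discovery flow described above: list bundles, filter by language or capability, then pick a `bundle_id`. The dictionaries below are hypothetical stand-ins for `ModelBundleInfo` entries; their keys (`locale`, `components`) and the bundle names are assumptions for illustration, not the proto field names.

```python
# Hypothetical result of a ListModelBundles call, flattened to plain dicts.
bundles = [
    {"bundle_id": "en-default", "locale": "en",
     "components": ["sentence_detector", "tokenizer", "pos"]},
    {"bundle_id": "en-search", "locale": "en",
     "components": ["sentence_detector", "tokenizer", "embed"]},
    {"bundle_id": "de-default", "locale": "de",
     "components": ["sentence_detector", "tokenizer"]},
]

def find_bundles(bundles, locale=None, component=None):
    """Return bundle_ids matching the optional locale and component filters."""
    matches = []
    for bundle in bundles:
        if locale is not None and bundle["locale"] != locale:
            continue
        if component is not None and component not in bundle["components"]:
            continue
        matches.append(bundle["bundle_id"])
    return matches

english_with_embed = find_bundles(bundles, locale="en", component="embed")
```

Because descriptors may only grow additively, a filter like this keeps working when servers start populating extra fields.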
+
+### 6.3 Profiles
+
+Predefined profiles in server config (e.g. `en-basic`, `en-ner`):
+
+```ini
+profile.en-basic.bundle_id=en-default
+profile.en-basic.steps=SENTENCE_DETECT,TOKENIZE,POS_TAG
+```
+
+`GetServiceInfo` returns available `profile_id` values.
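+
As an illustration of how such properties could be read, here is a small parser sketch. It assumes the `profile.<profile_id>.<attribute>=value` key layout shown above and that profile IDs contain no dots; `parse_profiles` is a hypothetical helper, not part of the sandbox code.

```python
def parse_profiles(lines):
    """Parse 'profile.<id>.<attr>=value' lines into {profile_id: {attr: value}}."""
    profiles = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        parts = key.strip().split(".")
        if len(parts) != 3 or parts[0] != "profile":
            continue  # not a profile property
        _, profile_id, attr = parts
        entry = profiles.setdefault(profile_id, {})
        if attr == "steps":
            # Steps are a comma-separated list; keep the configured order.
            entry["steps"] = [s.strip() for s in value.split(",") if s.strip()]
        else:
            entry[attr] = value.strip()
    return profiles

config = [
    "profile.en-basic.bundle_id=en-default",
    "profile.en-basic.steps=SENTENCE_DETECT,TOKENIZE,POS_TAG",
]
profiles = parse_profiles(config)
available_profile_ids = sorted(profiles)
```

The sorted keys are what a `GetServiceInfo` handler might report as available profile IDs in this sketch.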
+
+### 6.4 Thread safety
+
+OpenNLP 3.0 documents its `*ME` classes as thread-safe. The server therefore holds **one instance per loaded model** in `ModelBundleCache`, shared across gRPC executor threads.
+
+### 6.5 Stateful NER and adaptive data
+
+Certain OpenNLP components (notably `NameFinderME` / `TokenNameFinder`)
maintain "adaptive data" that can improve consistency *within a single
document* (e.g., once "John" is tagged as a person early in a long text, later
mentions of "John" can benefit from that context).
`TokenNameFinder.clearAdaptiveData()` resets this state.
+
+In the gRPC contract:
+
+- `AnalysisOptions.clear_adaptive_data` (default: `true`) controls whether the
server calls `clearAdaptiveData()` on applicable components **after**
processing the current `AnalyzeDocument` request.
+- `true` (the default) ensures that each RPC is independent with respect to
adaptive state. This matches the common expectation of a stateless
document-centric API.
+- `false` leaves the adaptive state in the cached `*ME` instance for the
bundle. A *sequence* of calls from the same logical client/session that target
the same bundle can therefore benefit from cross-document (but
within-"session") adaptive hints. This is an advanced, opt-in behavior and is
not the normal mode for the 1:1 document contract.
+
+### 6.6 Stateless RPC contract
+
+Each `AnalyzeDocument` call is a self-contained, stateless operation on the
wire: one `raw_text` document in, one enriched `OpenNlpDocument` (plus
diagnostics) out. There is no session, cursor, or cross-call mutable state in
the public contract.
+
+Adaptive data (6.5) and any internal caches (model instances, bundle
resolution) are implementation details of the server-side processor and the
specific OpenNLP components. They are scoped to a loaded bundle inside the
server process and do not leak into the protobuf messages or require clients to
manage server-side sessions.
+
+If a deployment needs stateful document sequences (for example, a long-running
"conversation" or a large report split across multiple calls that should share
NER adaptive data), it can do so by:
+
+- Using the same `bundle_id` / profile.
+- Setting `clear_adaptive_data=false` for the duration of the sequence.
+- Managing its own correlation (e.g. via `doc_id` or metadata) and eventually
calling with `clear_adaptive_data=true` (or a fresh bundle context) to reset.
+
+Cross-client or long-lived shared mutable state across unrelated documents is
strongly discouraged and outside the intended use of the v1 contract.
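+
The effect of `clear_adaptive_data` can be shown with a toy model. `ToyNameFinder` below is a deliberately simplified, hypothetical stand-in for a component with adaptive data; it is not how `NameFinderME` works internally, and `analyze` is an assumed wrapper, not the proposed RPC handler.

```python
class ToyNameFinder:
    """Toy component with adaptive state; NOT OpenNLP's NameFinderME."""

    def __init__(self):
        self.adaptive = set()

    def find(self, tokens):
        hits = []
        for i, token in enumerate(tokens):
            # In this toy model a preceding "Mr." is the only direct evidence.
            if i > 0 and tokens[i - 1] == "Mr.":
                self.adaptive.add(token)
            # Anything remembered from earlier text is tagged as well.
            if token in self.adaptive:
                hits.append(token)
        return hits

    def clear_adaptive_data(self):
        self.adaptive.clear()

def analyze(finder, tokens, clear_adaptive_data=True):
    """Run the finder, then reset adaptive state unless the caller opts out."""
    try:
        return finder.find(tokens)
    finally:
        if clear_adaptive_data:
            finder.clear_adaptive_data()

finder = ToyNameFinder()
first = analyze(finder, ["Mr.", "Smith", "spoke"])                        # default: reset after
second = analyze(finder, ["Smith", "left"])                               # no memory of Smith
third = analyze(finder, ["Mr.", "Smith"], clear_adaptive_data=False)      # keep state
fourth = analyze(finder, ["Smith", "left"], clear_adaptive_data=False)    # benefits from it
```

With the default, the second call finds nothing; with `clear_adaptive_data=False`, the fourth call still recognizes `Smith` from the previous document in the sequence.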
+
+---
+
+## 7. Error handling and partial results policy
+
+**gRPC status codes** are used for fatal failures that prevent a useful
response:
+
+- `INVALID_ARGUMENT`: bad request (missing `raw_text`, neither a `profile_id` nor an inline profile supplied, invalid options, etc.).
+- `NOT_FOUND`: unknown `profile_id` or `bundle_id` that cannot be resolved.
+- `INTERNAL`: unrecoverable model or orchestration error after a step has
started.
+
+**Per-step diagnostics** are always populated in
`AnalyzeDocumentResponse.diagnostics` (even on success paths) for observability:
+
+- `INFO`: step skipped because it was not requested or not applicable.
+- `WARNING`: non-fatal issue (e.g. optional NER type had no model in the
bundle; a provider fell back; low-confidence result).
+- `ERROR`: a step failed but the server chose (or was configured) to continue
with partial results.
+
+**Partial results policy (addresses community feedback)**:
+
+- The contract favors **useful partial results** for non-critical failures so
that clients (especially cross-language ones doing RAG pipelines) can still
make progress.
+- If a **required** step (as determined by `AnalysisProfile.steps` and server policy for that profile, e.g. SENTENCE_DETECT or TOKENIZE when later steps depend on them) fails with an ERROR diagnostic, the RPC **fails** with an appropriate gRPC status and the diagnostics attached. A best-effort document may still be made available for debugging, but callers must check the status first; because gRPC client libraries generally do not surface a response message alongside a non-OK status on a unary call, such debugging data would have to travel in status details or trailing metadata.
+- If an **optional** or **best-effort** step fails (e.g. a particular NER entity type, CHUNK when the profile treats it as enrichment, or an EMBED provider that is temporarily unavailable), the server returns `OK` with:
+ - The document populated as far as successful steps reached.
+ - One or more `ProcessingDiagnostic` entries with severity `ERROR` or
`WARNING` and `component_key` identifying the failing piece (e.g. "ner_person",
"embed_minilm").
+- Profiles and future `AnalysisOptions` (e.g. a `strict` flag) can influence
what is treated as required vs. optional. The default is pragmatic partial
success for enrichment steps.
+- Clients should always inspect `diagnostics` rather than assuming a
successful status code means every requested step produced perfect output.
+
+This policy is intentionally documented early so that Python/Go/Rust/etc.
clients written against v1 have predictable behavior. The sandbox
implementation will include tests that exercise both full-success and
partial-failure paths.
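+
A client following this policy might triage diagnostics along these lines. The `Diagnostic` dataclass is a hypothetical stand-in for `ProcessingDiagnostic`, using only the severity and `component_key` notions described above; the helper names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Diagnostic:
    """Hypothetical stand-in for ProcessingDiagnostic."""
    severity: str        # "INFO", "WARNING", or "ERROR"
    component_key: str
    message: str = ""

def summarize(diagnostics):
    """Group component keys by severity so callers can decide what to trust."""
    summary = {"INFO": [], "WARNING": [], "ERROR": []}
    for diag in diagnostics:
        summary.setdefault(diag.severity, []).append(diag.component_key)
    return summary

def is_complete(diagnostics):
    """True only when no step reported an ERROR, even if the RPC status was OK."""
    return not any(diag.severity == "ERROR" for diag in diagnostics)

diagnostics = [
    Diagnostic("INFO", "langdetect", "step not requested"),
    Diagnostic("WARNING", "ner_org", "no model in bundle"),
    Diagnostic("ERROR", "embed_minilm", "provider unavailable"),
]
summary = summarize(diagnostics)
```

Here an `OK` response would still be treated as partial, because `embed_minilm` reported an error; an indexing job could keep the linguistic annotations and retry only the embedding step.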
+
+---
+
+## 8. Versioning and compatibility
+
+- Package path includes `v1`; breaking changes require `v2` package.
+- Use `reserved` for removed fields; never reuse field numbers.
+- `GetServiceInfoResponse.api_version` reports proto/API version string (e.g.
`"1.0.0"`).
+- Sandbox `opennlp` package services: **not** wire-compatible; migration guide
in Phase 2.
+
+---
+
+## 9. Security and operations (deployment)
+
+Out of scope for proto, noted for implementers:
+
+- TLS termination at load balancer or Netty server
+- Model path isolation and read-only mounts
+- Resource limits (max `raw_text` size, deadline per RPC)
+- gRPC reflection optional (sandbox supported via config)
+
+---
+
+## 10. Future phases (implementation - not Phase 1)
+
+
+| Phase | Deliverable |
+| ----- | --- |
+| 2a | `opennlp-grpc-api` module, codegen, proto tests (including CHUNK/EMBED shapes) |
+| 2b | Pure-Java processor/orchestrator + `CharSpanMapper` + basic CHUNK (sentence+overlap segmentation) + EMBED via `SentenceVectorsDL`; unit tests |
+| 2c | gRPC server host (delegating to processor), config for bundles/profiles, model discovery for both classic + ONNX embedding artifacts, integration tests, sandbox port + updated Python example |
+| 2d | Graduate modules to `apache/opennlp` (after community review); optional core `Document` interface alignment if not already landed in 3.0.0-M4 |
+| 3 | Provider SPI hardening; first CUDA (`opennlp-dl-gpu`) and OpenVINO provider modules as separate optional builds; hot-swap / priority selection examples |
+| 4 | Richer bundle discovery (languages, supported steps per bundle in `ModelBundleInfo`), streaming variants of AnalyzeDocument or dedicated chunk/embed streams, additional steps (PARSE, LEMMATIZE, SENTIMENT, etc.) if not already in 2.x |
+| Later | Advanced semantic chunking driven by embeddings, more inference backends (DJL direct, remote KServe via the same provider abstraction), Buf Schema Registry publication, official multi-language client examples beyond Python |
+
+
+---
+
+## 11. Full protobuf definitions (Phase 1 deliverable)
+
+**Note:** The authoritative .proto sources that will be used to start the sandbox implementation
+live under `opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/`. The blocks below
+are the canonical text form for the RFC and are kept in sync with those files.
+
+### 11.1 `opennlp_document.proto`
+
+```protobuf
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+// Canonical 1:1 NLP document: text in, annotations out.
+// The base linguistic structure (sentences + tokens + entities + classic
+// chunks etc.) is the shared "backbone" - computed once when any CHUNK or
+// EMBED step is active.
+// Multiple independent groups of (chunks + embeddings) can be produced in one
+// AnalyzeDocument run. Shared NLP analysis is computed once; each group is
+// traceable by its chunk/embedding config or profile fragment and attaches
+// vectors via source spans (a chunk or sentence region can have embeddings
+// from more than one model).
+message OpenNlpDocument {
+ string doc_id = 1;
+ string raw_text = 2;
+ optional string detected_language = 3;
+ optional float language_confidence = 4;
+ repeated AnnotatedSentence sentences = 5;
+ optional DocumentAnalytics analytics = 6;
+ map<string, string> metadata = 7;
+ // Denormalized "all embeddings with spans" view (optional convenience).
+ repeated EmbeddingResult embeddings = 8;
+ optional DocumentClassification classification = 9;
+
+ // Multiple chunk + embedding result groups from this analysis (the primary
+ // way to carry more than one chunking strategy in one document).
+ // Shared linguistic backbone (sentences above) is computed once; each group
+ // applies its chunking strategy and named embedding models independently.
+ repeated ChunkEmbeddingGroup chunk_embedding_groups = 10;
+}
+
+// A named, traceable group of chunks produced by one chunking strategy,
+// with the requested embedding models attached *inside* each chunk.
+//
+// Chunking strategy lives at the group level. Embeddings live inside the
+// chunks (multiple models per chunk, as named in the corresponding
+// ChunkEmbedConfigEntry for this strategy). Chunking always precedes the
+// embedding attachment in the model.
+//
+// This gives the "repeat the actual chunk" (text + span) with its embeddings
+// as facets inside it.
+message ChunkEmbeddingGroup {
+ // Stable identifier for this group (from the request's config_id).
+ string group_id = 1;
+
+ // The chunking configuration / strategy that produced these chunks.
+ optional string chunk_config_id = 2;
+
+ // The embedding model IDs that were explicitly requested for this chunking
+ // strategy (copied from the request entry for traceability). Each Chunk in
+ // this group will have exactly these models' EmbeddingResult entries inside
+ // it (or a subset if some failed with diagnostics).
+ repeated string embedding_model_ids = 3;
+
+ // Optional human name for the result set.
+ optional string result_set_name = 4;
+
+ // The chunks for this group. Each chunk carries its span (into raw_text),
+ // optional tag, the actual text content (repeated for client convenience),
+ // and the multiple embedding models attached directly to it.
+ repeated Chunk chunks = 5;
+
+ // Optional per-group metadata (timing, counts, provenance, etc.).
+ map<string, string> metadata = 6;
+
+ // Primary granularity for the chunks/vectors in this group (CHUNK for
+ // segmentation-style groups, SENTENCE, etc.).
+ optional EmbeddingGranularity granularity = 7;
+}
+
+// A chunk (segmentation or otherwise) with its embeddings attached inside.
+// This is the "chunk owns its embedding models" shape (chunking first).
+message Chunk {
+ CharSpan char_span = 1;
+ optional string chunk_tag = 2;
+
+ // The text content of the chunk (substring of the document's raw_text).
+ // Repeated for convenience so clients (especially non-Java) do not have to
+ // slice the original document text. The authoritative location is still
+ // given by char_span over the top-level raw_text.
+ optional string text_content = 3;
+
+ // The embeddings for this chunk from the models named for the containing
+ // group.
+ // Multiple models per chunk are supported and expected when the chunking
+ // strategy requested several embedding_model_ids.
+ repeated EmbeddingResult embeddings = 4;
+
+ // Optional: if this chunk overlaps or contains specific sentences, the
+ // indices (0-based into the document's sentences list) can be recorded here
+ // for easy navigation without re-computing overlaps from spans.
+ repeated int32 contained_sentence_indices = 5;
+}
+
+message AnnotatedSentence {
+ CharSpan sentence_span = 1;
+ repeated Token tokens = 2;
+ repeated NamedEntity entities = 3;
+ optional ChunkResult syntactic_chunks = 4;
+ optional ParseTree parse_tree = 5;
+ optional string sentiment_label = 6;
+ optional float sentiment_confidence = 7;
+}
+
+message Token {
+ string text = 1;
+ CharSpan char_span = 2;
+ optional string pos_tag = 3;
+ optional string lemma = 4;
+ optional float pos_probability = 5;
+}
+
+message NamedEntity {
+ CharSpan char_span = 1;
+ string entity_type = 2;
+ optional double probability = 3;
+}
+
+message CharSpan {
+ int32 start = 1;
+ int32 end = 2;
+ CoordinateSpace space = 3;
+ optional string type = 4;
+ optional double probability = 5;
+}
+
+enum CoordinateSpace {
+ COORDINATE_SPACE_UNSPECIFIED = 0;
+ CHAR_DOCUMENT = 1;
+ TOKEN_SENTENCE = 2;
+}
+
+message DocumentAnalytics {
+ int32 total_tokens = 1;
+ int32 total_sentences = 2;
+ float noun_density = 3;
+ float verb_density = 4;
+ float adjective_density = 5;
+ float adverb_density = 6;
+ float content_word_ratio = 7;
+ int32 unique_lemma_count = 8;
+ float lexical_density = 9;
+}
+
+// Lightweight syntactic chunks (e.g. from ChunkerME) attached per
+// AnnotatedSentence. Distinct from the configurable segmentation chunks in
+// ChunkEmbeddingGroup.
+message ChunkResult {
+ repeated ChunkSpan chunks = 1;
+}
+
+message ChunkSpan {
+ CharSpan char_span = 1;
+ string chunk_tag = 2;
+}
+
+message ParseTree {
+ ParseNode root = 1;
+}
+
+message ParseNode {
+ string label = 1;
+ CharSpan span = 2;
+ repeated ParseNode children = 3;
+ optional double probability = 4;
+}
+
+message EmbeddingResult {
+ string model_id = 1;
+ repeated float vector = 2;
+ CharSpan source_span = 3;
+ EmbeddingGranularity granularity = 4;
+}
+
+enum EmbeddingGranularity {
+ EMBEDDING_GRANULARITY_UNSPECIFIED = 0;
+ DOCUMENT = 1;
+ SENTENCE = 2;
+ // Embeddings attached to (segmentation or syntactic) chunks produced by a
+ // CHUNK step or by a ChunkEmbeddingGroup. This enables the "one chunk,
+ // multiple embedding aspects" and multi-group use case (different chunker
+ // configs or embed models can each produce their own group with
+ // CHUNK-granularity vectors).
+ CHUNK = 3;
+ // Future: paragraph, section, or custom spans. Consumers should match on
+ // this enum (plus group metadata) rather than string-parsing config ids.
+ reserved 4 to 10;
+}
+
+message DocumentClassification {
+ string best_category = 1;
+ map<string, double> category_scores = 2;
+}
+```
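The `contained_sentence_indices` field on `Chunk` can be derived from spans alone. The following is a minimal Python sketch, assuming half-open `[start, end)` character offsets and plain tuples in place of generated protobuf classes; the helper names are hypothetical and not part of the proposed API.

```python
from typing import List, Tuple

# Half-open [start, end) character offsets into raw_text (assumed convention).
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def contained_sentence_indices(chunk: Span, sentences: List[Span]) -> List[int]:
    """0-based indices of the sentences that a chunk overlaps or contains."""
    return [i for i, sent in enumerate(sentences) if overlaps(chunk, sent)]


raw_text = "John works at OpenNLP in New York. The team builds NLP tools."
sentences = [(0, 34), (35, 61)]  # illustrative sentence spans over raw_text

print(contained_sentence_indices((0, 61), sentences))  # [0, 1]
print(contained_sentence_indices((0, 34), sentences))  # [0]
```

Because the spans are half-open, a chunk that ends exactly where a sentence begins does not count as overlapping it.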
+
+### 11.2 `opennlp_pipeline.proto`
+
+```protobuf
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+import "opennlp_document.proto";
+
+enum PipelineStep {
+ PIPELINE_STEP_UNSPECIFIED = 0;
+ LANGUAGE_DETECT = 1;
+ SENTENCE_DETECT = 2;
+ TOKENIZE = 3;
+ POS_TAG = 4;
+ NER = 5;
+ CHUNK = 6;
+ PARSE = 7;
+ LEMMATIZE = 8;
+ DOC_CATEGORIZE = 9;
+ SENTIMENT = 10;
+ EMBED = 11;
+}
+
+enum POSTagFormat {
+ POS_TAG_FORMAT_UNSPECIFIED = 0;
+ UD = 1;
+ PENN = 2;
+ CUSTOM = 3;
+}
+
+enum InferenceBackend {
+ INFERENCE_BACKEND_UNSPECIFIED = 0;
+ OPENNLP_ME = 1;
+ ONNX_RUNTIME = 2;
+ ONNX_RUNTIME_GPU = 3;
+ reserved 4 to 9;
+ reserved "OPENVINO", "DJL";
+}
+
+message AnalysisProfile {
+ string profile_id = 1;
+ repeated PipelineStep steps = 2;
+ ModelBundleRef model_bundle = 3;
+ POSTagFormat pos_tag_format = 4;
+ repeated string ner_entity_types = 5;
+}
+
+message ModelBundleRef {
+ string bundle_id = 1;
+ map<string, string> component_keys = 2;
+}
+
+message AnalysisOptions {
+ bool include_probabilities = 1;
+ bool clear_adaptive_data = 2;
+ InferenceBackend inference_backend = 3;
+ optional int32 max_text_length = 4;
+ optional string onnx_embedding_model_id = 5;
+}
+
+message ModelDescriptor {
+ string hash = 1;
+ string name = 2;
+ string locale = 3;
+ string component_type = 4;
+ // Discovery aids (additive; populated by server for ListModelBundles)
+ repeated string languages = 5; // e.g. ["en", "eng"]
+ repeated PipelineStep supported_steps = 6;
+ map<string, string> attributes = 7; // free-form (e.g. "dim":"384", "task":"embed")
+}
+
+message ModelBundleInfo {
+ string bundle_id = 1;
+ repeated ModelDescriptor models = 2;
+ // Optional aggregated view for convenience
+ repeated string supported_languages = 3;
+ repeated PipelineStep supported_steps = 4;
+}
+```
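The "aggregated view" on `ModelBundleInfo` could be computed from the per-model descriptors. This is a hedged sketch only: the dataclass below is a stand-in for the generated `ModelDescriptor` type, the sample entries are invented, and taking the sorted union is an assumption about how a server might fill `supported_languages` and `supported_steps`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ModelDescriptor:
    """Stand-in for the generated message; only the discovery fields are kept."""
    name: str
    languages: List[str] = field(default_factory=list)
    supported_steps: List[str] = field(default_factory=list)


def aggregate(models: List[ModelDescriptor]) -> Tuple[List[str], List[str]]:
    """Return (supported_languages, supported_steps) as sorted, de-duplicated lists."""
    languages = sorted({lang for m in models for lang in m.languages})
    steps = sorted({step for m in models for step in m.supported_steps})
    return languages, steps


# Hypothetical bundle contents, for illustration only.
models = [
    ModelDescriptor("tokenizer", ["en"], ["TOKENIZE"]),
    ModelDescriptor("pos", ["en"], ["POS_TAG"]),
    ModelDescriptor("embedder", ["en", "de"], ["EMBED"]),
]
languages, steps = aggregate(models)
print(languages)  # ['de', 'en']
print(steps)      # ['EMBED', 'POS_TAG', 'TOKENIZE']
```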
+
+### 11.3 `opennlp_service.proto`
+
+```protobuf
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+import "opennlp_document.proto";
+import "opennlp_pipeline.proto";
+
+service OpenNlpAnalysisService {
+ rpc AnalyzeDocument(AnalyzeDocumentRequest) returns (AnalyzeDocumentResponse);
+ rpc GetServiceInfo(GetServiceInfoRequest) returns (GetServiceInfoResponse);
+ rpc ListModelBundles(ListModelBundlesRequest) returns (ListModelBundlesResponse);
+}
+
+message AnalyzeDocumentRequest {
+ OpenNlpDocument document = 1;
+ // Single profile (classic usage). When multi-config is used (see below),
+ // this may be omitted or treated as a default/base profile.
+ AnalysisProfile profile = 2;
+ AnalysisOptions options = 3;
+ optional string profile_id = 4;
+
+ // Multiple chunk/embed (or full profile) configurations for a single run.
+ //
+ // Chunking strategy always comes first. For each chunking strategy (config
+ // entry) the caller explicitly names the embedding models to apply to the
+ // chunks produced by that strategy. This is *not* an automatic full NxM
+ // cartesian product across all chunkers and all embedders unless the caller
+ // requests it.
+ //
+ // The server runs the common linguistic pipeline steps (SENTENCE_DETECT,
+ // TOKENIZE, POS_TAG, NER, etc.) only once (shared base structure in
+ // AnnotatedSentence), then for each requested chunking strategy produces a
+ // ChunkEmbeddingGroup containing the chunks (with the actual chunk text/span
+ // repeated for convenience) and the requested embeddings attached *inside*
+ // each chunk.
+ //
+ // See Chunk and ChunkEmbeddingGroup in opennlp_document.proto.
+ repeated ChunkEmbedConfigEntry chunk_embed_configs = 5;
+}
+
+// Entry for a single chunking strategy + the specific embeddings wanted for
+// it. Chunking first; embeddings are named per strategy.
+message ChunkEmbedConfigEntry {
+ // Stable id for the group that will be produced (becomes group_id).
+ // Example: "body-token-512" or "title-sentences".
+ string config_id = 1;
+
+ // Optional display name for the result set (e.g. "body_chunks_minilm").
+ optional string result_set_name = 2;
+
+ // Full profile (if you need complex step composition) OR the lightweight
+ // chunking spec (ChunkingSpec, defined in the opennlp_pipeline.proto source).
+ optional AnalysisProfile profile = 3;
+ optional ChunkingSpec chunking = 4;
+
+ // Explicit list of embedding models to run for the chunks of *this*
+ // chunking strategy. The reply will attach exactly these models' vectors
+ // inside each Chunk in the group. Order here can be used to order the
+ // repeated embeddings on each chunk if desired.
+ repeated string embedding_model_ids = 5;
+}
+
+message AnalyzeDocumentResponse {
+ OpenNlpDocument document = 1;
+ repeated ProcessingDiagnostic diagnostics = 2;
+}
+
+message ProcessingDiagnostic {
+ PipelineStep step = 1;
+ string message = 2;
+ DiagnosticSeverity severity = 3;
+ optional string component_key = 4;
+}
+
+enum DiagnosticSeverity {
+ DIAGNOSTIC_SEVERITY_UNSPECIFIED = 0;
+ INFO = 1;
+ WARNING = 2;
+ ERROR = 3;
+}
+
+message GetServiceInfoRequest {}
+
+message GetServiceInfoResponse {
+ string opennlp_version = 1;
+ string api_version = 2;
+ repeated string available_profile_ids = 3;
+ repeated PipelineStep supported_steps = 4;
+}
+
+message ListModelBundlesRequest {}
+
+message ListModelBundlesResponse {
+ repeated ModelBundleInfo bundles = 1;
+}
+```
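The "chunking first, embeddings named per strategy" rule means the server's embedding work is the sum of the per-entry model lists, not a product over all chunkers and all embedders. A minimal sketch of that expansion, assuming request entries are plain dictionaries keyed by the proto field names; `plan_embedding_jobs` is a hypothetical helper, and rejecting duplicate `config_id` values is an assumption rather than something the RFC specifies.

```python
from typing import Dict, List, Tuple


def plan_embedding_jobs(chunk_embed_configs: List[Dict]) -> List[Tuple[str, str]]:
    """Return (config_id, model_id) pairs, one per explicitly requested model."""
    seen = set()
    jobs = []
    for entry in chunk_embed_configs:
        config_id = entry["config_id"]
        # Assumption: config_id becomes group_id, so it should be unique per request.
        if config_id in seen:
            raise ValueError(f"duplicate config_id: {config_id}")
        seen.add(config_id)
        for model_id in entry.get("embedding_model_ids", []):
            jobs.append((config_id, model_id))
    return jobs


configs = [
    {"config_id": "sentence-chunks", "embedding_model_ids": ["minilm-l6-v2"]},
    {"config_id": "fixed-window", "embedding_model_ids": ["minilm-l6-v2", "e5-small"]},
]
jobs = plan_embedding_jobs(configs)
print(len(jobs))  # 3, not the 2 x 2 = 4 of a full cartesian product
```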
+
+### 11.4 Reserved legacy package (optional, Phase 2 discussion)
+
+If the PMC requires sandbox compatibility, move the existing services to:
+
+`package org.apache.opennlp.grpc.legacy.v1;` - **unchanged wire format** from the sandbox for one release, deprecated in favor of `OpenNlpAnalysisService`.
+
+---
+
+## 12. Example request/response
+
+### 12.1 Basic profile request (JSON representation for documentation)
+
+```json
+{
+ "document": {
+ "doc_id": "doc-001",
+ "raw_text": "John works at OpenNLP in New York.",
+ "metadata": { "source": "example" }
+ },
+ "profile_id": "en-basic",
+ "options": {
+ "include_probabilities": true,
+ "clear_adaptive_data": true
+ }
+}
+```
+
+### 12.2 Multi-group chunk + embed request
+
+Two chunking strategies, each with explicitly named embedding models (not an automatic cartesian product):
+
+```json
+{
+ "document": {
+ "doc_id": "doc-002",
+ "raw_text": "John works at OpenNLP in New York. The team builds NLP tools."
+ },
+ "profile_id": "en-basic",
+ "chunk_embed_configs": [
+ {
+ "config_id": "sentence-chunks",
+ "chunking": { "algorithm": "sentence" },
+ "embedding_model_ids": ["minilm-l6-v2"]
+ },
+ {
+ "config_id": "fixed-window",
+ "chunking": { "algorithm": "character", "chunk_size": 128, "chunk_overlap": 16 },
+ "embedding_model_ids": ["minilm-l6-v2", "e5-small"]
+ }
+ ]
+}
+```
+
+### 12.3 Response (excerpt - multi-group)
+
+```json
+{
+ "document": {
+ "doc_id": "doc-002",
+ "raw_text": "John works at OpenNLP in New York. The team builds NLP tools.",
+ "detected_language": "eng",
+ "sentences": [
+ {
+ "sentence_span": { "start": 0, "end": 34, "space": "CHAR_DOCUMENT" },
+ "tokens": [
+ { "text": "John", "char_span": { "start": 0, "end": 4, "space": "CHAR_DOCUMENT" }, "pos_tag": "PROPN" }
+ ],
+ "entities": [
+ { "char_span": { "start": 0, "end": 4, "space": "CHAR_DOCUMENT" }, "entity_type": "person" }
+ ]
+ }
+ ],
+ "chunk_embedding_groups": [
+ {
+ "group_id": "sentence-chunks",
+ "chunk_config_id": "sentence-chunks",
+ "embedding_model_ids": ["minilm-l6-v2"],
+ "chunks": [
+ {
+ "char_span": { "start": 0, "end": 34, "space": "CHAR_DOCUMENT" },
+ "text_content": "John works at OpenNLP in New York.",
+ "embeddings": [
+ {
+ "model_id": "minilm-l6-v2",
+ "vector": [0.12, -0.04, 0.33],
+ "source_span": { "start": 0, "end": 34, "space": "CHAR_DOCUMENT" },
+ "granularity": "CHUNK"
+ }
+ ]
+ }
+ ]
+ },
+ {
+ "group_id": "fixed-window",
+ "chunk_config_id": "fixed-window",
+ "embedding_model_ids": ["minilm-l6-v2", "e5-small"],
+ "chunks": [
+ {
+ "char_span": { "start": 0, "end": 61, "space": "CHAR_DOCUMENT" },
+ "text_content": "John works at OpenNLP in New York. The team builds NLP tools.",
+ "embeddings": [
+ { "model_id": "minilm-l6-v2", "vector": [0.11, -0.03, 0.31], "granularity": "CHUNK" },
+ { "model_id": "e5-small", "vector": [0.09, 0.02, 0.28], "granularity": "CHUNK" }
+ ]
+ }
+ ]
+ }
+ ]
+ },
+ "diagnostics": []
+}
+```
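A client consuming a multi-group response will typically want the vectors keyed by group and model, and may want to check that the convenience copy in `text_content` agrees with the authoritative `char_span`. The sketch below assumes the response has already been decoded into plain dictionaries with the proto field names and that spans are half-open; `index_vectors` is a hypothetical helper, not part of the proposed API.

```python
from typing import Dict, List, Tuple


def index_vectors(document: Dict) -> Dict[Tuple[str, str], List[tuple]]:
    """Map (group_id, model_id) to a list of ((start, end), vector) pairs."""
    raw_text = document["raw_text"]
    index: Dict[Tuple[str, str], List[tuple]] = {}
    for group in document.get("chunk_embedding_groups", []):
        for chunk in group.get("chunks", []):
            span = chunk["char_span"]
            start, end = span["start"], span["end"]
            # text_content is a convenience copy; char_span stays authoritative.
            text = chunk.get("text_content")
            if text is not None and text != raw_text[start:end]:
                raise ValueError(f"text_content disagrees with span [{start}, {end})")
            for emb in chunk.get("embeddings", []):
                key = (group["group_id"], emb["model_id"])
                index.setdefault(key, []).append(((start, end), emb["vector"]))
    return index


# Illustrative document; the vector values are placeholders, not real model output.
document = {
    "raw_text": "John works at OpenNLP in New York.",
    "chunk_embedding_groups": [
        {
            "group_id": "sentence-chunks",
            "chunks": [
                {
                    "char_span": {"start": 0, "end": 34, "space": "CHAR_DOCUMENT"},
                    "text_content": "John works at OpenNLP in New York.",
                    "embeddings": [
                        {"model_id": "minilm-l6-v2", "vector": [0.12, -0.04, 0.33]}
                    ],
                }
            ],
        }
    ],
}
vectors = index_vectors(document)
print(vectors[("sentence-chunks", "minilm-l6-v2")])
```

Keying on `(group_id, model_id)` keeps results from different chunking strategies separate even when they share an embedding model.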
+
+---
+
+## 13. Mapping to Java API (implementation reference)
+
+
+| PipelineStep | Java type |
+| --------------- | ------------------------------------------------- |
+| LANGUAGE_DETECT | `LanguageDetectorME` |
+| SENTENCE_DETECT | `SentenceDetectorME` |
+| TOKENIZE | `TokenizerME` |
+| POS_TAG | `POSTaggerME` |
+| NER | `NameFinderME` (per type) |
+| CHUNK | `ChunkerME` |
+| PARSE | `Parser` |
+| LEMMATIZE | `LemmatizerME` |
+| DOC_CATEGORIZE | `DocumentCategorizerME` / `DocumentCategorizerDL` |
+| SENTIMENT | `SentimentME` |
+| EMBED | `SentenceVectorsDL` |
+
+
+---
+
+## 14. Open questions
+
+1. Maximum `raw_text` size - fixed limit vs streaming (streaming deferred).
+2. `profile_id` vs inline `AnalysisProfile` - both are supported; proposed precedence rule: should the inline profile override the server-side profile when `profile_id` is also set?
+3. Batch RPC `AnalyzeDocuments` for throughput - v1 or v2?
+4. Publish protos to Buf BSR under `buf.build/apache/opennlp`?
+
+---
+
+## 15. Changelog
+
+
+| Version | Date | Changes |
+| ------- | ---------- | ------- |
+| 0.5 | 2026-06-06 | Expand conversational Summary: motivation, what the document-centric gRPC API unlocks (polyglot integration, streaming, shared infrastructure, search/RAG). |
+| 0.4 | 2026-06-06 | Restructure into Part I (overview + target architecture diagram) and Part II (specs). Remove external platform references. Clarify chunk-first / embeddings-inside-chunk model. Fix duplicate proto appendix messages. Add multi-group §12 examples. |
+| 0.3 | 2026-06-06 | Canonical sandbox doc; multi-group chunk+embed (`ChunkEmbeddingGroup`, `ChunkEmbedConfigEntry`, `EmbeddingGranularity.CHUNK`). |
+| 0.2 | 2026-06-06 | Incorporate initial dev@ feedback (Martin Wiesner, Richard Zowalla): neutral core `Document` interface proposal; sandbox-first + Maven only; retain legacy granular services; target 3.1.x more likely; make CHUNK + EMBED explicit v1 with GPU hot-swap provider story; expand ModelBundle discovery; define partial-results policy; clarify stateless contract vs. adaptive data; update goals, background, phases, and add Community RFC feedback section. |
+| 0.1 | 2026-05-21 | Initial Phase 1 design + full protos |
+
diff --git a/opennlp-grpc/docs/rfc/opennlp-grpc-jira-proposal.md b/opennlp-grpc/docs/rfc/opennlp-grpc-jira-proposal.md
new file mode 100644
index 00000000..0e27028b
--- /dev/null
+++ b/opennlp-grpc/docs/rfc/opennlp-grpc-jira-proposal.md
@@ -0,0 +1,233 @@
+# JIRA Proposal: Document-Centric gRPC API for Apache OpenNLP 3.x
+
+> **Copy-paste guide:** Use the sections below when filing an issue at https://issues.apache.org/jira/projects/OPENNLP
+> **Issue type:** Improvement / New Feature
+> **Component:** (suggested) `grpc` or `server` if available; otherwise leave default
+> **Affects Version:** 3.0.0-SNAPSHOT
+> **Labels:** `grpc`, `rfc`, `api-design`
+
+---
+
+## Summary (JIRA title field)
+
+**Add document-centric gRPC API - evolve opennlp-sandbox POC with canonical OpenNlpDocument and AnalyzeDocument RPC**
+
+---
+
+## Description (paste into JIRA description)
+
+### Problem
+
+Apache OpenNLP is primarily an **in-process Java library** (API, CLI, UIMA). The README notes embedding in distributed pipelines (Flink, NiFi, Spark), but there is **no standard wire contract** for cross-language clients or remote inference.
+
+A proof-of-concept exists in the sandbox:
+
+- **Repository:** https://github.com/apache/opennlp-sandbox/tree/main/opennlp-grpc
+- **Current scope:** Three separate gRPC services (`SentenceDetectorService`, `TokenizerTaggerService`, `PosTaggerService`) with string-based requests and a `model_hash` per call
+- **Gap:** There is no unified **document** message and no pipeline orchestration, so clients must chain multiple RPCs. The proposal brings NER, chunking (configurable segmentation + classic ChunkerME), and embeddings (via SentenceVectorsDL + pluggable GPU providers) into the single document-centric contract as first-class steps.
+
+Main OpenNLP (`apache/opennlp`) has **no gRPC modules** on `main`. OpenNLP 3.0 brings thread-safe `*ME` classes (JDK 21+), which makes a long-lived gRPC server practical. The `opennlp-dl` / `opennlp-dl-gpu` modules already support ONNX inference (including sentence embeddings via `SentenceVectorsDL`).
+
+### Proposal
+
+Evolve the sandbox POC into ASF-native modules (target: main repo after consensus):
+
+| Module | Purpose |
+|--------|---------|
+| `opennlp-grpc-api` | Protocol Buffers + generated stubs (Java first; descriptors for other languages) |
+| `opennlp-grpc-server` | gRPC server, model bundle registry, pipeline orchestration |
+| `opennlp-grpc-examples` | Sample clients (e.g. Python) |
+
+**Core API change:** Introduce a canonical **`OpenNlpDocument`** message (1:1 text document in, enriched document out) and a primary **`AnalyzeDocument`** RPC that runs a configurable NLP pipeline server-side. It is similar in spirit to the existing UIMA `OpenNlpTextAnalyzer` composite, but exposed as a language-neutral contract.
+
+**Package naming (proposed):** `org.apache.opennlp.grpc.v1`
+
+### Non-goals (v1 RFC)
+
+- Binary/PDF document parsing (Tika, etc.) - callers supply `raw_text`
+- Training, evaluation, or model-update RPCs
+- Embedding `.bin` model bytes in request messages (models remain server-side)
+- Authentication / multi-tenancy in the core API (deployment concern: mTLS, reverse proxy)
+- Coreference (documented in manual but not implemented in current codebase)
+
+### Compatibility
+
+- **Additive** Maven modules; no breaking changes to `opennlp-api` / `opennlp-runtime`
+- Sandbox granular services may be deprecated or moved to `org.apache.opennlp.grpc.legacy.v1` after migration
+
+### Phased delivery (high level)
+
+| Phase | Scope |
+|-------|--------|
+| **0** | This JIRA + community RFC (this ticket) |
+| **1** | Design document + full `.proto` definitions (no server code required for consensus) |
+| **2+** | Implementation: orchestrator, server, tests, graduation from sandbox to main repo |
+| **Later** | Advanced GPU provider modules (CUDA via onnxruntime-gpu, OpenVINO), richer discovery, streaming, additional steps; core `Document` interface graduation if not in 3.0.0-M4 |
+
+### Design highlights
+
+1. **Three proto layers (NLP-only):** domain types (`OpenNlpDocument`), pipeline config (`AnalysisProfile`), service (`OpenNlpAnalysisService`)
+2. **Offset contract:** All exported spans use **character offsets in the original `raw_text`** (`CHAR_DOCUMENT`), half-open `[start, end)` matching `opennlp.tools.util.Span`
+3. **Model bundles:** Replace per-RPC `model_hash` with `ModelBundleRef` + server-defined profiles (reuse sandbox model discovery patterns)
+4. **Thread safety:** Leverage OpenNLP 3.0 thread-safe `*ME` instances cached per model bundle
+
+### Sample protobuf (illustrative - full spec in design doc)
+
+The following is a **short sketch** for discussion; field numbers and optional messages may change during the RFC.
+
+**Important (per community feedback on OPENNLP-1833):** Chunking and embeddings are **in scope for v1**, not deferred. The full protobuf definitions (including `PipelineStep.CHUNK` and `EMBED`, `ChunkResult`/`ChunkSpan`, `EmbeddingResult`, `InferenceBackend`, richer `ModelBundleInfo` for discovery, etc.) live in the companion design document `docs/rfc/opennlp-grpc-design.md`. The short sketch below is intentionally minimal. GPU hot-swap (CUDA, OpenVINO) is achieved via a provider SPI beh [...]
+
+```protobuf
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+// --- Layer 1: Document ---
+
+message OpenNlpDocument {
+ string doc_id = 1;
+ string raw_text = 2;
+ optional string detected_language = 3;
+ optional float language_confidence = 4;
+ repeated AnnotatedSentence sentences = 5;
+ map<string, string> metadata = 6;
+}
+
+message AnnotatedSentence {
+ CharSpan sentence_span = 1;
+ repeated Token tokens = 2;
+ repeated NamedEntity entities = 3;
+}
+
+message Token {
+ string text = 1;
+ CharSpan char_span = 2;
+ optional string pos_tag = 3;
+}
+
+message NamedEntity {
+ CharSpan char_span = 1;
+ string entity_type = 2;
+ optional double prob = 3;
+}
+
+message CharSpan {
+ int32 start = 1;
+ int32 end = 2;
+ CoordinateSpace space = 3;
+ optional string type = 4;
+ optional double prob = 5;
+}
+
+enum CoordinateSpace {
+ COORDINATE_SPACE_UNSPECIFIED = 0;
+ CHAR_DOCUMENT = 1;
+}
+
+// --- Layer 2: Pipeline ---
+
+enum PipelineStep {
+ PIPELINE_STEP_UNSPECIFIED = 0;
+ LANGUAGE_DETECT = 1;
+ SENTENCE_DETECT = 2;
+ TOKENIZE = 3;
+ POS_TAG = 4;
+ NER = 5;
+}
+
+message AnalysisProfile {
+ string profile_id = 1;
+ repeated PipelineStep steps = 2;
+ ModelBundleRef model_bundle = 3;
+}
+
+message ModelBundleRef {
+ string bundle_id = 1;
+}
+
+message AnalysisOptions {
+ bool include_probabilities = 1;
+ bool clear_adaptive_data = 2;
+}
+
+// --- Layer 3: Service ---
+
+service OpenNlpAnalysisService {
+ rpc AnalyzeDocument(AnalyzeDocumentRequest) returns (AnalyzeDocumentResponse);
+ rpc GetServiceInfo(GetServiceInfoRequest) returns (GetServiceInfoResponse);
+}
+
+message AnalyzeDocumentRequest {
+ OpenNlpDocument document = 1;
+ AnalysisProfile profile = 2;
+ AnalysisOptions options = 3;
+}
+
+message AnalyzeDocumentResponse {
+ OpenNlpDocument document = 1;
+ repeated ProcessingDiagnostic diagnostics = 2;
+}
+
+message ProcessingDiagnostic {
+ PipelineStep step = 1;
+ string message = 2;
+ DiagnosticSeverity severity = 3;
+}
+
+enum DiagnosticSeverity {
+ DIAGNOSTIC_SEVERITY_UNSPECIFIED = 0;
+ INFO = 1;
+ WARNING = 2;
+ ERROR = 3;
+}
+
+message GetServiceInfoRequest {}
+message GetServiceInfoResponse {
+ string opennlp_version = 1;
+ string api_version = 2;
+ repeated string available_profile_ids = 3;
+}
+```
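Since `AnalyzeDocumentResponse` carries `diagnostics` next to the enriched document, a client needs some rule for deciding whether a result is usable. The following sketch assumes diagnostics decoded into dictionaries with enum names as strings, and assumes that severities can be ranked by their enum numbers; both helpers are hypothetical and the RFC does not prescribe this policy.

```python
from typing import Dict, List

# Ranking by enum number is an assumption; the proto only defines the names.
SEVERITY_ORDER = {
    "DIAGNOSTIC_SEVERITY_UNSPECIFIED": 0,
    "INFO": 1,
    "WARNING": 2,
    "ERROR": 3,
}


def worst_severity(diagnostics: List[Dict]) -> str:
    """Return the highest-ranked severity name found in the diagnostics."""
    worst = "DIAGNOSTIC_SEVERITY_UNSPECIFIED"
    for diag in diagnostics:
        severity = diag.get("severity", "DIAGNOSTIC_SEVERITY_UNSPECIFIED")
        if SEVERITY_ORDER[severity] > SEVERITY_ORDER[worst]:
            worst = severity
    return worst


def failed_steps(diagnostics: List[Dict]) -> List[str]:
    """Pipeline steps that reported at least one ERROR, sorted by name."""
    return sorted({d["step"] for d in diagnostics if d.get("severity") == "ERROR"})


# Invented diagnostics, for illustration only.
diagnostics = [
    {"step": "POS_TAG", "message": "model loaded lazily", "severity": "INFO"},
    {"step": "NER", "message": "no model for entity type", "severity": "ERROR"},
]
print(worst_severity(diagnostics))  # ERROR
print(failed_steps(diagnostics))    # ['NER']
```

A caller could, for example, accept the document when the worst severity is `WARNING` or lower and inspect `failed_steps` otherwise.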
+
+### Comparison: sandbox vs proposed
+
+| Aspect | Sandbox POC | Proposed |
+|--------|---------------|----------|
+| Services | 3 (sent / token / POS) | 1 primary (`OpenNlpAnalysisService`) |
+| I/O | Strings + `StringList` | `OpenNlpDocument` |
+| Models | `model_hash` per RPC | `ModelBundleRef` + profiles |
+| Pipeline | Client-side chaining | Server-side `AnalysisProfile` |
+| Package | `package opennlp` | `org.apache.opennlp.grpc.v1` |
+
+### References
+
+- Sandbox POC: https://github.com/apache/opennlp-sandbox/tree/main/opennlp-grpc
+- Current sandbox proto: https://github.com/apache/opennlp-sandbox/blob/main/opennlp-grpc/opennlp-grpc-api/opennlp.proto
+- UIMA composite pipeline: `opennlp-extensions/opennlp-uima/descriptors/OpenNlpTextAnalyzer.xml`
+- ONNX / GPU: `opennlp-dl`, `opennlp-dl-gpu`, `SentenceVectorsDL`
+- Full design document (companion): `docs/rfc/opennlp-grpc-design.md` in contributor branch or attachment
+
+### Questions for the community (with initial feedback summary)
+
+1. Should v1 expose **only** `AnalyzeDocument`, or retain sandbox granular RPCs under a legacy package?
+ - **Community preference (Martin + consensus direction):** Retain the existing granular services under a legacy package (`org.apache.opennlp.grpc.legacy.v1` or similar) for a transition period. New development and clients should use the primary document-centric `OpenNlpAnalysisService`.
+
+2. Target release: **3.0.x** (additive) vs **3.1**?
+ - **Community view (Martin):** More likely **3.1.x**. 3.0.0 is approaching a release (target end of June / early July 2026 or shortly thereafter). The gRPC work is substantial and additive; landing it after the 3.0 cut reduces risk.
+
+3. Preferred home: graduate into **apache/opennlp** vs remain in **opennlp-sandbox** until stable?
+ - **Community direction (Martin):** Start and iterate in the **opennlp-sandbox** (as is already underway on the feature branch). Graduate stable modules into `apache/opennlp` in future cycles once the design has had review and the implementation has proven itself. A neutral core `Document` interface (if adopted) could land earlier in 3.0.0-M4 as a small additive API change.
+
+4. Proto tooling: Maven `protobuf-maven-plugin` only, or also publish to Buf Schema Registry?
+ - **Strong community preference (Martin, Richard):** Stay with **Maven + protobuf-maven-plugin** only for consistency with the rest of the OpenNLP project. No Gradle. Buf publication can be considered later as a non-blocking enhancement.
+
+---
+
+## Reporter notes (do not paste)
+
+- Attach or link `docs/rfc/opennlp-grpc-design.md` when available
+- Discuss on [email protected] after filing
+- Link this JIRA from any sandbox PR that implements the new protos
diff --git a/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_document.proto b/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_document.proto
new file mode 100644
index 00000000..c6e9e2f6
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_document.proto
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+// Canonical 1:1 NLP document: text in, annotations out.
+// The sentences list provides the shared base linguistic analysis
+// (computed once even when multiple chunk+embed groups are requested).
+message OpenNlpDocument {
+ string doc_id = 1;
+ string raw_text = 2;
+ optional string detected_language = 3;
+ optional float language_confidence = 4;
+ repeated AnnotatedSentence sentences = 5;
+ optional DocumentAnalytics analytics = 6;
+ map<string, string> metadata = 7;
+ repeated EmbeddingResult embeddings = 8; // denormalized convenience "all vectors + spans"
+ optional DocumentClassification classification = 9;
+
+ // Primary way to carry multiple independent chunk+embedding result groups
+ // from one analysis. Each group corresponds to one chunking strategy with
+ // its explicitly requested embedding models attached inside the chunks.
+ repeated ChunkEmbeddingGroup chunk_embedding_groups = 10;
+}
+
+message AnnotatedSentence {
+ CharSpan sentence_span = 1;
+ repeated Token tokens = 2;
+ repeated NamedEntity entities = 3;
+ // Classic syntactic chunks (from ChunkerME or similar), per sentence.
+ optional ChunkResult syntactic_chunks = 4;
+ optional ParseTree parse_tree = 5;
+ optional string sentiment_label = 6;
+ optional float sentiment_confidence = 7;
+}
+
+message Token {
+ string text = 1;
+ CharSpan char_span = 2;
+ optional string pos_tag = 3;
+ optional string lemma = 4;
+ optional float pos_probability = 5;
+}
+
+message NamedEntity {
+ CharSpan char_span = 1;
+ string entity_type = 2;
+ optional double probability = 3;
+}
+
+message CharSpan {
+ int32 start = 1;
+ int32 end = 2;
+ CoordinateSpace space = 3;
+ optional string type = 4;
+ optional double probability = 5;
+}
+
+enum CoordinateSpace {
+ COORDINATE_SPACE_UNSPECIFIED = 0;
+ CHAR_DOCUMENT = 1;
+ TOKEN_SENTENCE = 2;
+}
+
+message DocumentAnalytics {
+ int32 total_tokens = 1;
+ int32 total_sentences = 2;
+ float noun_density = 3;
+ float verb_density = 4;
+ float adjective_density = 5;
+ float adverb_density = 6;
+ float content_word_ratio = 7;
+ int32 unique_lemma_count = 8;
+ float lexical_density = 9;
+}
+
+// Lightweight result for classic syntactic chunking (ChunkerME style)
+// attached to sentences. Distinct from the strategy-driven chunks below.
+message ChunkResult {
+ repeated ChunkSpan chunks = 1;
+}
+
+message ChunkSpan {
+ CharSpan char_span = 1;
+ string chunk_tag = 2;
+}
+
+// A chunk produced by a chunking strategy. The strategy and the list of
+// embedding models are declared on the containing ChunkEmbeddingGroup.
+// Embeddings for the requested models are carried inside this chunk.
+message Chunk {
+ CharSpan char_span = 1;
+ optional string chunk_tag = 2;
+ // The text of the chunk (for client convenience; authoritative bounds
+ // are given by char_span over the document raw_text).
+ optional string text_content = 3;
+ // Multiple embedding models per chunk, as requested for the group/strategy.
+ repeated EmbeddingResult embeddings = 4;
+ // Optional navigation aid: indices into the document's sentences list.
+ repeated int32 contained_sentence_indices = 5;
+}
+
+// One chunking strategy's output: the chunks (with their embeddings inside)
+// plus traceability for the strategy and the exact embedding models that
+// were asked for this strategy in the request.
+message ChunkEmbeddingGroup {
+ string group_id = 1;
+ optional string chunk_config_id = 2;
+ repeated string embedding_model_ids = 3; // exactly as named for this strategy
+ optional string result_set_name = 4;
+ repeated Chunk chunks = 5;
+ map<string, string> metadata = 6;
+ optional EmbeddingGranularity granularity = 7;
+}
+
+message ParseTree {
+ ParseNode root = 1;
+}
+
+message ParseNode {
+ string label = 1;
+ CharSpan span = 2;
+ repeated ParseNode children = 3;
+ optional double probability = 4;
+}
+
+message EmbeddingResult {
+ string model_id = 1;
+ repeated float vector = 2;
+ CharSpan source_span = 3;
+ EmbeddingGranularity granularity = 4;
+}
+
+enum EmbeddingGranularity {
+ EMBEDDING_GRANULARITY_UNSPECIFIED = 0;
+ DOCUMENT = 1;
+ SENTENCE = 2;
+ CHUNK = 3; // embeddings attached to chunks of a strategy/group
+ reserved 4 to 10;
+}
+
+message DocumentClassification {
+ string best_category = 1;
+ map<string, double> category_scores = 2;
+}
diff --git a/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_pipeline.proto b/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_pipeline.proto
new file mode 100644
index 00000000..39a7fef3
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_pipeline.proto
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+import "org/apache/opennlp/grpc/v1/opennlp_document.proto";
+
+// Pipeline steps supported by AnalyzeDocument (and future streaming variants).
+enum PipelineStep {
+ PIPELINE_STEP_UNSPECIFIED = 0;
+ LANGUAGE_DETECT = 1;
+ SENTENCE_DETECT = 2;
+ TOKENIZE = 3;
+ POS_TAG = 4;
+ NER = 5;
+ CHUNK = 6; // segmentation-style or classic syntactic chunking
+ PARSE = 7;
+ LEMMATIZE = 8;
+ DOC_CATEGORIZE = 9;
+ SENTIMENT = 10;
+ EMBED = 11;
+}
+
+// Configuration for a chunking strategy (used when the caller does not
+// supply a full AnalysisProfile for the entry).
+message ChunkingSpec {
+ // Algorithm: token, sentence, character, semantic (future), etc.
+ string algorithm = 1; // e.g. "token", "sentence"
+ int32 chunk_size = 2;
+ int32 chunk_overlap = 3;
+ bool clean_text = 4;
+ bool preserve_urls = 5;
+ // For semantic chunking (topic boundaries via embeddings).
+ optional SemanticChunkingConfig semantic_config = 6;
+}
+
+message SemanticChunkingConfig {
+ float similarity_threshold = 1;
+ int32 percentile_threshold = 2;
+ int32 min_chunk_sentences = 3;
+ int32 max_chunk_sentences = 4;
+}
+
+enum POSTagFormat {
+ POS_TAG_FORMAT_UNSPECIFIED = 0;
+ UD = 1;
+ PENN = 2;
+ CUSTOM = 3;
+}
+
+enum InferenceBackend {
+ INFERENCE_BACKEND_UNSPECIFIED = 0;
+ OPENNLP_ME = 1; // classic *ME
+ ONNX_RUNTIME = 2;
+ ONNX_RUNTIME_GPU = 3; // CUDA etc. via onnxruntime-gpu (opennlp-dl-gpu)
+ // OpenVINO / DJL / other providers are reserved for separate optional modules.
+ reserved 4 to 9;
+ reserved "OPENVINO", "DJL";
+}
+
+message AnalysisProfile {
+ string profile_id = 1;
+ repeated PipelineStep steps = 2;
+ ModelBundleRef model_bundle = 3;
+ POSTagFormat pos_tag_format = 4;
+ repeated string ner_entity_types = 5;
+}
+
+message ModelBundleRef {
+ string bundle_id = 1;
+ map<string, string> component_keys = 2;
+}
+
+message AnalysisOptions {
+ bool include_probabilities = 1;
+ bool clear_adaptive_data = 2;
+ InferenceBackend inference_backend = 3;
+ optional int32 max_text_length = 4;
+ optional string onnx_embedding_model_id = 5;
+}
+
+message ModelDescriptor {
+ string hash = 1;
+ string name = 2;
+ string locale = 3;
+ string component_type = 4;
+ repeated string languages = 5;
+ repeated PipelineStep supported_steps = 6;
+ map<string, string> attributes = 7;
+}
+
+message ModelBundleInfo {
+ string bundle_id = 1;
+ repeated ModelDescriptor models = 2;
+ repeated string supported_languages = 3;
+ repeated PipelineStep supported_steps = 4;
+}
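Note on ChunkingSpec: the fields chunk_size and chunk_overlap suggest a sliding window over units such as tokens. The sketch below is a minimal illustration of that reading, not the server's implementation. The class name TokenChunkSketch, the method chunk, and the rule that consecutive windows share exactly chunk_overlap tokens are assumptions made for this example; the proto does not define these semantics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: one possible interpretation of
// ChunkingSpec.chunk_size / chunk_overlap for algorithm = "token".
public class TokenChunkSketch {

  // Splits tokens into windows of at most chunkSize tokens.
  // Consecutive windows share chunkOverlap tokens (assumed semantics).
  static List<List<String>> chunk(List<String> tokens, int chunkSize, int chunkOverlap) {
    if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
      throw new IllegalArgumentException("need 0 <= chunkOverlap < chunkSize");
    }
    List<List<String>> chunks = new ArrayList<>();
    int stride = chunkSize - chunkOverlap; // how far each window advances
    for (int start = 0; start < tokens.size(); start += stride) {
      int end = Math.min(start + chunkSize, tokens.size());
      chunks.add(new ArrayList<>(tokens.subList(start, end)));
      if (end == tokens.size()) {
        break; // last window reached the end of the input
      }
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<String> tokens = List.of("a", "b", "c", "d", "e", "f", "g");
    // chunk_size = 4, chunk_overlap = 2
    System.out.println(chunk(tokens, 4, 2));
    // prints [[a, b, c, d], [c, d, e, f], [e, f, g]]
  }
}
```

Rejecting chunk_overlap >= chunk_size keeps the stride positive, so the loop always terminates. A real server might instead clamp the value or report a ProcessingDiagnostic.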
diff --git a/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_service.proto b/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_service.proto
new file mode 100644
index 00000000..ddf3a066
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-api/src/main/proto/org/apache/opennlp/grpc/v1/opennlp_service.proto
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+syntax = "proto3";
+
+package org.apache.opennlp.grpc.v1;
+
+option java_package = "org.apache.opennlp.grpc.v1";
+option java_multiple_files = true;
+
+import "org/apache/opennlp/grpc/v1/opennlp_document.proto";
+import "org/apache/opennlp/grpc/v1/opennlp_pipeline.proto";
+
+service OpenNlpAnalysisService {
+ rpc AnalyzeDocument(AnalyzeDocumentRequest) returns (AnalyzeDocumentResponse);
+ rpc GetServiceInfo(GetServiceInfoRequest) returns (GetServiceInfoResponse);
+ rpc ListModelBundles(ListModelBundlesRequest) returns (ListModelBundlesResponse);
+}
+
+message AnalyzeDocumentRequest {
+ OpenNlpDocument document = 1;
+ // Classic single-profile usage.
+ AnalysisProfile profile = 2;
+ AnalysisOptions options = 3;
+ optional string profile_id = 4;
+
+ // Per-chunking-strategy multi-config. Chunking first; for each strategy
+ // the caller names exactly which embedding_model_ids to attach to its chunks.
+ // The server shares the base NLP analysis (sentences etc.) across all entries.
+ repeated ChunkEmbedConfigEntry chunk_embed_configs = 5;
+}
+
+// One chunking strategy + the concrete embedding models to use for the chunks
+// it produces. Corresponds 1:1 to one ChunkEmbeddingGroup in the reply.
+message ChunkEmbedConfigEntry {
+ string config_id = 1; // becomes the group's group_id
+ optional string result_set_name = 2;
+ optional AnalysisProfile profile = 3;
+ optional ChunkingSpec chunking = 4;
+ // The embeddings wanted for *this* chunking strategy's chunks.
+ // Not a blind NxM; explicit per-strategy list.
+ repeated string embedding_model_ids = 5;
+}
+
+message AnalyzeDocumentResponse {
+ OpenNlpDocument document = 1;
+ repeated ProcessingDiagnostic diagnostics = 2;
+}
+
+message ProcessingDiagnostic {
+ PipelineStep step = 1;
+ string message = 2;
+ DiagnosticSeverity severity = 3;
+ optional string component_key = 4;
+}
+
+enum DiagnosticSeverity {
+ DIAGNOSTIC_SEVERITY_UNSPECIFIED = 0;
+ INFO = 1;
+ WARNING = 2;
+ ERROR = 3;
+}
+
+message GetServiceInfoRequest {}
+
+message GetServiceInfoResponse {
+ string opennlp_version = 1;
+ string api_version = 2;
+ repeated string available_profile_ids = 3;
+ repeated PipelineStep supported_steps = 4;
+}
+
+message ListModelBundlesRequest {}
+
+message ListModelBundlesResponse {
+ repeated ModelBundleInfo bundles = 1;
+}
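Note on chunk_embed_configs: the comments on ChunkEmbedConfigEntry say each entry names its own embedding models ("Not a blind NxM") and maps 1:1 to a group in the reply. The sketch below shows how a server could expand such entries into embedding jobs. It uses plain Java collections rather than the generated protobuf classes. The class name ChunkEmbedPlanSketch, the method plan, the "groupId/modelId" job string, and the config and model ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: expands per-strategy embedding lists
// into (group_id, embedding_model_id) jobs without a full cross product.
public class ChunkEmbedPlanSketch {

  // Key: config_id (becomes the group's group_id).
  // Value: the embedding_model_ids requested for that strategy's chunks.
  static List<String> plan(Map<String, List<String>> modelIdsByConfigId) {
    List<String> jobs = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : modelIdsByConfigId.entrySet()) {
      for (String modelId : entry.getValue()) {
        jobs.add(entry.getKey() + "/" + modelId);
      }
    }
    return jobs;
  }

  public static void main(String[] args) {
    // Made-up ids: two chunking strategies, two embedding models overall.
    Map<String, List<String>> configs = new LinkedHashMap<>();
    configs.put("sentences", List.of("model-a", "model-b"));
    configs.put("tokens-512", List.of("model-a"));
    System.out.println(plan(configs));
    // prints [sentences/model-a, sentences/model-b, tokens-512/model-a]
  }
}
```

Two strategies and two models give three jobs here, not the four a blind cross product would schedule, because "tokens-512" asked only for "model-a".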
diff --git a/opennlp-similarity/src/main/java/opennlp/tools/jsmlearning/TreeKernelRunner.java b/opennlp-similarity/src/main/java/opennlp/tools/jsmlearning/TreeKernelRunner.java
index 70ec586d..caf817bc 100644
--- a/opennlp-similarity/src/main/java/opennlp/tools/jsmlearning/TreeKernelRunner.java
+++ b/opennlp-similarity/src/main/java/opennlp/tools/jsmlearning/TreeKernelRunner.java
@@ -122,7 +122,7 @@ svm_learn -t 5 -D 0 learning_file model_file - другой
вÐ
2. svm_classify.exe берет файл с тестовыми примерами, файл с моделью, построенный svm_learn, и записывает результаты обучения в файл predictions_file.
-Запуск: svm_classify example_file model_file predictions_file
+Ð-апуÑ�к: svm_classify example_file model_file predictions_file
Файл имеет тот же формат, что и входные примеры. Образец лежит в архиве на страничке Москитти.
Можно сразу же указывать, к какому классу относится пример (1 или -1 в начале строки). В этом случае точность и полнота оцениваются автоматически. Или ставить там 0.
diff --git a/opennlp-similarity/src/test/resources/sentence_parseObject.csv b/opennlp-similarity/src/test/resources/sentence_parseObject.csv
index c11ec1d1..e1f4622e 100644
--- a/opennlp-similarity/src/test/resources/sentence_parseObject.csv
+++ b/opennlp-similarity/src/test/resources/sentence_parseObject.csv
@@ -254,7 +254,7 @@
"B-NP","B-VP","I-VP","O"
"NNP","VBD","VBG","CC"
"Albert","began","reading","and"
-"The Patriot Post — IRS Target of . the day before the first Sensitive Case
Reports on conservative groups were . Obama said that if not for . ."
+"The Patriot Post - IRS Target of . the day before the first Sensitive Case
Reports on conservative groups were . Obama said that if not for . ."
"B-NP","I-NP","I-NP","I-NP","I-NP","B-PP","B-NP","I-NP","B-PP","B-NP","I-NP","I-NP","I-NP","I-NP","B-PP","B-NP","I-NP","B-VP","B-NP","B-VP","B-SBAR","B-PP","B-NP","I-NP"
"DT","NNP","NNP","NNP","NNP","IN","DT","NN","IN","DT","JJ","NN","NN","NNS","IN","JJ","NNS","VBD","NNP","VBD","IN","IN","RB","IN"
"The","Patriot","Post","IRS","Target","of","the","day","before","the","first","Sensitive","Case","Reports","on","conservative","groups","were","Obama","said","that","if","not","for"
@@ -630,7 +630,7 @@
"B-NP","I-NP","B-VP","B-NP","B-VP","I-VP","B-PRT","B-PP","B-NP","I-NP","I-NP","B-NP","I-NP","B-NP","I-NP","B-PP","B-NP","I-NP","I-NP","B-VP","I-VP","I-VP","I-VP","B-NP","I-NP","B-PP","B-NP","I-NP","B-VP","B-PP","B-NP","B-PP","B-NP","I-NP","I-NP"
"NNP","NNS","VBP","PRP","VBP","VBN","IN","IN","DT","JJ","NN","DT","NN","DT","NN","IN","DT","JJ","NNS","VBN","TO","VB","VB","JJ","NNS","IN","DT","NN","VBZ","IN","NN","IN","VBN","JJ","NN"
"WASHINGTON","Lawmakers","say","they","re","outraged","that","for","the","second","time","this","month","a","member","of","the","armed","forces","assigned","to","help","prevent","sexual","assaults","in","the","military","is","under","investigation","for","alleged","sexual","misconduct"
-"Albert Einstein, (born March 14, 1879, Ulm, Württemberg, Germany—died April
18, 1955, Princeton, New Jersey, U.S "
+"Albert Einstein, (born March 14, 1879, Ulm, Württemberg, Germany-died April
18, 1955, Princeton, New Jersey, U.S "
"B-NP","I-NP","B-VP","B-NP","I-NP","I-NP","I-NP","I-NP","I-NP","B-VP","B-NP","I-NP","I-NP","I-NP","I-NP","I-NP","I-NP","I-NP"
"NNP","NNP","VBN","NNP","CD","CD","NNP","NNP","NNP","VBD","NNP","CD","CD","NNP","NNP","NNP","NNP","NNP"
"Albert","Einstein","born","March","14","1879","Ulm","Württemberg","Germany","died","April","18","1955","Princeton","New","Jersey","U","S"
@@ -858,7 +858,7 @@
"B-NP","B-PP","I-NP","I-NP","I-NP","B-PP","B-NP","B-VP","B-NP","B-ADJP","B-NP","I-NP","B-NP","I-NP","B-PP","B-NP","B-VP","B-NP","B-VP","B-NP","B-PP","B-NP","I-NP","I-NP","B-PP","B-NP","I-NP","I-NP","I-NP","I-NP","I-NP","I-NP","I-NP","I-NP","B-VP","B-NP","I-NP","I-NP","I-NP","I-NP","I-NP","I-NP","B-SBAR","B-NP","B-VP","B-NP","I-NP","I-NP","I-NP","O","B-NP","B-VP","B-NP","I-NP","I-NP","I-NP","I-NP","I-NP","O"
"PRP","CD","CD","NNP","NN","IN","NNS","NNS","NNS","JJ","CD","NNS","CD","NNS","IN","EX","VBP","JJ","VBP","CD","IN","CD","JJ","NNS","IN","NN","NNS","NNS","NN","VBG","NNS","FW","NNP","NNP","VBD","DT","JJ","NN","NN","NNP","NNP","NNP","IN","PRP","VBP","JJ","NNP","NN","NNS","CC","NNP","VBD","NNS","NNP","NNP","NNP","NNP","NNP","CD"
"Item","361","380","Profile","picture","of","djones","djones","djones","active","6","months","3","weeks","ago","There","are","many","buy","one","get","one","free","offers","for","area","restaurants","museums","zoo","sporting","events","etc","Vicki","Todd","wrote","a","new","blog","post","PLEASE","REMEMBER","NOTE","although","we","collect","Swiss","Valley","milk","caps","and","Campbell","s","djones","Rock","Island","Milan","School","District","41"
-"Mark Alexander: Obama's 'IRS Enemies List' — Updated ..."
+"Mark Alexander: Obama's 'IRS Enemies List' - Updated ..."
"B-NP","I-NP","I-NP","B-VP","B-NP","I-NP","B-VP","I-VP"
"NN","NN","NN","VBZ","JJ","NNS","NN","VBN"
"Mark","Alexander","Obama","s","IRS","Enemies","List","Updated"