Hi all,

+1 from me on this direction. A document-centric, language-neutral contract is 
a clear step up from chaining the three string-based sandbox RPCs, and Martin's 
suggestion of a neutral opennlp-core Document interface (kept separate from the 
protobuf wire type) seems like the right way to avoid letting gRPC details leak 
into the core API.

On the build question (3): strong +1 for staying with Maven. The whole project 
is Maven, protobuf-maven-plugin covers proto generation cleanly, and 
introducing Gradle just for this module would add a second build system to 
maintain for no real benefit.

A few gaps I noticed in the proposal that might be worth resolving before the 
proto gets locked in Phase 2:

1. Embeddings aren't actually in the proto. Kristian's email leads with GPU 
embeddings (CUDA/OpenVINO hot-swap) as a primary goal, but the sketched proto 
has no embedding message and no embedding PipelineStep, and SentenceVectorsDL 
isn't represented. Either embeddings are in v1, in which case OpenNlpDocument
needs a vector field and PipelineStep needs an entry, or they're deferred and
the non-goals should say so. Right now it reads as a primary goal that the
contract doesn't cover.

2. Chunking is listed as a gap but not added. The "Gap" section calls out 
NER/chunking/embeddings as missing, but the proto only adds NER. Chunking is a 
standard OpenNLP tool, so it's worth an explicit v1-or-deferred decision.

3. ModelBundleRef is underspecified. A bare bundle_id gives clients no way to 
discover which bundles/profiles exist or what languages and steps they support. 
GetServiceInfo returns profile IDs but no bundle metadata. IMHO, it might be
worth having GetServiceInfo enumerate bundles with their supported
steps/languages.

4. Partial-failure semantics are undefined. ProcessingDiagnostic exists, but 
it's not stated whether a failed step fails the whole AnalyzeDocument call or 
returns a partial document with an ERROR diagnostic. That affects every client, 
so worth nailing down early.

5. clear_adaptive_data vs. the stateless contract. That option implies adaptive 
state carried across calls, but the contract is described as 1:1 stateless 
documents. Worth clarifying what adaptive data means in this model.
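
To make gaps 1, 3 and 4 a bit more concrete, here is a rough proto3 sketch of
what the additions could look like. Apart from the package name and bundle_id,
every message, enum, field name and field number below is a placeholder I made
up for illustration; none of it is taken from the ticket or the sandbox branch,
and the real shape is for the Phase 1 discussion to decide.

```protobuf
syntax = "proto3";

// Package name as proposed in OPENNLP-1833; everything else in this file
// is a hypothetical sketch, not the proposed contract.
package org.apache.opennlp.grpc.v1;

// Gap 1 (hypothetical): one embedding vector per sentence, which
// OpenNlpDocument could carry as a repeated field if embeddings are in v1.
message SentenceEmbedding {
  uint32 sentence_index = 1;   // index of the sentence within the document
  repeated float values = 2;   // the embedding vector itself
  string model_id = 3;         // which embedding model produced the vector
}

// Gap 3 (hypothetical): per-bundle metadata that GetServiceInfo could
// return, so clients can discover what a bare bundle_id actually offers.
message ModelBundleInfo {
  string bundle_id = 1;                 // same ID a ModelBundleRef would carry
  repeated string languages = 2;        // languages the bundle supports
  repeated string supported_steps = 3;  // names of the PipelineStep values covered
}

// Gap 4 (hypothetical): an explicit per-request policy for what happens
// when a single pipeline step fails.
enum FailurePolicy {
  FAILURE_POLICY_UNSPECIFIED = 0;     // server default applies
  FAILURE_POLICY_FAIL_FAST = 1;       // the whole AnalyzeDocument call fails
  FAILURE_POLICY_RETURN_PARTIAL = 2;  // partial document plus an ERROR diagnostic
}
```

Whether the failure behaviour is a request option like this or simply a fixed,
documented rule matters less than having it written down in the contract.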

None of these block starting in the sandbox - they're Phase 1/2 proto-shape 
questions. Overall the direction has a lot of potential.

Regards
Richard

> On 05.06.2026 at 15:13, Martin Wiesner <[email protected]> wrote:
> 
> Hi Kristian,
> 
> thx for the initiative which I’d like to support hereby. I’ve been 'off in 
> nature' for some days recently and thus my answer is delayed.
> 
> A document-centric approach is well-motivated in the Jira. For reasons of 
> simplicity (and neutrality) we could add an opennlp-core API interface 
> ’Document’.
> This would allow us (a) to model what a document is composed of, and (b) 
> other components to (re-)implement it according to related requirements / 
> ideas, such as outlined in OPENNLP-1833 by you („OpenNLPDocument“, 
> „AnalyzeDocument“).
> 
> If you want a core-api addition, say for ‚Document‘ or the like, keep in mind 
> we can integrate it with the next 3.0.0-M4. 
> If this is not required / necessary in the first place: that is also fine - 
> we can refactor / extract later on.
> Currently, as it stands, we’re planning to cut a release at the end of June 
> or early July. If you want to start things by 
> 
> Working first, in the opennlp-sandbox and evolving the current state seems 
> reasonable, target being the core project in future cycles.
> 
> Proposed package naming is fine from my pov, cf. JIRA issue.
> 
> My views on your questions in the JIRA description:
> 
> ad (1): go for retain in legacy pkg
> ad (2): can imagine both paths, more likely is 3.1.x - as it feels 3.0.x is 
> at the door soon (over or at the end of the summer 2026).
> ad (3): stay with Maven (plz) if this is possible. Personally (!), not a big 
> fan of Gradle… - personally speaking here, no strong opinion 
> 
> Happy about others’ comments.
> 
> Thanks for the ideas and precise outline of 'em. The direction has a lot of 
> potential.
> 
> Best
> Martin | mawiesne
> 
> 
>> On 22.05.2026 at 12:27, Kristian Rickert <[email protected]> wrote:
>> 
>> Hi OpenNLP devs,
>> 
>> I've opened OPENNLP-1833 to propose evolving the opennlp-sandbox gRPC
>> POC into ASF-native modules with a canonical OpenNlpDocument message and
>> a primary AnalyzeDocument RPC (org.apache.opennlp.grpc.v1).
>> 
>> JIRA: https://issues.apache.org/jira/browse/OPENNLP-1833
>> 
>> Background: OpenNLP today is primarily in-process (API, CLI, UIMA).
>> The sandbox POC (opennlp-grpc) exposes three separate string-based
>> services; the ticket proposes a unified document contract and server-side
>> pipeline orchestration.
>> 
>> My primary goal is to integrate other language libraries through a gRPC
>> contract.  This will allow the server to work with OpenNLP.  OpenNLP can
>> use the client stubs to get data from the server, and the server would also
>> use OpenNLP to expose the API to other languages.
>> 
>> To be more specific: I'd like to introduce options that also utilize the
>> GPU more directly for embeddings.  CUDA for nvidia cards and OpenVINO for
>> Intel cards.  This would create a middle interface that can hot-swap on the
>> server side.  Of course, these interfaces would also be their own builds.
>> 
>> I'm planning to work on this in phases as outlined in the ticket:
>> 
>>  - Phase 0/1: community RFC + design doc / full .proto definitions
>>  - Phase 2+: implementation (will work on this while we discuss phase 1,
>>  but open for changes)
>> 
>> I'd appreciate feedback on a few points called out in the JIRA ticket.
>> 
>> I can get a prototype up within a couple of weeks.
>> 
>> Sandbox reference:
>> 
>> https://github.com/apache/opennlp-sandbox/tree/OPENNLP-1833-grpc-expansion
>> 
>> I'll post design updates and any draft .proto / docs to the ticket.
>> Comments on the JIRA or replies to this thread are welcome although JIRA is
>> preferred.
>> 
>> Thanks,
>> Kristian
> 
