krickert opened a new pull request, #2916:
URL: https://github.com/apache/tika/pull/2916
## Summary
This change replaces the flat `map<string,string>` on `FetchAndParseReply`
with a
typed `org.apache.tika.grpc.v1.ParseResponse`. Parse metadata is mapped from
Tika
`Metadata` into protobuf messages aligned with Tika property interfaces
(PDF, Office,
HTML, and the other supported formats), with Dublin Core normalized at the
response
root and Creative Commons licensing carried on a dedicated field when
present.
The work is split into three Maven modules so clients can depend on the
schema and
generated stubs without pulling in the server, and so mapping logic stays
testable
outside the gRPC layer:
- **tika-grpc-api** — protobuf sources, Java generation, bundled
`FileDescriptorSet`
- **tika-grpc-mapper** — `ParseResponseMapper` and format builders; optional
`ParseResponseDecorator` for future extensions (for example document
outlines)
- **tika-grpc** — existing service; `FetchAndParseReply.parse_response`
replaces
the removed `fields` map
This is a breaking change for gRPC clients that read string keys from
`fields`.
Migration is documented in `tika-grpc-api/README.md` and summarized in
`tika-grpc/README.md`.
## Architecture
### Module dependencies
```mermaid
flowchart TB
subgraph clients [Clients]
C[gRPC / Java clients]
end
subgraph server [tika-grpc]
S[TikaGrpcServerImpl]
P[Tika Pipes workers]
end
subgraph mapper [tika-grpc-mapper]
M[ParseResponseMapper]
B[Format metadata builders]
D[ParseResponseDecorator optional]
end
subgraph api [tika-grpc-api]
PR[parse_response.proto and format protos]
FD[FileDescriptorSet in META-INF]
J[Generated Java stubs]
end
subgraph tika [Tika core]
AD[AutoDetectParser / Pipes]
MD[Metadata]
end
C -->|FetchAndParse| S
S --> P
P --> AD
AD --> MD
S --> M
M --> B
M --> D
M --> PR
B --> PR
PR --> J
PR --> FD
S -->|FetchAndParseReply.parse_response| C
```
### Parse pipeline
```mermaid
sequenceDiagram
participant Client
participant Grpc as TikaGrpcServerImpl
participant Pipes as Tika Pipes
participant Parser as Tika parser
participant Mapper as ParseResponseMapper
Client->>Grpc: FetchAndParseRequest
Grpc->>Pipes: fetch and parse
Pipes->>Parser: parse bytes
Parser-->>Pipes: Metadata, body text
Pipes-->>Grpc: parse result
Grpc->>Mapper: map metadata, body, status
Mapper->>Mapper: detect format oneof
Mapper->>Mapper: build dublin_core
Mapper->>Mapper: optional CC overlay
Mapper-->>Grpc: ParseResponse
Grpc-->>Client: FetchAndParseReply.parse_response
```
### ParseResponse layout
```mermaid
classDiagram
class ParseResponse {
string parse_id
ParseStatus status
ParseContent content
DublinCoreMetadata dublin_core
oneof document_metadata
CreativeCommonsMetadata creative_commons
repeated EmbeddedDocument embedded_docs
}
class ParseContent {
string body
string title
}
class PdfMetadata
class OfficeMetadata
class HtmlMetadata
class ImageMetadata
class GenericMetadata
ParseResponse --> ParseContent
ParseResponse --> DublinCoreMetadata
ParseResponse --> PdfMetadata : pdf
ParseResponse --> OfficeMetadata : office
ParseResponse --> HtmlMetadata : html
ParseResponse --> ImageMetadata : image
ParseResponse --> GenericMetadata : generic
```
## Breaking API change
| Removed | Replacement
|
| ----------------------------------------------- |
--------------------------------------------- |
| `FetchAndParseReply.fields` (field 2, reserved) |
`FetchAndParseReply.parse_response` (field 5) |
Example client reads:
- Body text: `parse_response.content.body`
- Title: `parse_response.content.title` or `parse_response.dublin_core.title`
- PDF producer: `parse_response.pdf.doc_info_producer`
- Parse status: `parse_response.status`
## Modules added or updated
| Path | Notes
|
| ------------------- |
-------------------------------------------------------------------------------------------
|
| `tika-grpc-api/` | ~17 proto files under `org/apache/tika/grpc/v1/`;
buf lint config; descriptor bundle |
| `tika-grpc-mapper/` | Builders ported from prior Pipestream mapper work;
35 unit tests against Tika test fixtures |
| `tika-grpc/` | Depends on api + mapper; `TikaGrpcServerImpl` uses
`ParseResponseMapper` |
| `tika-bom/pom.xml` | Lists new artifacts
|
| Root `pom.xml` | Reactor modules
|
## Test plan
- [ ] `./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test`
- [ ] Confirm `FetchAndParseReply` no longer exposes `fields`; clients read
`parse_response`
- [ ] Parse a PDF and an HTML sample via gRPC; verify typed `pdf` / `html`
oneof and `content.body`
- [ ] Confirm `tika-grpc-api` jar contains
`META-INF/org.apache.tika.grpc.v1.descriptors`
- [ ] Review breaking change note with downstream consumers before release
## Follow-up (not in this PR)
- Outline decorators (`ParseResponseDecorator`) for PDF/HTML heading trees
when proto fields are added
- Expand `ClimateForecastMetadataBuilder` beyond additional_struct mapping
- Downstream consumer updates in separate repositories
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
Thanks for your contribution to [Apache Tika](https://tika.apache.org/)!
Your help is appreciated!
Before opening the pull request, please verify that
* there is an open issue on the [Tika issue
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the
problem or the improvement. We cannot accept pull requests without an issue
because the change wouldn't be listed in the release notes.
* the issue ID (`TIKA-XXXX`)
- is referenced in the title of the pull request
- and placed in front of your commit messages surrounded by square
brackets (`[TIKA-XXXX] Issue or pull request title`)
* commits are squashed into a single one (or few commits for larger changes)
* Tika is successfully built and unit tests pass by running `./mvnw clean
test`
* there should be no conflicts when merging the pull request branch into the
*recent* `main` branch. If there are conflicts, please try to rebase the pull
request branch on top of a freshly pulled `main` branch
* if you add new module that downstream users will depend upon add it to
relevant group in `tika-bom/pom.xml`.
We will be able to faster integrate your pull request if these conditions
are met. If you have any questions how to fix your problem or about using Tika
in general, please sign up for the [Tika mailing
list](http://tika.apache.org/mail-lists.html). Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]