aiworkerjohns opened a new issue, #3768:
URL: https://github.com/apache/jena/issues/3768
## Status: Done
## Problem
The upstream `jena-text` module uses a triple-per-document model: each RDF
triple that matches the entity map creates a separate Lucene document. As a result:
- Facet values end up on different documents from text values, so combining
them requires expensive joins on the entity URI
- The same entity can produce multiple documents, which causes overcounting
- Typed fields (numeric, keyword) are not supported; every value is indexed as text
- Lucene's advanced features (DrillSideways, DocValues) that require all
fields on one document cannot be used
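The overcounting problem can be illustrated with a minimal sketch that uses plain Java collections as stand-ins for Lucene documents. Everything here is hypothetical and for illustration only: the class name, the sample triples, the `altTitle` field, and the substring matching are assumptions, not code from `jena-text`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maps of field name -> values stand in for Lucene documents.
class DocumentModelSketch {

    // Sample triples as {subject, field, value}; the field names are illustrative.
    static final List<String[]> TRIPLES = Arrays.asList(
        new String[] {"ex:book1", "title", "Machine Learning"},
        new String[] {"ex:book1", "altTitle", "Statistical Learning Methods"},
        new String[] {"ex:book1", "category", "Technology"},
        new String[] {"ex:book2", "title", "Gardening Basics"},
        new String[] {"ex:book2", "category", "Hobbies"});

    // Triple-per-document: every triple becomes its own single-field document.
    static List<Map<String, List<String>>> triplePerDocument() {
        List<Map<String, List<String>>> docs = new ArrayList<>();
        for (String[] t : TRIPLES) {
            Map<String, List<String>> doc = new LinkedHashMap<>();
            doc.put("uri", Arrays.asList(t[0]));
            doc.put(t[1], Arrays.asList(t[2]));
            docs.add(doc);
        }
        return docs;
    }

    // Entity-per-document: all triples sharing a subject merge into one document.
    static List<Map<String, List<String>>> entityPerDocument() {
        Map<String, Map<String, List<String>>> bySubject = new LinkedHashMap<>();
        for (String[] t : TRIPLES) {
            Map<String, List<String>> doc = bySubject.computeIfAbsent(t[0], s -> {
                Map<String, List<String>> d = new LinkedHashMap<>();
                d.put("uri", new ArrayList<>(Arrays.asList(s)));
                return d;
            });
            doc.computeIfAbsent(t[1], f -> new ArrayList<>()).add(t[2]);
        }
        return new ArrayList<>(bySubject.values());
    }

    // Counts documents in which any field other than "uri" contains the term.
    static long countHits(List<Map<String, List<String>>> docs, String term) {
        String needle = term.toLowerCase();
        return docs.stream()
            .filter(doc -> doc.entrySet().stream()
                .filter(e -> !e.getKey().equals("uri"))
                .flatMap(e -> e.getValue().stream())
                .anyMatch(v -> v.toLowerCase().contains(needle)))
            .count();
    }

    public static void main(String[] args) {
        // ex:book1 matches "learning" through two different triples.
        System.out.println(countHits(triplePerDocument(), "learning")); // 2 documents
        System.out.println(countHits(entityPerDocument(), "learning")); // 1 document
    }
}
```

In the merged form the `category` value also sits on the same document as the matching title, so a facet count would not need a join on the URI.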
## Use Case
```mermaid
flowchart LR
subgraph RDF["RDF Triples"]
t1["ex:book1 rdf:type ex:Book"]
t2["ex:book1 rdfs:label 'Machine Learning'"]
t3["ex:book1 ex:category 'Technology'"]
t4["ex:book1 ex:year 2024"]
end
subgraph Doc["Lucene Document"]
d1["uri: ex:book1"]
d2["docType: Book"]
d3["title: 'Machine Learning' (TEXT)"]
d4["category: 'Technology' (KEYWORD)"]
d5["year: 2024 (INT)"]
end
RDF -- "SHACL shape defines mapping" --> Doc
```
Applications that need to index RDF entities with typed properties — data
catalogues, knowledge graphs, document repositories — can use SHACL shapes as
both data model and index definition.
## Technical Work (completed)
- `ShaclIndexMapping` — parsed data model: `IndexProfile` (shape),
`FieldDef` (field), `FieldType` enum (TEXT, KEYWORD, INT, LONG, DOUBLE)
- `ShaclIndexAssembler` — parses `text:shapes` RDF config into
`ShaclIndexMapping`
- `TextIndexLucene` extended with entity document building
(`docFromMapping()`) and typed field support
- `Entity` extended with `addValue()` for multi-valued fields
- `TextIndexConfig` extended with SHACL mapping and facet field configuration
- Assembler wiring: `TextIndexLuceneAssembler` detects `text:shapes`,
`TextDatasetAssembler` auto-creates SHACL producer
**Field types:**
| Type | Lucene Field | DocValues | Use |
|------|-------------|-----------|-----|
| TEXT | TextField | — | Full-text search |
| KEYWORD | StringField | SortedSetDocValues | Exact match, faceting |
| INT | IntPoint + StoredField | NumericDocValues | Range queries |
| LONG | LongPoint + StoredField | NumericDocValues | Range queries |
| DOUBLE | DoublePoint + StoredField | NumericDocValues | Range queries |
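To show why typed fields matter for range queries, here is a small sketch that converts lexical values according to a field type. The enum constants mirror the `FieldType` names listed above, but the `parse` helper and the class name are assumptions made for illustration; the actual conversion in `TextIndexLucene` may differ.

```java
// Hypothetical sketch of type-directed value conversion; not the actual jena-text code.
class TypedFieldSketch {

    // Constant names taken from the issue text; the enum body itself is assumed.
    enum FieldType { TEXT, KEYWORD, INT, LONG, DOUBLE }

    // Converts a literal's lexical form into a Java value suited to the field type.
    static Object parse(FieldType type, String lexical) {
        switch (type) {
            case INT:
                return Integer.parseInt(lexical);
            case LONG:
                return Long.parseLong(lexical);
            case DOUBLE:
                return Double.parseDouble(lexical);
            default:
                // TEXT and KEYWORD keep the lexical form unchanged.
                return lexical;
        }
    }

    public static void main(String[] args) {
        // As strings, "999" sorts after "2024", so a textual range check is wrong.
        System.out.println("999".compareTo("2024") > 0);
        // As integers the ordering is the expected one.
        int a = (Integer) parse(FieldType.INT, "999");
        int b = (Integer) parse(FieldType.INT, "2024");
        System.out.println(a < b);
    }
}
```

When every value is indexed as text, comparisons fall back to lexicographic order, which is why the numeric types in the table are paired with point fields for range queries.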
## Effort
Completed. 16 files changed, +2,594 lines of Java source.
## Decisions Made
- **SHACL vocabulary** over extending `text:entityMap`: it maps naturally to
"entity with typed properties" without adding complexity to the existing config
format
- **Coexistence**: `text:shapes` and `text:entityMap` are mutually
exclusive per index, but both code paths remain in place, so existing
configurations are not forced to migrate
- **Discriminator field**: a `docType` StringField holds the local name of the
target class, which distinguishes documents in multi-shape indexes
- **No jena-shacl dependency**: the config is read using the standard Jena RDF API
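The discriminator decision can be sketched as follows. This is a simplified assumption of how a `docType` value might be derived from a target class IRI: it cuts at the last `#` or `/`, whereas Jena's own local-name splitting follows XML name rules, and the class and method names here are hypothetical.

```java
// Hypothetical sketch: derives a docType value from a target class IRI.
class DocTypeSketch {

    // Simplified local-name extraction: the text after the last '#' or '/'.
    static String localName(String iri) {
        int cut = Math.max(iri.lastIndexOf('#'), iri.lastIndexOf('/'));
        return iri.substring(cut + 1);
    }

    public static void main(String[] args) {
        System.out.println(localName("http://example.org/ns#Book"));
        System.out.println(localName("http://example.org/ns/Article"));
    }
}
```

Under this reading, two target classes that share a local name in different namespaces would map to the same `docType` value, which may be worth keeping in mind when combining vocabularies in one index.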
## Pitfalls / Gotchas
- Switching between modes requires a full reindex because the document
structures are fundamentally different
- `FacetsConfig` must be consistent between index time and query time; changing a
field from single-valued to multi-valued requires a reindex
- Shapes with multiple `sh:targetClass` values have limited test coverage
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]