aiworkerjohns opened a new issue, #3768:
URL: https://github.com/apache/jena/issues/3768

   ## Status: Done
   
   ## Problem
   
   The upstream `jena-text` module uses a triple-per-document model: each RDF triple that matches the entity map creates a separate Lucene document. This means:
   
   - Facet values end up on different documents from text values, so combining them requires expensive joins on the entity URI
   - The same entity can produce multiple documents, which causes overcounting
   - Typed fields (numeric, keyword) are not supported; everything is indexed as text
   - Lucene features that require all fields on one document, such as DrillSideways and DocValues, cannot be used
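
   The difference between the two document models can be illustrated without Lucene. The following is a minimal sketch, assuming a hypothetical `Triple` record and plain Java collections standing in for index documents, and assuming all four triples from the Use Case diagram match the entity map. It is not the `jena-text` implementation.

   ```java
   import java.util.ArrayList;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   final class DocumentModels {
       // Hypothetical stand-in for an RDF triple; not a Jena class.
       record Triple(String subject, String predicate, String object) {}

       // Triple-per-document: every matching triple becomes its own "document".
       static int triplePerDocumentCount(List<Triple> triples) {
           return triples.size();
       }

       // Entity-per-document: all values for one subject are collected into a
       // single "document", keyed by predicate and allowing multiple values.
       static Map<String, Map<String, List<String>>> entityDocuments(List<Triple> triples) {
           Map<String, Map<String, List<String>>> docs = new LinkedHashMap<>();
           for (Triple t : triples) {
               docs.computeIfAbsent(t.subject(), s -> new LinkedHashMap<>())
                   .computeIfAbsent(t.predicate(), p -> new ArrayList<>())
                   .add(t.object());
           }
           return docs;
       }

       // The four triples shown in the Use Case diagram.
       static List<Triple> sample() {
           return List.of(
               new Triple("ex:book1", "rdf:type", "ex:Book"),
               new Triple("ex:book1", "rdfs:label", "Machine Learning"),
               new Triple("ex:book1", "ex:category", "Technology"),
               new Triple("ex:book1", "ex:year", "2024"));
       }

       public static void main(String[] args) {
           List<Triple> triples = sample();
           System.out.println("triple-per-document: " + triplePerDocumentCount(triples));
           System.out.println("entity-per-document: " + entityDocuments(triples).size());
       }
   }
   ```

   With one document per entity, a query that matches `ex:book1` on several properties still counts it once, and the facet value sits on the same document as the text value.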
   
   ## Use Case
   
   ```mermaid
   flowchart LR
       subgraph RDF["RDF Triples"]
           t1["ex:book1 rdf:type ex:Book"]
           t2["ex:book1 rdfs:label 'Machine Learning'"]
           t3["ex:book1 ex:category 'Technology'"]
           t4["ex:book1 ex:year 2024"]
       end
   
       subgraph Doc["Lucene Document"]
           d1["uri: ex:book1"]
           d2["docType: Book"]
           d3["title: 'Machine Learning' (TEXT)"]
           d4["category: 'Technology' (KEYWORD)"]
           d5["year: 2024 (INT)"]
       end
   
       RDF -- "SHACL shape defines mapping" --> Doc
   ```
   
   Applications that index RDF entities with typed properties, such as data catalogues, knowledge graphs, and document repositories, can use SHACL shapes as both the data model and the index definition.
   
   ## Technical Work (completed)
   
   - `ShaclIndexMapping` — parsed data model: `IndexProfile` (shape), 
`FieldDef` (field), `FieldType` enum (TEXT, KEYWORD, INT, LONG, DOUBLE)
   - `ShaclIndexAssembler` — parses `text:shapes` RDF config into 
`ShaclIndexMapping`
   - `TextIndexLucene` extended with entity document building 
(`docFromMapping()`) and typed field support
   - `Entity` extended with `addValue()` for multi-valued fields
   - `TextIndexConfig` extended with SHACL mapping and facet field configuration
   - Assembler wiring: `TextIndexLuceneAssembler` detects `text:shapes`, and `TextDatasetAssembler` auto-creates a SHACL producer
   
   **Field types:**
   
   | Type | Lucene Field | DocValues | Use |
   |------|-------------|-----------|-----|
   | TEXT | TextField | — | Full-text search |
   | KEYWORD | StringField | SortedSetDocValues | Exact match, faceting |
   | INT | IntPoint + StoredField | NumericDocValues | Range queries |
   | LONG | LongPoint + StoredField | NumericDocValues | Range queries |
   | DOUBLE | DoublePoint + StoredField | NumericDocValues | Range queries |
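
   To make the typed mapping concrete, here is a minimal sketch of how a field definition could turn literal values into typed document fields. The `FieldType` constants are the ones listed above; the shape of `FieldDef` (name, predicate, type), the `convert` and `buildDocument` helpers, and the use of a plain `Map` in place of a Lucene document are all assumptions made for illustration, not the actual `ShaclIndexMapping` API.

   ```java
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   final class TypedFieldSketch {
       // Type names come from the issue; everything else here is illustrative.
       enum FieldType { TEXT, KEYWORD, INT, LONG, DOUBLE }

       // Hypothetical field definition: index field name, source predicate, type.
       record FieldDef(String name, String predicate, FieldType type) {}

       // Convert a literal's lexical form to the Java value a typed field would hold.
       static Object convert(FieldType type, String lexical) {
           return switch (type) {
               case TEXT, KEYWORD -> lexical;
               case INT -> Integer.valueOf(lexical);
               case LONG -> Long.valueOf(lexical);
               case DOUBLE -> Double.valueOf(lexical);
           };
       }

       // Build one entity "document" from predicate/value pairs using the field definitions.
       static Map<String, Object> buildDocument(String uri, String docType,
                                                List<FieldDef> fields,
                                                Map<String, String> properties) {
           Map<String, Object> doc = new LinkedHashMap<>();
           doc.put("uri", uri);
           doc.put("docType", docType);
           for (FieldDef f : fields) {
               String lexical = properties.get(f.predicate());
               if (lexical != null) {
                   doc.put(f.name(), convert(f.type(), lexical));
               }
           }
           return doc;
       }

       // The book entity from the Use Case diagram.
       static Map<String, Object> sample() {
           List<FieldDef> fields = List.of(
               new FieldDef("title", "rdfs:label", FieldType.TEXT),
               new FieldDef("category", "ex:category", FieldType.KEYWORD),
               new FieldDef("year", "ex:year", FieldType.INT));
           Map<String, String> properties = Map.of(
               "rdfs:label", "Machine Learning",
               "ex:category", "Technology",
               "ex:year", "2024");
           return buildDocument("ex:book1", "Book", fields, properties);
       }

       public static void main(String[] args) {
           System.out.println(sample());
       }
   }
   ```

   In the real index each converted value would be written as the Lucene field and DocValues combination given in the table, rather than stored in a map.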
   
   ## Effort
   
   Completed. 16 files changed, +2,594 lines of Java source.
   
   ## Decisions Made
   
   - **SHACL vocabulary** rather than extending `text:entityMap`: it maps naturally to "entity with typed properties" without adding complexity to the existing config format
   - **Coexistence**: `text:shapes` and `text:entityMap` are mutually exclusive per index, but both code paths remain available, so no migration is forced
   - **Discriminator field**: a `docType` StringField holding the local name of the target class, for multi-shape indexes
   - **No jena-shacl dependency**: the config is read using the standard Jena RDF API
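
   As an illustration of the discriminator value, the sketch below derives a `docType` string from a target class IRI. The `localName` helper is hypothetical and simply takes the text after the last `#` or `/`; Jena's own local-name splitting may behave differently, and the example IRIs are made up.

   ```java
   final class DocTypeSketch {
       // Hypothetical helper: the part of the IRI after the last '#' or '/'.
       static String localName(String iri) {
           int cut = Math.max(iri.lastIndexOf('#'), iri.lastIndexOf('/'));
           return cut < 0 ? iri : iri.substring(cut + 1);
       }

       public static void main(String[] args) {
           // Each shape's target class yields the value stored in the docType field.
           System.out.println(localName("http://example.org/ns#Book"));
           System.out.println(localName("http://example.org/schema/Person"));
       }
   }
   ```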
   
   ## Pitfalls / Gotchas
   
   - Switching between modes requires a full reindex because the document structures are fundamentally different
   - `FacetsConfig` must be consistent between index time and read time; changing a field from single-valued to multi-valued requires a reindex
   - Shapes with multiple `sh:targetClass` values have limited test coverage

