aiworkerjohns opened a new issue, #3770:
URL: https://github.com/apache/jena/issues/3770

   ## Status: Done
   
   ## Problem
   
   There was no way to get facet counts (value distributions) from a text 
search in Jena. Applications needing "show me how many results per category" 
had to retrieve all results and count in application code, which is expensive 
and doesn't scale.
   
   ## Use Case
   
   ```mermaid
   flowchart LR
       Query["luc:facet<br/>'climate change'<br/>fields: category, publisher"]
       Cat["category:<br/>Environment (42)<br/>Policy (28)<br/>Science (15)"]
       Pub["publisher:<br/>CSIRO (31)<br/>BOM (22)<br/>DCCEEW (12)"]
   
       Query --> Cat
       Query --> Pub
   ```
   
   - Sidebar facet panels in a search UI
   - Summary statistics for a dataset collection
   - "Browse by" navigation (by theme, by organisation, by type)
   
   ## Technical Work (completed)
   
   - `TextFacetPF` — new property function registered under 
`urn:jena:lucene:index#facet`
   - Returns `(field, value, count)` bindings — one row per facet value
   - Uses Lucene `SortedSetDocValuesFacetCounts`, which counts from pre-built 
DocValues ordinals instead of loading stored documents
   - Supports `maxValues` (cap values per field), `minCount` (exclude rare 
values)
   - Supports filtered facets — same JSON filter format as `luc:query`
   - `FacetValue` — immutable (value, count) pair
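
The row shape and the `maxValues` / `minCount` options can be modelled in a few lines of Python. This is a minimal sketch of the semantics described above, not the `TextFacetPF` implementation: the `facet_rows` helper, the Python `FacetValue` stand-in, the sample hits, the count-descending ordering, and applying `minCount` before `maxValues` are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FacetValue:
    """Illustrative stand-in for an immutable (value, count) pair."""
    value: str
    count: int


def facet_rows(docs, fields, max_values=0, min_count=0):
    """Return (field, value, count) rows, one row per facet value.

    docs: matching hits as dicts mapping field name -> keyword value.
    max_values: cap on values per field; 0 means no limit.
    min_count: drop values that occur fewer than this many times.
    """
    rows = []
    for field in fields:
        counts = Counter(doc[field] for doc in docs if field in doc)
        # Assumption: most frequent first, ties broken alphabetically.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        # Assumption: the min_count filter runs before the max_values cap.
        kept = [FacetValue(v, c) for v, c in ranked if c >= min_count]
        if max_values > 0:
            kept = kept[:max_values]
        rows.extend((field, fv.value, fv.count) for fv in kept)
    return rows


# Hypothetical hits for a text query; the last one has no publisher.
hits = [
    {"category": "Environment", "publisher": "CSIRO"},
    {"category": "Environment", "publisher": "BOM"},
    {"category": "Policy", "publisher": "CSIRO"},
    {"category": "Science"},
]

for row in facet_rows(hits, ["category", "publisher"], max_values=2):
    print(row)
```

With `max_values=0` every value is returned, which matches the "no limit" reading of `maxValues=0`.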
   
   **SPARQL interface:**
   
   ```sparql
   (?field ?value ?count) luc:facet (queryString facetFields filter? maxValues? minCount?)
   ```
   
   ## Effort
   
   Completed. `TextFacetPF` is 354 lines. `FacetValue` is 73 lines.
   
   ## Decisions Made
   
   - **Separate PF** from `luc:query` — hits and facets have different 
cardinalities. A combined PF would create N*M cartesian product rows.
   - **SortedSetDocValues**: counts from per-document ordinals without loading 
stored documents, but requires KEYWORD fields with `idx:facetable true`
   - **`maxValues=0` means all** — consistent with "no limit" semantics
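
The cardinality argument for keeping the two property functions separate can be checked with a small Python sketch. The hit and facet data here are made up for illustration, and the code only models row counts; it is not how Jena evaluates property functions.

```python
from itertools import product

# N = 4 hypothetical search hits.
hits = ["doc1", "doc2", "doc3", "doc4"]

# M = 3 hypothetical facet rows in (field, value, count) form.
facets = [
    ("category", "Environment", 2),
    ("category", "Policy", 1),
    ("category", "Science", 1),
]

# A combined pattern would pair every hit with every facet row: N * M rows.
combined_rows = list(product(hits, facets))

# Separate patterns return each result set once: N + M rows in total.
separate_row_count = len(hits) + len(facets)

print(len(combined_rows), separate_row_count)
```

The gap grows quickly: 1,000 hits with 50 facet values would mean 50,000 combined rows against 1,050 separate ones.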
   
   ## Pitfalls / Gotchas
   
   - Only KEYWORD fields with `idx:facetable true` can be faceted — TEXT and 
numeric fields are not supported
   - Enabling faceting on a field adds ~25% indexing overhead (DocValues built 
at write time)
   - High-cardinality fields (e.g. URIs) can consume significant memory during 
facet collection; use `text:maxFacetHits` to cap it
   - Adding `idx:facetable true` to an existing field requires a full reindex


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

