aiworkerjohns opened a new issue, #3770:
URL: https://github.com/apache/jena/issues/3770
## Status: Done
## Problem
There was no way to get facet counts (value distributions) from a text
search in Jena. Applications needing "show me how many results per category"
had to retrieve all results and count in application code, which is expensive
and doesn't scale.
## Use Case
```mermaid
flowchart LR
Query["luc:facet<br/>'climate change'<br/>fields: category, publisher"]
Cat["category:<br/>Environment (42)<br/>Policy (28)<br/>Science (15)"]
Pub["publisher:<br/>CSIRO (31)<br/>BOM (22)<br/>DCCEEW (12)"]
Query --> Cat
Query --> Pub
```
- Sidebar facet panels in a search UI
- Summary statistics for a dataset collection
- "Browse by" navigation (by theme, by organisation, by type)
## Technical Work (completed)
- `TextFacetPF` — new property function registered under
`urn:jena:lucene:index#facet`
- Returns `(field, value, count)` bindings — one row per facet value
- Uses Lucene `SortedSetDocValuesFacetCounts`, which tallies pre-built
DocValues ordinals for the matching documents instead of loading stored fields
- Supports `maxValues` (caps the number of values returned per field) and
`minCount` (excludes values with fewer matches)
- Supports filtered facets — same JSON filter format as `luc:query`
- `FacetValue` — immutable (value, count) pair
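
To make the `maxValues` and `minCount` semantics concrete, here is a minimal stdlib-only sketch. It is not the `FacetValue` or `TextFacetPF` source; the class names `FacetValueSketch` and `FacetTrimSketch`, the `trim` helper, and the ordering (count descending, then value) are assumptions made for illustration. The sample counts are the `category` values from the use case above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an immutable (value, count) pair.
final class FacetValueSketch {
    private final String value;
    private final long count;

    FacetValueSketch(String value, long count) {
        this.value = value;
        this.count = count;
    }

    String value() { return value; }
    long count() { return count; }

    @Override
    public String toString() { return value + " (" + count + ")"; }
}

final class FacetTrimSketch {
    // Assumed semantics: drop values below minCount, order by count descending
    // (ties broken by value), then keep at most maxValues entries.
    // maxValues == 0 means "no limit", as stated in the issue.
    static List<FacetValueSketch> trim(Map<String, Long> counts, int maxValues, long minCount) {
        List<FacetValueSketch> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                out.add(new FacetValueSketch(e.getKey(), e.getValue()));
            }
        }
        out.sort((a, b) -> {
            int byCount = Long.compare(b.count(), a.count());
            return byCount != 0 ? byCount : a.value().compareTo(b.value());
        });
        if (maxValues > 0 && out.size() > maxValues) {
            return new ArrayList<>(out.subList(0, maxValues));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> category = new LinkedHashMap<>();
        category.put("Environment", 42L);
        category.put("Policy", 28L);
        category.put("Science", 15L);

        System.out.println(trim(category, 0, 1));   // no cap, no floor: all three values
        System.out.println(trim(category, 2, 1));   // capped at the two largest values
        System.out.println(trim(category, 0, 20));  // values with fewer than 20 matches dropped
    }
}
```

Whether the real property function applies `minCount` before or after the `maxValues` cap is not stated in the issue; the sketch filters first.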
**SPARQL interface:**
```sparql
(?field ?value ?count) luc:facet (queryString facetFields filter? maxValues? minCount?)
```
## Effort
Completed. `TextFacetPF` is 354 lines. `FacetValue` is 73 lines.
## Decisions Made
- **Separate PF** from `luc:query` — hits and facets have different
cardinalities. A combined PF would create N*M cartesian product rows.
- **SortedSetDocValues** — efficient counting without document iteration,
but requires KEYWORD fields with `idx:facetable true`
- **`maxValues=0` means all** — consistent with "no limit" semantics
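
The cardinality argument for keeping facets separate from `luc:query` can be illustrated with a small stdlib-only sketch. The class name `CardinalitySketch`, the `crossJoin` helper, and the sample hit names are hypothetical; the point is only that pairing every hit with every facet row yields N*M rows, while two separate result sets yield N + M.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CardinalitySketch {
    // Rows a single combined pattern would produce: each hit paired with each facet row.
    static List<String> crossJoin(List<String> hits, List<String> facetRows) {
        List<String> rows = new ArrayList<>();
        for (String hit : hits) {
            for (String facet : facetRows) {
                rows.add(hit + " | " + facet);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("doc1", "doc2", "doc3", "doc4");   // N = 4
        List<String> facetRows = Arrays.asList(
                "category Environment 42",
                "category Policy 28",
                "category Science 15");                                       // M = 3

        int combined = crossJoin(hits, facetRows).size();   // N * M
        int separate = hits.size() + facetRows.size();      // N + M

        System.out.println("combined rows: " + combined);
        System.out.println("separate rows: " + separate);
    }
}
```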
## Pitfalls / Gotchas
- Only KEYWORD fields with `idx:facetable true` can be faceted — TEXT and
numeric fields are not supported
- Enabling faceting on a field adds ~25% indexing overhead (DocValues built
at write time)
- High-cardinality fields (e.g. URIs) can consume significant memory during
facet collection; use `text:maxFacetHits` to cap it
- Adding `idx:facetable true` to an existing field requires a full reindex