[dspace-tech] Facet filter matching fails for diacritics (â, special chars) in sidebar

Muhep Atasoy Wed, 05 Nov 2025 23:50:45 -0800


Hi all,


I'm running into a problem with sidebar facet filtering (browse-by-value / 
filter list) in a DSpace 8.x (Solr-backed) installation. The UI displays 
facet values correctly (with diacritics) but searching in the facet input 
(the small search box in the sidebar) behaves like a strict exact-match: 
when I type an ASCII version or remove diacritics, items that only differ 
by diacritics do not match. Example:

   - Stored/display value: Hamzazâde Esad
   - If the user types hamzazade in the facet search box, it does not return
     the expected facet value or matching results.
   
What I found and tried

   - The dynamic field for sidebar facets is *_filter:
   
<dynamicField name="*_filter" type="keywordFilter" indexed="true" 
stored="true" multiValued="true" omitNorms="true" /> 

   - Current keywordFilter fieldType (originally):
   
<fieldType name="keywordFilter" class="solr.TextField" 
sortMissingLast="true" omitNorms="true">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
    </analyzer>
</fieldType> 

This keeps the stored/display value exact (good), but facet search behaves 
like exact-match because the index token is the full original string.

   - I attempted to add Turkish/ICU folding to the analyzer. When I added
     folding to the index analyzer, the displayed facet strings came out
     lowercased and stripped of diacritics (bad for presentation). So I tried
     to split the behavior between index and query analyzers:
   
<fieldType name="keywordFilter" class="solr.TextField" 
sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" 
maxGramSize="15"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

This left the displayed values untouched, and query normalization runs, but
the problematic diacritic letter â (and some other special characters) still
does *not* reliably match when the user types ASCII a. I tested the analyzer
output in the Solr Admin Analysis page: the query token kâmil becomes kamil,
but the facets still don't return the expected values.
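
With the split analyzers above, the index still holds the single unfolded token (for example Kâmil), so a folded query token such as kamil, or any of its edge n-grams, cannot equal it. For comparison, here is a minimal sketch of a symmetric variant that applies the same folding at index and query time. The filter chain is an assumption that mirrors the query analyzer above, minus the n-gram filter. Because Solr builds facet values from indexed terms rather than stored values, this variant would likely match ASCII input but would also return folded strings as facet labels, which is the presentation problem described earlier. It would also need a full reindex.

```xml
<!-- Sketch only: a single analyzer, used at both index and query time. -->
<!-- Assumption: the ICU filter needs Solr's analysis-extras module on the classpath. -->
<fieldType name="keywordFilter" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Keep the whole value as one token, as in the original type. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Turkish-aware lowercasing, then diacritic and case folding. -->
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```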

Important constraints:

   - DSpace code (the UI / server) expects author_filter style fields, and
     currently both search and display use the same *_filter field for
     facets. I cannot easily change DSpace to use a separate *_search field.
   - I cannot use WhitespaceTokenizer at query time because the index uses
     KeywordTokenizer (index tokens are whole values) and the mismatch
     causes no hits.
   
Questions / requests for help

   1. Does the DSpace sidebar facet search use the same keywordFilter
      fieldType defined above, or does DSpace apply additional query-time
      processing before facet matching? (Which fieldType or query param does
      the small facet search box use?)
   2. Am I missing a Solr parameter that controls how facet search (the small
      value search) is executed, so that I can inject folding/normalization?
      (e.g. facet.contains, facet.prefix, facet.method,
      facet.contains.ignoreCase, or other special params?)
   3. Has anyone solved diacritics matching in the sidebar facets without
      losing the displayed original strings? Which pattern is best practice:
      copyField, a multi-field approach, a mapping char filter, or a
      client-side solution?
   4. If multi-field (search + display) is the recommended approach but DSpace
      insists on *_filter, is there a recommended DSpace config or
      XSL/template hook to let the facet UI show the stored display value
      while facet search works against a different indexed token?
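
To make the copyField / multi-field idea from question 3 concrete, a rough sketch follows. The names keywordFilterFolded, *_folded, and author_filter_folded are hypothetical and not part of the stock DSpace schema. The sketch covers only the Solr side and assumes something in DSpace would then search the folded field, which is exactly the open point in question 4.

```xml
<!-- Hypothetical folded type: whole-value token, lowercased, diacritics removed. -->
<!-- Assumption: the ICU filter needs Solr's analysis-extras module on the classpath. -->
<fieldType name="keywordFilterFolded" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical search-only companion fields; display keeps using *_filter. -->
<dynamicField name="*_folded" type="keywordFilterFolded" indexed="true"
              stored="false" multiValued="true"/>

<!-- copyField copies the raw input value, before any analysis is applied. -->
<copyField source="author_filter" dest="author_filter_folded"/>
```

The display field stays untouched in this layout, so facet labels keep their diacritics. The cost is a second indexed field and a reindex.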
   
What I can provide if helpful:

   - sample schema.xml snippets
   - Solr analysis outputs (index vs query) for sample values like kâmil,
     kâmil\n|||\nKâmil etc.
   - steps I used to test in Solr Admin (analysis page) and example queries
     that fail.
   
Thanks in advance. Any pointers, config snippets, or DSpace-specific
guidance would be appreciated.

Muhep

