Hi all,
I'm running into a problem with sidebar facet filtering (browse-by-value /
filter list) in a DSpace 8.x (Solr-backed) installation. The UI displays
facet values correctly (with diacritics), but searching in the facet input
(the small search box in the sidebar) behaves like a strict exact match:
when I type the ASCII form of a value, without diacritics, facet values that
differ from my input only by diacritics do not match. Example:
- Stored/display value: Hamzazâde Esad
- If the user types hamzazade in the facet search box, it does not return
  the expected facet value or matching results.
What I found and tried
- The dynamic field for sidebar facets is *_filter:

    <dynamicField name="*_filter" type="keywordFilter" indexed="true"
                  stored="true" multiValued="true" omitNorms="true" />
- Current keywordFilter fieldType (originally):

    <fieldType name="keywordFilter" class="solr.TextField"
               sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
      </analyzer>
    </fieldType>
This keeps the stored/display value exact (good), but facet search behaves
like exact-match because the index token is the full original string.
- I attempted to add Turkish/ICU folding to the analyzer. When I added
  folding to the index analyzer, the displayed facet strings came back
  lowercased and stripped of diacritics (bad for presentation), presumably
  because the facet values are read from the indexed terms rather than from
  the stored value. So I tried to split the behavior between the index and
  query analyzers:
    <fieldType name="keywordFilter" class="solr.TextField"
               sortMissingLast="true" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TurkishLowerCaseFilterFactory"/>
        <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
                maxGramSize="15"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
This leaves the displayed values untouched and the query-side normalization
runs, but the problematic letter â (and some other special characters) still
does *not* reliably match when the user types ASCII a. I tested the analyzer
output on the Solr Admin Analysis page: on the query side kâmil is folded to
kamil, but the facet search still does not return the expected values.
Important constraints:
- DSpace code (the UI / server) expects author_filter style fields and
  currently uses the same *_filter field for both facet search and display;
  I cannot easily change DSpace to use a separate *_search field.
- I cannot use WhitespaceTokenizer at query time because the index uses
  KeywordTokenizer (index tokens are whole values), and the mismatch causes
  no hits.
Questions / requests for help
1. Does the DSpace sidebar facet search use the same keywordFilter
   fieldType defined above, or does DSpace apply additional query-time
   processing before facet matching? (Which fieldType or query param does
   the small facet search box use?)
2. Am I missing a Solr parameter that controls how facet search (the small
   value search) is executed so I can inject folding/normalization? (e.g.
   use of facet.contains, facet.prefix, facet.method,
   facet.contains.ignoreCase or special params?)
3. Has anyone solved diacritics matching in the sidebar facets without
   losing the displayed original strings? Best-practice patterns: use
   copyField, multi-field approach, mapping char filter, or client-side
   solution?
4. If multi-field (search+display) is the recommended approach but DSpace
   insists on *_filter, is there a recommended DSpace config or XSL/template
   hook to let the facet UI show the stored display value while facet search
   works against a different indexed token?
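To make questions 3 and 4 concrete, this is roughly the multi-field setup I
have in mind. It is an untested sketch: the keywordFilterFolded type and the
*_filter_folded field name are my own placeholders, not anything DSpace
defines, and I do not know whether DSpace can be pointed at such a field for
the facet search box.

```xml
<!-- Untested sketch, to be placed inside <schema>. "keywordFilterFolded" and
     "*_filter_folded" are hypothetical names; stock DSpace does not define them. -->

<!-- Search-only companion type: the whole value stays one token and is
     folded at both index and query time. Needs the same ICU analysis jars
     as the ICUFoldingFilterFactory in my query-analyzer attempt. -->
<fieldType name="keywordFilterFolded" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Not stored: this field would only be matched against, never displayed. -->
<dynamicField name="*_filter_folded" type="keywordFilterFolded" indexed="true"
              stored="false" multiValued="true" omitNorms="true"/>

<!-- copyField passes on the raw incoming value before analysis, so
     author_filter would feed author_filter_folded while *_filter itself
     keeps its original keywordFilter analysis. -->
<copyField source="*_filter" dest="*_filter_folded"/>
```

With this, *_filter would stay exactly as it is for display, and only the
folded copy would carry the normalized tokens. What I cannot work out is how
to make the sidebar search box match against the folded copy while still
listing values from *_filter.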
What I can provide if helpful:
- sample schema.xml snippets
- Solr analysis outputs (index vs query) for sample values like kâmil,
  kâmil\n|||\nKâmil etc.
- steps I used to test in Solr Admin (analysis page) and example queries
  that fail.
Thanks in advance. Any pointers, config snippets, or DSpace-specific
guidance would be appreciated.
Muhep