Thanks, Christian. What is the effective character set used when diacritics are removed? Latin-1?
Tim

--
Tim A. Thompson (*he, him*)
Librarian for Applied Metadata Research
Yale University Library
www.linkedin.com/in/timathompson
[email protected]

On Mon, Nov 22, 2021 at 2:53 PM Christian Grün <[email protected]> wrote:

> Hi Tim,
>
> > I have a question about the BaseX ft:normalize function. What kind of
> > Unicode normalization is performed by this function, and how might it be
> > implemented using standard XPath functions?
>
> The function is based on a custom BaseX tokenization, which includes
> normalization of case, removal of diacritics and (if enabled)
> language-based stemming. It would be rather challenging to implement
> the behavior with standard XPath (that’s mostly why we introduced
> ft:tokenize and ft:normalize). If you are looking for a starting
> point, you could begin with the FtTokenize Java class [1].
>
> Hope this helps,
> Christian
>
> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/query/func/ft/FtTokenize.java#L31-L51
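
To illustrate the case and diacritic steps described in the quoted reply, here is a minimal Python sketch. It is not BaseX's tokenizer and makes no claim to match the output of ft:normalize. It assumes that "removal of diacritics" can be approximated by canonical Unicode decomposition (NFD) followed by dropping combining marks, and it leaves out tokenization and stemming entirely. The function name `approximate_normalize` is hypothetical.

```python
import unicodedata


def approximate_normalize(text: str) -> str:
    """Lowercase text and drop combining marks.

    Hypothetical helper: an approximation based on Unicode NFD,
    not the algorithm used by BaseX's ft:normalize.
    """
    # Step 1: normalize case.
    lowered = text.lower()
    # Step 2: split precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Step 3: drop every character in a Unicode "Mark" category (Mn, Mc, Me).
    stripped = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    # Step 4: recompose whatever is left into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    for sample in ["Grün", "Émile Zola", "Łódź"]:
        print(sample, "->", approximate_normalize(sample))
```

With this approach the output is not confined to Latin-1. Letters that have no canonical decomposition, such as "ł" (U+0142), pass through unchanged, so the resulting character set depends on the input. Whether BaseX behaves the same way would need to be checked against the FtTokenize source.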

