On Wed, Nov 16, 2011 at 5:24 PM, Nick Wellnhofer <[email protected]> wrote:
> On 16/11/11 04:49, Marvin Humphrey wrote:
>>
>> It would be great to support accent stripping in Lucy -- that's something a
>> lot of people need. Normalization would also be a nice feature to offer
>> (Maybe we should make it the first step of PolyAnalyzer or PolyAnalyzer's
>> replacement?).
>
> Thinking about the implications of Unicode in the analyzer chain, I've come
> to the conclusion that the first step should always be tokenization. In the
> current implementation the CaseFolder comes first in the chain by default.
> But case folding (or lowercasing) can add or remove Unicode codepoints and
> mess with the character offsets for the highlighter. See the attached script
> for a demonstration.
>
>> It would also be great to migrate Lucy::Analysis::CaseFolder code away from
>> its dependency on the Perl C API.
>
> Yes, we could even do proper Unicode case folding, normalization and accent
> stripping in one pass with utf8proc. This should be the next step after
> tokenization. The stopalizer and stemmers should be safe when using NFC or
> NFKC. I think we can leave the choice between these normalization forms to
> the user.
>
> If we go with utf8proc, I would propose a new analyzer
> Lucy::Analysis::Normalizer with the following interface:
>
>     my $normalizer = Lucy::Analysis::Normalizer->new(
>         normalization_form => $string,
>         case_fold          => $bool,
>         strip_accents      => $bool,
>     );
>
> normalization_form can be one of 'NFC', 'NFKC', 'NFD', 'NFKD'. The
> decomposed forms won't play well with other analyzers but could be easily
> added for completeness. I'm not sure whether we should default to NFC or
> NFKC.
>
> case_fold and strip_accents are simple on/off switches. By default case_fold
> is enabled and strip_accents disabled.
>

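The offset problem is easy to reproduce outside Lucy. Here is a minimal sketch in Python, used only because its standard library does full Unicode case folding; it is not the attached script, and the sample strings and the helper name length_changes are illustrative assumptions.

```python
# Case mapping can change the number of code points in a string, so
# offsets computed after folding no longer match the original text.

def length_changes(text):
    """Return code point counts for (original, lowercased, case-folded)."""
    return len(text), len(text.lower()), len(text.casefold())

# Illustrative samples (assumptions, not taken from the thread):
#   U+00DF LATIN SMALL LETTER SHARP S folds to "ss"
#   U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE lowercases to "i" + U+0307
#   U+FB01 LATIN SMALL LIGATURE FI folds to "fi"
samples = ["Stra\u00dfe", "\u0130stanbul", "\ufb01sh"]

for sample in samples:
    print(ascii(sample), length_changes(sample))

# The start offset of a later token moves once the text has been folded.
text = "Stra\u00dfe 12"
print(text.index("12"), text.casefold().index("12"))
```

The final line prints two different offsets for the same token. That is the mismatch a highlighter would hit if folding ran before tokenization and the offsets were then applied to the original text.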
Does your Unicode library also support "NFKC_Casefold"? It might be a nice
default:

# Derived Property:   NFKC_Casefold (NFKC_CF)
#   This property removes certain variations from characters: case,
#   compatibility, and default-ignorables.
#   It is used for loose matching and certain types of identifiers.
#   It is constructed by applying NFKC, CaseFolding, and removal of
#   Default_Ignorable_Code_Points.
#   The process of applying these transformations is repeated until a stable
#   result is produced.

-- 
lucidimagination.com
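The construction quoted above can be approximated in a few lines. This is only a rough Python sketch, not the official NFKC_Casefold mapping and not anything utf8proc is known to provide. The names IGNORABLE_SUBSET, fold_once and nfkc_casefold_approx are hypothetical, and the ignorable set is a small hand-picked subset of Default_Ignorable_Code_Point chosen for illustration.

```python
import unicodedata

# Assumption: a tiny, incomplete subset of Default_Ignorable_Code_Point
# (soft hyphen, zero width space/non-joiner/joiner, zero width no-break space).
IGNORABLE_SUBSET = {"\u00ad", "\u200b", "\u200c", "\u200d", "\ufeff"}

def fold_once(text):
    """Apply NFKC, then case folding, then drop the ignorable subset."""
    text = unicodedata.normalize("NFKC", text)
    text = text.casefold()
    return "".join(ch for ch in text if ch not in IGNORABLE_SUBSET)

def nfkc_casefold_approx(text):
    """Repeat fold_once until the result stops changing."""
    while True:
        result = fold_once(text)
        if result == text:
            return result
        text = result

print(nfkc_casefold_approx("\ufb01SH"))
print(nfkc_casefold_approx("co\u00adOP"))
```

A real implementation would take the per-character mapping from the Unicode data files rather than iterating like this. The sketch only shows why a single property can cover compatibility normalization, case folding and ignorable removal in one step.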
