On 14/11/11 22:22, Marvin Humphrey wrote:
> On Mon, Nov 14, 2011 at 07:45:36PM +0100, Nick Wellnhofer wrote:
>> I'm trying to write my own analyzer class that strips accents and does
>> some other transformations. I had a look at Father Chrysostomos'
>> KSx::Analysis::StripAccents and tried to get something similar to run
>> with Lucy 0.2.2. With the following two changes I could make it work:
>> - The 'transform' method can't reuse the inversion argument but must
>>   return a new inversion.
>
> Lucy::Analysis::SnowballStemmer#transform reuses its Inversion; it should work
> for you as well. Perhaps you need to invoke Inversion#reset to reset the
> iterator?

That's probably the reason. I'll give it a try.

>> Are there any other caveats? Is there any documentation on how to write
>> your own analyzer classes?
>
> The subclassing API for Analyzer was redacted prior to Lucy 0.1 in
> anticipation of refactoring; Lucy::Analysis::Inversion and
> Lucy::Analysis::Token are not public classes. So what you are trying to do is
> not officially supported.
>
> That said, we know that we need to restore this capability. The more people
> who are hacking on the Lucy core analysis code, the sooner we will be able to
> do so.

Are there any additional pointers for people who'd like to hack on this?

>> If anyone is interested in a LucyX::Analysis::StripAccents module, I
>> could put something up on CPAN.
>
> If we were to handle this as a contribution to Lucy itself, so that
> LucyX::Analysis::StripAccents would be distributed alongside other LucyX
> modules such as the LucyX::Remote classes, that would allow us to change the
> internal implementation for analysis without causing downstream disruption of
> an independent CPAN distro for LucyX::Analysis::StripAccents.
>
> If we go down that path, there are some licensing issues that would need to
> be resolved. We'd need Father Chrysostomos on board (which I hope would be
> doable), but then there's also the issue of the Text::Unaccent dependency.
>
> Let us know if you'd like to explore that option further.

Text::Unaccent is based on libunac which AFAICS is only available under
GPL. You could also build your own unaccenting tables with
Unicode::Normalize or any other library that can decompose Unicode
strings. Thinking more about it, Unicode normalization would also be a
nice feature for the Lucy analyzer.
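
A minimal sketch of the Unicode::Normalize approach, to illustrate what I
mean. It decomposes the text with NFD and then drops the nonspacing
combining marks. The helper name strip_accents is made up for this example;
it is not part of Lucy, KSx::Analysis::StripAccents or Text::Unaccent, and it
is not wired into an Analyzer subclass.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper, for illustration only.
# Canonically decompose the text, then remove nonspacing marks
# (general category Mn), which is where the accents end up after NFD.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;
    return $decomposed;
}

# "café crème", written with escapes to avoid source-encoding issues.
print strip_accents("caf\N{U+00E9} cr\N{U+00E8}me"), "\n";    # prints "cafe creme"
```

Letters that have no canonical decomposition, such as "ø" (U+00F8), pass
through unchanged, so this covers less than a hand-built unaccenting table
could.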
Would it make sense to have all the Unicode functionality in the Lucy
core using a third-party Unicode library? Or should we rely on the
Unicode support of the host language like we do for case folding?
Nick