On Wed, Feb 08, 2012 at 05:04:56PM +0100, Nick Wellnhofer wrote:
> On 23/12/2011 04:18, Marvin Humphrey wrote:
>> Now that EasyAnalyzer is in, I think we should promote the use of all the
>> improvements Nick has made to the analysis chain.
>>
>> * Swap in EasyAnalyzer for PolyAnalyzer, Normalizer for CaseFolder, and
>> StandardTokenizer for RegexTokenizer everywhere we can.
>
> Done.

Excellent! Lots of great-looking commits coming through. The revisions to
the tutorial looked sane; I figured that would be the trickiest part.

>> * Deprecate the "language" parameter to PolyAnalyzer#new.
>>
>> By "deprecate", I mean:
>>
>> * Open a JIRA issue so that a suitably titled entry ends up in the CHANGES
>> file.
>> * Mark the "language" param as "deprecated" in the PolyAnalyzer docs.
>>
>> We don't have a strong deprecation mechanism available to us right now, so I
>> think that's the best we can do.
>
> I just noticed that I removed the "language" parameter from the
> PolyAnalyzer docs, but I can revert that part of my commit and mark it
> as deprecated.
>
> Regarding the JIRA issue: I couldn't find a good issue type for
> deprecations. "Task" seems the most appropriate to me.

I agree, there's no good answer, so +1 for "Task".

>> It's not important that any of these changes happen before 0.3.0. The docs
>> changes can happen at any time, and the parameter deprecation only allows the
>> simplification of a single class (PolyAnalyzer itself). It would also be nice
>> to switch most test cases to use the new Analyzers, but that can also happen
>> at any time.
>
> The tests have been converted, too.

Lookin' good!

>> In contrast, here are a couple changes we should *not* make prior to 0.3.0,
>> because they have index compatibility implications:
>>
>> * Change Lucy::Simple to use EasyAnalyzer instead of PolyAnalyzer.
>
> I've done that now.

After reviewing the Lucy::Simple code, I realized that we can avoid breaking
compat with only a few extra lines.

  * If the index exists during new(), extract the schema and type from what's
    on disk.
  * Otherwise, create a new EasyAnalyzer for the type.

That way, we avoid a schema conflict crash when indexes built by Lucy::Simple
prior to 0.4.0 are read by 0.4.0 or above.
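
Here's a rough sketch of that branch, written in Python rather than Perl just
to show the logic. It is not the actual Lucy::Simple code: the schema.json
file name, its JSON layout, and the choose_field_type() helper are all made up
for illustration.

```python
import json
import os


def choose_field_type(index_path, language):
    """Pick the field type for a Simple-style index wrapper.

    Hypothetical helper: the file name and JSON layout are assumptions,
    not Lucy's real on-disk format.
    """
    schema_file = os.path.join(index_path, "schema.json")
    if os.path.exists(schema_file):
        # Index already exists: reuse whatever type it was built with,
        # so an older PolyAnalyzer-based index keeps working.
        with open(schema_file) as fh:
            return json.load(fh)["type"]
    # No index yet: start fresh with the new analyzer.
    return {"analyzer": "EasyAnalyzer", "language": language}
```

The point is that the caller never decides which analyzer an existing index
gets; only brand new indexes pick up EasyAnalyzer.
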
>> * Implement CaseFolder as a subclass of Normalizer.
>
> This has yet to be done. We could also mark the CaseFolder as deprecated
> and remove it completely later.

The cost for keeping CaseFolder around in its current form is high, because it
is tied into a perlapi function and thus needs a per-host implementation. (The
perlapi function's name broke in late Perl 5.15 releases, which was a PITA to
troubleshoot.) In contrast, the cost for keeping CaseFolder around is small
if it becomes a subclass of Normalizer.

However, CaseFolder and Normalizer presumably have slightly different case
mappings, so the subclassing change is a back compat break. It shouldn't be
a horrible break (depending on how close the mappings are) because it will
only affect search-time: terms already in the index were folded with the old
mapping while queries get folded with the new one, screwing up the results
only for terms which contain code points whose mapping has changed.
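
To illustrate the kind of mismatch I mean, here's a small Python sketch. It
uses str.lower() and str.casefold() as stand-ins for an "old" and a "new"
mapping; those are assumptions for the sake of the example, not the actual
CaseFolder and Normalizer mappings, and affected_terms() is a made-up helper.

```python
def fold_old(term):
    # Stand-in for the mapping used when the index was built.
    return term.lower()


def fold_new(term):
    # Stand-in for the mapping applied to queries after the upgrade.
    return term.casefold()


def affected_terms(terms):
    """Return the terms whose folded form differs between the two mappings."""
    return [t for t in terms if fold_old(t) != fold_new(t)]
```

With these stand-ins, affected_terms(["Lucy", "STRASSE", "Straße"]) returns
only "Straße": the other two fold identically under both mappings, so
searches for them are unaffected.
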

I don't think we should outright remove CaseFolder without a really good
reason, because that will force almost all of our users to change their code
and then reindex from scratch. But a subtle compat break might be OK,
especially since you can update all the documents in the index in place after
upgrading and only suffer during a window of time from slightly degraded
search results.

Marvin Humphrey