Re: [sqlite] Bug in porter stemmer

Scott Hess Wed, 24 Feb 2010 11:09:16 -0800

Actually, I think a new version of the tokenizer would have to be a
distinct tokenizer (ie, "porter" versus "porter1" versus "porter2",
whatever).  fts4 should not interpret the meaning of an explicit
tokenizer differently from fts3, but it could use a different default
tokenizer.


[Don't take this as gospel, but that's my understanding of the lay of the land.]

-scott


On Wed, Feb 24, 2010 at 10:28 AM, James Berry <ja...@jberry.us> wrote:
> drh,
>
> Thanks for the response: it's nice to know that the report was actually seen.
>
> It would be hubris indeed to claim to fix an implementation bug in Porter's 
> code. The code in sqlite didn't match any of Porter's code I could find, so I 
> assumed it came from elsewhere: but maybe I missed something. In any event, 
> the authorship wasn't clear to me from the sources. The real point that I had 
> missed was that, as Shane Harrelson points out, step 1c changed between the 
> original porter stemmer and the porter2 stemmer; the step I quote below, and 
> which I "fixed", is in the porter2 algorithm, which in this case introduces 
> an improvement from porter. So in essence I guess my patch moves porter a bit 
> closer to porter2.
>
> I understand the complication that changes to the stemmer would cause an 
> incompatibility. It might be interesting to implement the porter2 algorithm 
> for fts4; I'm not sure how the two compare in terms of performance.
>
> Thanks again,
>
> James
>
>
> On Feb 24, 2010, at 7:05 AM, D. Richard Hipp wrote:
>
>> We got the Porter stemmer code directly from Martin Porter.
>>
>> I'm sorry it does not work like you want it to.  Unfortunately, we
>> cannot change it now without introducing a serious incompatibility
>> with the millions and millions of applications already in the field
>> that are using the existing implementation.
>>
>> FTS3 has a pluggable stemmer module.  You can write your own stemmer
>> that works "correctly" if you like, and link it in for use in your
>> applications.  We will also investigate making your recommended
>> changes for FTS4.  However, in order to maintain backwards
>> compatibility of FTS3, we cannot change the stemmer algorithm, even to
>> fix a "bug".
>>
>> On Feb 24, 2010, at 9:59 AM, James Berry wrote:
>>
>>> Can somebody please clarify the bug reporting process for sqlite? My
>>> understanding is that it's not possible to file bug reports
>>> directly, and that the advice is to write to the user list first.
>>> I've done that (below) but have had no response so far and am
>>> concerned that this means the bug report will just be forgotten by
>>> others, as well as by me.
>>>
>>> How does this bug move from a message on a list to a ticket (and
>>> ultimately a patch, we hope) in the system?
>>>
>>> James
>>>
>>> On Feb 22, 2010, at 2:51 PM, James Berry wrote:
>>>
>>>> I'm writing to report a bug in the porter-stemmer algorithm
>>>> supplied as part of the FTS3 implementation.
>>>>
>>>> The stemmer has an inverted logic error that prevents it from
>>>> properly stemming words of the following form:
>>>>
>>>>     dry -> dri
>>>>     cry -> cri
>>>>
>>>> This means, for instance, that the following words don't stem the
>>>> same:
>>>>
>>>>     dried -> dri   -doesn't match-   dry
>>>>     cried -> cri   -doesn't match-   cry
>>>>
>>>> The bug seems to have been introduced as a simple logic error by
>>>> whoever wrote the stemmer code. The original description of step 1c
>>>> is here: http://snowball.tartarus.org/algorithms/english/stemmer.html
>>>>
>>>>     Step 1c:
>>>>             replace suffix y or Y by i if preceded by a non-vowel which is
>>>> not the first letter of the word (so cry -> cri, by -> by, say ->
>>>> say)
>>>>
>>>> But the code in sqlite reads like this:
>>>>
>>>> /* Step 1c */
>>>> if( z[0]=='y' && hasVowel(z+1) ){
>>>>  z[0] = 'i';
>>>> }
>>>>
>>>> In other words, sqlite turns the y into an i only if it is preceded
>>>> by a vowel (say -> sai), while the algorithm intends this to be
>>>> done if it is _not_ preceded by a vowel.
>>>>
>>>> But there are two other problems in that same line of code:
>>>>
>>>>     (1) hasVowel checks whether a vowel exists anywhere in the string,
>>>> not just in the next character, which is incorrect, and goes
>>>> against the step 1c directions above. (amplify would not be
>>>> properly stemmed to amplifi, for instance)
>>>>
>>>>     (2) The check for the first letter is not performed (for words
>>>> like "by", etc)
>>>>
>>>> I've fixed both of those errors in the patch below:
>>>>
>>>> /* Step 1c */
>>>> -  if( z[0]=='y' && hasVowel(z+1) ){
>>>> +  if( z[0]=='y' && isConsonant(z+1) && z[2] ){
>>>>   z[0] = 'i';
>>>> }
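
For comparison, here is a minimal sketch of the two step 1c rules discussed in this thread: the original Porter rule (replace a final y with i when the stem before it contains a vowel) and the porter2 rule quoted above. This is not the SQLite source. The function names step1c_porter, step1c_porter2 and is_plain_vowel are hypothetical, the words are scanned front to back (the quoted SQLite snippet appears to index from the end of the word instead), and the handling of y inside the stem is simplified, so treat it as an illustration only.

```c
#include <string.h>

/* Hypothetical helper: true for a, e, i, o, u only. */
static int is_plain_vowel(char c){
  return c=='a' || c=='e' || c=='i' || c=='o' || c=='u';
}

/* Sketch of the original Porter step 1c: a final 'y' becomes 'i'
** if the stem in front of it contains a vowel.  Simplification:
** a 'y' in the stem counts as a vowel when the letter before it
** is not a, e, i, o or u. */
static void step1c_porter(char *w){
  size_t n = strlen(w);
  size_t i;
  if( n<2 || w[n-1]!='y' ) return;
  for(i=0; i<n-1; i++){
    if( is_plain_vowel(w[i])
     || (w[i]=='y' && i>0 && !is_plain_vowel(w[i-1])) ){
      w[n-1] = 'i';
      return;
    }
  }
}

/* Sketch of the porter2 step 1c: a final 'y' becomes 'i' if it is
** preceded by a non-vowel that is not the first letter of the word.
** porter2 treats 'y' itself as a vowel for this test.  The upper-case
** 'Y' marker used by the full algorithm is not handled here. */
static void step1c_porter2(char *w){
  size_t n = strlen(w);
  if( n>2 && w[n-1]=='y'
   && !is_plain_vowel(w[n-2]) && w[n-2]!='y' ){
    w[n-1] = 'i';
  }
}
```

With these sketches, "cry" is left alone by step1c_porter but becomes "cri" under step1c_porter2, while "say" becomes "sai" under step1c_porter and stays "say" under step1c_porter2. Both leave "by" unchanged and both turn "amplify" into "amplifi".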
>>>>
>>>> _______________________________________________
>>>> sqlite-users mailing list
>>>> sqlite-users@sqlite.org
>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>>
>>
>> D. Richard Hipp
>> d...@hwaci.com
>>
>>
>>
>
>