Actually, I think a new version of the tokenizer would have to be a distinct tokenizer (i.e., "porter" versus "porter1" versus "porter2", whatever). fts4 should not interpret the meaning of an explicit tokenizer differently from fts3, but it could use a different default tokenizer.
[Don't take this as gospel, but that's my understanding of the lay of the land.]

-scott

On Wed, Feb 24, 2010 at 10:28 AM, James Berry <ja...@jberry.us> wrote:
> drh,
>
> Thanks for the response: it's nice to know that the report was actually seen.
>
> It would be hubris indeed to claim to fix an implementation bug in Porter's
> code. The code in sqlite didn't match any of Porter's code I could find, so I
> assumed it came from elsewhere: but maybe I missed something. In any event,
> the authorship wasn't clear to me from the sources. The real point that I had
> missed was that, as Shane Harrelson points out, step 1c changed between the
> original porter stemmer and the porter2 stemmer; the step I quote below, and
> which I "fixed", is in the porter2 algorithm, which in this case introduces
> an improvement from porter. So in essence I guess my patch moves porter a bit
> closer to porter2.
>
> I understand the complication that changes to the stemmer would cause an
> incompatibility. It might be interesting to implement the porter2 algorithm
> for fts4; I'm not sure how the two compare in terms of performance.
>
> Thanks again,
>
> James
>
>
> On Feb 24, 2010, at 7:05 AM, D. Richard Hipp wrote:
>
>> We got the Porter stemmer code directly from Martin Porter.
>>
>> I'm sorry it does not work like you want it to. Unfortunately, we
>> cannot change it now without introducing a serious incompatibility
>> with the millions and millions of applications already in the field
>> that are using the existing implementation.
>>
>> FTS3 has a pluggable stemmer module. You can write your own stemmer
>> that works "correctly" if you like, and link it in for use in your
>> applications. We will also investigate making your recommended
>> changes for FTS4. However, in order to maintain backwards
>> compatibility of FTS3, we cannot change the stemmer algorithm, even to
>> fix a "bug".
>>
>> On Feb 24, 2010, at 9:59 AM, James Berry wrote:
>>
>>> Can somebody please clarify the bug reporting process for sqlite? My
>>> understanding is that it's not possible to file bug reports
>>> directly, and that the advice is to write to the user list first.
>>> I've done that (below) but have no response so far and am concerned
>>> that this means the bug report will just be forgotten by others, as
>>> well as by me.
>>>
>>> How does this bug move from a message on a list to a ticket (and
>>> ultimately a patch, we hope) in the system?
>>>
>>> James
>>>
>>> On Feb 22, 2010, at 2:51 PM, James Berry wrote:
>>>
>>>> I'm writing to report a bug in the porter-stemmer algorithm
>>>> supplied as part of the FTS3 implementation.
>>>>
>>>> The stemmer has an inverted logic error that prevents it from
>>>> properly stemming words of the following form:
>>>>
>>>>     dry -> dri
>>>>     cry -> cri
>>>>
>>>> This means, for instance, that the following words don't stem the
>>>> same:
>>>>
>>>>     dried -> dri   -doesn't match-   dry
>>>>     cried -> cri   -doesn't match-   cry
>>>>
>>>> The bug seems to have been introduced as a simple logic error by
>>>> whoever wrote the stemmer code. The original description of step 1c
>>>> is here: http://snowball.tartarus.org/algorithms/english/stemmer.html
>>>>
>>>> Step 1c:
>>>>     replace suffix y or Y by i if preceded by a non-vowel which is
>>>>     not the first letter of the word (so cry -> cri, by -> by, say ->
>>>>     say)
>>>>
>>>> But the code in sqlite reads like this:
>>>>
>>>>     /* Step 1c */
>>>>     if( z[0]=='y' && hasVowel(z+1) ){
>>>>       z[0] = 'i';
>>>>     }
>>>>
>>>> In other words, sqlite turns the y into an i only if it is preceded
>>>> by a vowel (say -> sai), while the algorithm intends this to be
>>>> done if it is _not_ preceded by a vowel.
>>>>
>>>> But there are two other problems in that same line of code:
>>>>
>>>> (1) hasVowel checks whether a vowel exists anywhere in the string,
>>>>     not just in the next character, which is incorrect, and goes
>>>>     against the step 1c directions above. (amplify would not be
>>>>     properly stemmed to amplifi, for instance)
>>>>
>>>> (2) The check for the first letter is not performed (for words
>>>>     like "by", etc)
>>>>
>>>> I've fixed both of those errors in the patch below:
>>>>
>>>>     /* Step 1c */
>>>> -   if( z[0]=='y' && hasVowel(z+1) ){
>>>> +   if( z[0]=='y' && isConsonant(z+1) && z[2] ){
>>>>       z[0] = 'i';
>>>>     }
>>>>
>>>> _______________________________________________
>>>> sqlite-users mailing list
>>>> sqlite-users@sqlite.org
>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>
>> D. Richard Hipp
>> d...@hwaci.com