On 11/06/2014 12:56, Thomas Neidhart wrote:
> Hi,
>
> as already commented on https://issues.apache.org/jira/browse/CODEC-187 the
> problem is related to some wrongly ported rule files from the original
> source.
>
> This, otoh, creates a serious problem for us, as it looks like the
> Beider-Morse phonetic matching encoder in commons-codec is a derived work
> from a PHP codebase released under the GPLv3 licence.
> The original codebase is available at http://stevemorse.org/phoneticinfo.htm.
> While investigating the bug and comparing our rule files with the ones from
> the original codebase it is quite clear that at least these are identical.
>
> The author of the patch (see https://issues.apache.org/jira/browse/CODEC-125)
> ported the code and applied the Apache license, but the license of the
> original codebase was never considered or discussed.
>
> This is quite serious I guess, as we have already released the code. We can
> ask the original authors to re-license their code to the Apache Software
> Foundation under a compatible license, but I wonder if they are willing to
> do so.
> This encoder is also used a lot in lucene/solr so it might have even larger
> implications.
>
> Any ideas how to proceed or if a re-licensing would be sufficient in this
> case?

Re-licensing or permission from the original authors would be sufficient.
If that is not forthcoming then there is no option but to delete the code.

Replacing any removed code with a 'clean-room' implementation would be
acceptable but in that case the removal of the current code must not wait
for any replacement.

Mark

>
> Thomas
>
>
> On Wed, Jun 11, 2014 at 9:08 AM, Michael Tobias <mich...@tobias.org.uk>
> wrote:
>
>> Does anybody have a working knowledge of the coding of the Beider Morse
>> Phonetic Matching in the Apache Commons Codec?
>>
>> My recent tests using Solr suggest there is a discrepancy between Steve
>> Morse and Alexander Beider's algorithm and the algorithm currently live in
>> Solr (and hence the Commons Codec).
>>
>> I know that the source code for BMPM issued by Steve has changed several
>> times over the years, and I thought at first it might be that the version
>> used in the Commons Codec is an old version that has subsequently been
>> overtaken. Should the version of the BMPM algorithm not be listed in the
>> Commons Codec documentation? How should version changes to the algorithm be
>> implemented? The algorithm is quite static now so this is probably not so
>> important now but surely it should be DOCUMENTED???
>>
>> My tests now indicate that the discrepancies are NOT a version problem as
>> testing against a very old version 2.00 of the BMPM source code issued on
>> 18 June 2009 still exhibits the same problem.
>>
>> Using just a single test term the results are not good. The only saving
>> grace is that the most widely used version is
>>
>> nameType="GENERIC" ruleType="APPROX"
>>
>> and that is a close (but not perfect) match at least for this ONE test
>> word.
>>
>> For the name Abram, all with languageSet="auto"
>>
>> GENERIC APPROX - fails - misses a few tokens
>> Should create tokens: abram abrom avram avrom obram obrom ovram ovrom abran abron obran obron Ybram Ybrom
>> Solr creates: abram abrom avram avrom obram obrom ovram ovrom abran abron obran obron
>>
>> GENERIC EXACT - good!
>> Should create tokens: abram abran
>> Solr creates: abram abran
>>
>> ASHKENAZI APPROX: - fails dreadfully!
>> Should create tokens: abram abrom avram avrom obram obrom ovram ovrom Ybram Ybrom ombram ombrom imbram imbrom
>> Solr creates: abrAm AvrAm BbrAm
>>
>> ASHKENAZI EXACT: - good!
>> Should create tokens: abram
>> Solr creates: abram
>>
>> SEPHARDIC APPROX: - good!
>> Should create tokens: abram bram abran bran avram vram
>> Solr creates: abram bram abran bran avram vram
>>
>> SEPHARDIC EXACT: - good!
>> Should create tokens: abram abran avram
>> Solr creates: abram abran avram
>>
>> I would appreciate it if somebody with knowledge of the programming of this
>> functionality could investigate.
>>
>> For the worst case I attach here a debug trace of the calculation of the
>> Ashkenazi Approx tokens straight from Steve Morse's implementation. It looks
>> like some of the final rules are not being implemented properly, or at all.
>> The language codes in parentheses vary from BMPM version to version but the
>> resulting tokens have not changed from version 2.00 up to the current 3.02.
>>
>> Thanks
>>
>> Michael
>>
>>
>> applying language rules from (rulesany) to abram using languages 2012
>>
>> char codes = [#61]a [#62]b [#72]r [#61]a [#6d]m
>>
>> applying rule #225
>> pattern=a
>> lcontext=
>> rcontext=[bcdgkpstwzż]
>> subst=(A|B[128])
>> result=(A[2012]|B[128])
>>
>> applying rule #229
>> pattern=b
>> lcontext=
>> rcontext=
>> subst=b
>> result=(Ab[2012]|Bb[128])
>>
>> applying rule #245
>> pattern=r
>> lcontext=
>> rcontext=
>> subst=r
>> result=(Abr[2012]|Bbr[128])
>>
>> applying rule #228
>> pattern=a
>> lcontext=
>> rcontext=
>> subst=A
>> result=(AbrA[2012]|BbrA[128])
>>
>> applying rule #240
>> pattern=m
>> lcontext=
>> rcontext=
>> subst=m
>> result=(AbrAm[2012]|BbrAm[128])
>>
>> after language rules: (AbrAm[2012]|BbrAm[128])
>>
>>
>> applying final rules from (exactapproxcommon plus approxcommon) to AbrAm[2012]
>> no rules match for phonetic item 0 at position 0: A
>> no rules match for phonetic item 0 at position 1: Ab
>> no rules match for phonetic item 0 at position 2: Abr
>> no rules match for phonetic item 0 at position 3: AbrA
>> no rules match for phonetic item 0 at position 4: AbrAm
>>
>> applying final rules from (exactapproxcommon plus approxcommon) to BbrAm[128]
>> no rules match for phonetic item 1 at position 0: B
>> no rules match for phonetic item 1 at position 1: Bb
>> no rules match for phonetic item 1 at position 2: Bbr
>> no rules match for phonetic item 1 at position 3: BbrA
>> no rules match for phonetic item 1 at position 4: BbrAm
>>
>> applying final rules from (approxany) to AbrAm[2012]
>> after applying final rule #97 to phonetic item #0 at position 0:
>> (a[2012]|o[2012]|Y[16]) pattern=A lcontext= rcontext= subst=(a|o|Y[16])
>> after applying final rule #0 to phonetic item #0 at position 1:
>> (ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16]) pattern=b lcontext= rcontext= subst=(b|v[1024])
>> no rules match for phonetic item 0 at position 2:
>> (ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16])r
>> after applying final rule #93 to phonetic item #0 at position 3:
>> (abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|Ybra[16]|Ybro[16]) pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
>> no rules match for phonetic item 0 at position 4:
>> (abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|Ybra[16]|Ybro[16])m
>>
>> applying final rules from (approxany) to BbrAm[128]
>> after applying final rule #22 to phonetic item #1 at position 0:
>> (o[2012]|om[128]|im[128]) pattern=B lcontext= rcontext=[bp] subst=(o|om[128]|im[128])
>> after applying final rule #0 to phonetic item #1 at position 1:
>> (ob[2012]|ov[1024]|omb[128]|imb[128]) pattern=b lcontext= rcontext= subst=(b|v[1024])
>> no rules match for phonetic item 1 at position 2:
>> (ob[2012]|ov[1024]|omb[128]|imb[128])r
>> after applying final rule #93 to phonetic item #1 at position 3:
>> (obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128]|imbro[128]) pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
>> no rules match for phonetic item 1 at position 4:
>> (obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128]|imbro[128])m
>>
>>
>> resulting tokens:
>>
>> (abram|abrom|avram|avrom|obram|obrom|ovram|ovrom|Ybram|Ybrom|ombram|ombrom|imbram|imbrom)
>>
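
The token comparison quoted in Michael's report can be checked mechanically. What follows is a minimal sketch and not part of the original thread: it hard-codes the expected and Solr-produced tokens for "Abram" exactly as listed in the report and computes which tokens are missing or unexpected. It does not call commons-codec or Solr. The names `REPORTED`, `diff_tokens` and `summarize` are hypothetical helpers introduced for illustration, and the comparison is assumed to be case-sensitive, since the reported tokens mix upper and lower case.

```python
# Token lists copied from the report for the name "Abram" (languageSet="auto").
# Keys are (nameType, ruleType); values are (expected tokens, tokens Solr created).
REPORTED = {
    ("GENERIC", "APPROX"): (
        "abram abrom avram avrom obram obrom ovram ovrom "
        "abran abron obran obron Ybram Ybrom",
        "abram abrom avram avrom obram obrom ovram ovrom "
        "abran abron obran obron",
    ),
    ("GENERIC", "EXACT"): ("abram abran", "abram abran"),
    ("ASHKENAZI", "APPROX"): (
        "abram abrom avram avrom obram obrom ovram ovrom "
        "Ybram Ybrom ombram ombrom imbram imbrom",
        "abrAm AvrAm BbrAm",
    ),
    ("ASHKENAZI", "EXACT"): ("abram", "abram"),
    ("SEPHARDIC", "APPROX"): (
        "abram bram abran bran avram vram",
        "abram bram abran bran avram vram",
    ),
    ("SEPHARDIC", "EXACT"): ("abram abran avram", "abram abran avram"),
}


def diff_tokens(expected, actual):
    """Return (missing, unexpected) as sorted lists.

    Assumption: tokens are whitespace-separated and compared case-sensitively.
    """
    exp, act = set(expected.split()), set(actual.split())
    return sorted(exp - act), sorted(act - exp)


def summarize(reported=REPORTED):
    """Map each (nameType, ruleType) pair to its (missing, unexpected) lists."""
    return {key: diff_tokens(exp, act) for key, (exp, act) in reported.items()}


if __name__ == "__main__":
    for (name_type, rule_type), (missing, unexpected) in summarize().items():
        status = "ok" if not missing and not unexpected else "MISMATCH"
        print(f"{name_type} {rule_type}: {status}")
        if missing:
            print("  missing:   ", " ".join(missing))
        if unexpected:
            print("  unexpected:", " ".join(unexpected))
```

With the data above, only GENERIC APPROX and ASHKENAZI APPROX show a mismatch. In the ASHKENAZI APPROX case the Solr output (abrAm, AvrAm, BbrAm) resembles the intermediate result after the language rules in the trace, which fits the suspicion that the final (approxany) rules are not being applied.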