On Thu, Dec 4, 2008 at 9:43 PM, Daniel Cheng <j16sdiz+freenet at gmail.com> 
wrote:
> On Thu, Dec 4, 2008 at 9:07 PM, Matthew Toseland
> <toad at amphibian.dyndns.org> wrote:
>> On Thursday 04 December 2008 09:06, j16sdiz at freenetproject.org wrote:
>>> Author: j16sdiz
>>> Date: 2008-12-04 09:06:29 +0000 (Thu, 04 Dec 2008)
>>> New Revision: 24033
>>>
>>> Modified:
>>>    trunk/plugins/XMLSpider/XMLSpider.java
>>> Log:
>>> solve bug 1714, index site with accent character
>>>
>>>
>>> Modified: trunk/plugins/XMLSpider/XMLSpider.java
>>> ===================================================================
>>> --- trunk/plugins/XMLSpider/XMLSpider.java    2008-12-04 08:33:28 UTC (rev 24032)
>>> +++ trunk/plugins/XMLSpider/XMLSpider.java    2008-12-04 09:06:29 UTC (rev 24033)
>>> @@ -803,7 +803,7 @@
>>>               MessageDigest md;
>>>               md = MessageDigest.getInstance("MD5");
>>>               byte[] md5hash = new byte[32];
>>> -             md.update(text.getBytes("iso-8859-1"), 0, text.length());
>>> +             md.update(text.getBytes("UTF-8"), 0, text.length());
>>
>> Good, but you should have incremented the version number of the spider.
>
> how?
>
>>>               md5hash = md.digest();
>>>               return convertToHex(md5hash);
>>>       }
>>> @@ -1176,8 +1176,9 @@
>>>                       else type = null;
>>>                       /*
>>>                        * determine the position of the word in the retrieved page
>>> +                      * FIXME - replace with a real tokenizor
>>>                        */
>>> -                     String[] words = s.split("[^A-Za-z0-9]");
>>> +                     String[] words = s.split("[^\\p{L}\\{N}]");
>>
>> According to the javadocs, \p{Lower} only works for US-ASCII characters; also,
>> you're losing the upper case characters. I dunno what \{N} is.
>

Also, \p{L} is \p{Letter}, not \p{Lower};
see http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt

> \{N} (meant as \p{N}) is all numbers, including strange ones like this:
>  http://www.fileformat.info/info/unicode/char/3251/index.htm
>
>  see http://www.regular-expressions.info/unicode.html#prop and
> http://unicode.org/reports/tr18/#Categories for details
>
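A minimal sketch of the difference between the two patterns, assuming the commit meant \p{N}. In a Java regex, \{N} inside a character class is an escaped '{', a literal 'N' and a '}', so digits still act as separators. The class name UnicodeSplitDemo, its split helper and the sample strings are made up for illustration and are not part of XMLSpider.

```java
import java.util.ArrayList;
import java.util.List;

final class UnicodeSplitDemo {
    // Pattern as committed in r24033: "\\{N}" is a literal '{', 'N', '}',
    // not a Unicode category.
    static final String COMMITTED = "[^\\p{L}\\{N}]";
    // Pattern with the presumably intended \p{N} (any Unicode number).
    static final String INTENDED = "[^\\p{L}\\p{N}]";

    // Splits text on the given regex and drops empty tokens.
    static List<String> split(String text, String regex) {
        List<String> words = new ArrayList<String>();
        for (String w : text.split(regex)) {
            if (w.length() > 0) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // "cafe" with an accented e, a year, and a Russian word.
        String text = "caf\u00e9 2008 \u043f\u0440\u0438\u0432\u0435\u0442";
        System.out.println(split(text, COMMITTED)); // the digits are dropped
        System.out.println(split(text, INTENDED));  // the digits are kept as a word
    }
}
```

With the committed pattern the accented and Cyrillic words survive, but "2008" is treated as a run of separators; with \p{N} it is kept as a word.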
>> We need to split on the basis of Character.isLetter || isDigit, something
>> like:
>>
>> for(int i=0;i<s.length();i++) {
>>        int fullChar = s.codePointAt(i);
>>        if(Character.isSupplementaryCodePoint(fullChar)) i++;
>>        if(Character.isLetterOrDigit(fullChar)) {
>>                // Add to current word, or start new word
>>        } else {
>>                // Complete last word, or ignore
>>        }
>> }
>> // Finish last word
>
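The loop quoted above can be written out as runnable Java. This is a minimal sketch, assuming the only goal is to split on Character.isLetterOrDigit; the class name CodePointTokenizer and the method tokenize are hypothetical and not part of XMLSpider. It advances by Character.charCount instead of a conditional i++, which has the same effect.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointTokenizer {
    // Splits s into runs of letters and digits, walking by code point
    // so that characters outside the BMP are read as one unit.
    static List<String> tokenize(String s) {
        List<String> words = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // 2 for supplementary code points, else 1
            if (Character.isLetterOrDigit(cp)) {
                // Add to the current word, or start a new one.
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Complete the last word; separators are ignored.
                words.add(current.toString());
                current.setLength(0);
            }
        }
        // Finish the last word.
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("caf\u00e9, 2008; \u043f\u0440\u0438\u0432\u0435\u0442!"));
    }
}
```

Like the regex version, this still treats a run of unspaced ideographs as a single word, so it does not address languages written without spaces.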
> This is just a quick hack to make French / Russian work.
> (I have just tested Russian ... I guess Hebrew should work too)
>
>> Of course we can special-case non-surrogate chars as an optimisation...
>
> Surrogate chars are not the same as SupplementaryCodePoint ...
>
> Supplementary code points are non-BMP (such as CJK Unified Ideographs
> Extension B).
> The languages in Extension B (Chinese, Korean, ancient Vietnamese, etc.)
> are not separated by spaces -- they need special tokenizers.
>
> Surrogate chars need normalization -- those methods are only
> available in Java 6, and I don't think we should ship our own
> normalization table.
>
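To make the two terms concrete, here is a small sketch using only java.lang.Character methods that exist since Java 5. It assumes "surrogate chars" means the UTF-16 code units in the range D800..DFFF; the class name SupplementaryDemo and both helper methods are made up for illustration.

```java
final class SupplementaryDemo {
    // Counts code points outside the BMP (above U+FFFF).
    static int countSupplementary(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                n++;
            }
            i += Character.charCount(cp);
        }
        return n;
    }

    // Counts UTF-16 surrogate code units (char values in D800..DFFF).
    static int countSurrogateUnits(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // U+20000 is the first code point of CJK Unified Ideographs Extension B.
        String s = "a" + new String(Character.toChars(0x20000));
        System.out.println(s.length());             // UTF-16 code units
        System.out.println(countSupplementary(s));  // code points above U+FFFF
        System.out.println(countSurrogateUnits(s)); // surrogate code units
    }
}
```

One supplementary code point is stored as two surrogate chars in a Java String, which is why the quoted loop skips an extra index after reading one.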
>> There doesn't seem to be any easy way to identify that a character does not
>> require spaces and therefore should be indexed in its own right. :|
>
> Ya... that's why I just extended the original method to accept more letters.
>
>
