On Thu, Dec 4, 2008 at 9:07 PM, Matthew Toseland
<toad at amphibian.dyndns.org> wrote:
> On Thursday 04 December 2008 09:06, j16sdiz at freenetproject.org wrote:
>> Author: j16sdiz
>> Date: 2008-12-04 09:06:29 +0000 (Thu, 04 Dec 2008)
>> New Revision: 24033
>>
>> Modified:
>>    trunk/plugins/XMLSpider/XMLSpider.java
>> Log:
>> solve bug 1714, index site with accent character
>>
>>
>> Modified: trunk/plugins/XMLSpider/XMLSpider.java
>> ===================================================================
>> --- trunk/plugins/XMLSpider/XMLSpider.java    2008-12-04 08:33:28 UTC (rev 24032)
>> +++ trunk/plugins/XMLSpider/XMLSpider.java    2008-12-04 09:06:29 UTC (rev 24033)
>> @@ -803,7 +803,7 @@
>>               MessageDigest md;
>>               md = MessageDigest.getInstance("MD5");
>>               byte[] md5hash = new byte[32];
>> -             md.update(text.getBytes("iso-8859-1"), 0, text.length());
>> +             md.update(text.getBytes("UTF-8"), 0, text.length());
>
> Good, but you should have incremented the version number of the spider.

how?
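
One caveat on the hunk above: text.length() counts UTF-16 chars, not
encoded bytes, so with UTF-8 an accented string can have more bytes than
the length passed to md.update() and the tail would not be hashed. A
minimal sketch of hashing the whole encoded array follows. Md5Sketch and
md5Hex are hypothetical names, the hex loop is only a stand-in for the
spider's convertToHex, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Sketch {
    /** Returns the lowercase hex MD5 of the complete UTF-8 encoding of text. */
    static String md5Hex(String text) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Hash every encoded byte; utf8.length can exceed text.length().
        byte[] hash = md.digest(utf8);
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```

Changing the digest input changes every stored hash for non-ASCII words,
which is presumably why a version bump was asked for.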

>>               md5hash = md.digest();
>>               return convertToHex(md5hash);
>>       }
>> @@ -1176,8 +1176,9 @@
>>                       else type = null;
>>                       /*
>>                        * determine the position of the word in the retrieved page
>> +                      * FIXME - replace with a real tokenizor
>>                        */
>> -                     String[] words = s.split("[^A-Za-z0-9]");
>> +                     String[] words = s.split("[^\\p{L}\\{N}]");
>
> According to the javadocs, \p{Lower} only works for US-ASCII characters, also
> you're losing the upper case characters. I dunno what \{N} is.

\{N} is meant as \p{N}: all numbers, including strange ones like this:
  http://www.fileformat.info/info/unicode/char/3251/index.htm

See http://www.regular-expressions.info/unicode.html#prop and
http://unicode.org/reports/tr18/#Categories for details.
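
A small sketch of the split on Unicode categories, assuming the intended
pattern is [^\p{L}\p{N}]. As far as I can tell, \{N} without the "p" is
read inside a character class as the three literal characters '{', 'N'
and '}', so digits would still act as separators. SplitSketch and words
are hypothetical names, and the trailing "+" is an addition that
collapses runs of separators.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    /** Splits s on anything that is neither a Unicode letter nor a Unicode number. */
    static List<String> words(String s) {
        List<String> out = new ArrayList<String>();
        // \p{L} = any letter, \p{N} = any number (Nd, Nl and No).
        for (String w : s.split("[^\\p{L}\\p{N}]+")) {
            // A leading separator produces one empty first element; skip it.
            if (w.length() > 0) {
                out.add(w);
            }
        }
        return out;
    }
}
```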

> We need to split on the basis of Character.isLetter || isDigit, something
> like:
>
> for(int i=0;i<s.length();i++) {
>        int fullChar = Character.codePointAt(s, i);
>        if(Character.isSupplementaryCodePoint(fullChar)) i++;
>        if(Character.isLetterOrDigit(fullChar)) {
>                // Add to current word, or start new word
>        } else {
>                // Complete last word, or ignore
>        }
> }
> // Finish last word

This is just a quick hack to make French / Russian work.
(I have only tested Russian ... I guess Hebrew should work too.)
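
For comparison, the quoted loop could be fleshed out roughly like this.
It is only a sketch under the same Character.isLetterOrDigit rule, not
what the spider does; CodePointTokenizer and tokenize are hypothetical
names.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointTokenizer {
    /** Splits s into maximal runs of letter-or-digit code points. */
    static List<String> tokenize(String s) {
        List<String> words = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // Advance by 2 chars for a supplementary code point, else by 1.
            i += Character.charCount(cp);
            if (Character.isLetterOrDigit(cp)) {
                // Add to the current word, or start a new one.
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Separator: complete the last word.
                words.add(current.toString());
                current.setLength(0);
            }
        }
        // Finish the last word.
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```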

> Of course we can special-case non-surrogate chars as an optimisation...

Surrogate chars are not the same as supplementary code points ...

Supplementary code points are non-BMP (such as CJK Unified Ideographs
Extension B).
The languages written with Extension B (Chinese, Korean, ancient
Vietnamese, etc.) do not separate words with spaces -- they need special
tokenizers.

Surrogate chars need normalization -- those methods are only
available in Java 6, and I don't think we should ship our own
normalization table.
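
To make the terms concrete, here is a short sketch. In Java a
supplementary code point is stored as two surrogate chars, while
java.text.Normalizer (Java 6) handles combining sequences such as
'e' followed by U+0301. UnicodeNotes and its methods are hypothetical
names used only for illustration.

```java
import java.text.Normalizer;

final class UnicodeNotes {
    /** Composes combining sequences into precomposed characters (NFC). */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Counts code points rather than UTF-16 chars. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+20000 is the first code point of CJK Unified Ideographs Extension B.
        String extB = new String(Character.toChars(0x20000));
        System.out.println("chars: " + extB.length());
        System.out.println("code points: " + codePoints(extB));
        System.out.println("high surrogate first: "
                + Character.isHighSurrogate(extB.charAt(0)));
        // 'e' + combining acute accent composes to the single char U+00E9.
        System.out.println("NFC length: " + nfc("e\u0301").length());
    }
}
```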

> There doesn't seem to be any easy way to identify that a character does not
> require spaces and therefore should be indexed in its own right. :|

Yeah... that's why I just extended the original method to accept more letters.
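
For what it is worth, a rough heuristic is possible on newer runtimes.
The sketch below assumes Character.isIdeographic, which only exists from
Java 7 on, and emits every ideograph as its own token. It is not a real
tokenizer: scripts such as kana or Thai, which also omit spaces, are not
handled. IdeographAwareTokenizer is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class IdeographAwareTokenizer {
    /** Letter/digit runs become words; each ideograph becomes its own token. */
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            boolean ideograph = Character.isIdeographic(cp);
            if (!ideograph && Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
                continue;
            }
            // A separator or an ideograph ends the word being built.
            if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (ideograph) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```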

