On Thursday 04 December 2008 09:06, j16sdiz at freenetproject.org wrote:
> Author: j16sdiz
> Date: 2008-12-04 09:06:29 +0000 (Thu, 04 Dec 2008)
> New Revision: 24033
> 
> Modified:
>    trunk/plugins/XMLSpider/XMLSpider.java
> Log:
> solve bug 1714, index site with accent character
> 
> 
> Modified: trunk/plugins/XMLSpider/XMLSpider.java
> ===================================================================
> --- trunk/plugins/XMLSpider/XMLSpider.java    2008-12-04 08:33:28 UTC (rev 24032)
> +++ trunk/plugins/XMLSpider/XMLSpider.java    2008-12-04 09:06:29 UTC (rev 24033)
> @@ -803,7 +803,7 @@
>               MessageDigest md;
>               md = MessageDigest.getInstance("MD5");
>               byte[] md5hash = new byte[32];
> -             md.update(text.getBytes("iso-8859-1"), 0, text.length());
> +             md.update(text.getBytes("UTF-8"), 0, text.length());

Good, but you should have incremented the version number of the spider.
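
One more thing on this line: text.length() counts UTF-16 chars, not encoded
bytes. With UTF-8, any non-ASCII text encodes to more bytes than it has chars,
so passing text.length() as the length leaves the tail of the array out of the
hash. A minimal sketch of hashing the whole encoded array instead (the class
name and the toHex helper are made up for illustration and are not the
spider's own convertToHex; StandardCharsets assumes a newer JVM than we
currently target):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Sketch {
    // Hypothetical helper: lower-case hex encoding of a byte array.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hash every encoded byte, using the byte count rather than the char count.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
            md.update(bytes, 0, bytes.length);
            return toHex(md.digest());
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```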

>               md5hash = md.digest();
>               return convertToHex(md5hash);
>       }
> @@ -1176,8 +1176,9 @@
>                       else type = null;
>                       /*
>                        * determine the position of the word in the retrieved page
> +                      * FIXME - replace with a real tokenizor
>                        */
> -                     String[] words = s.split("[^A-Za-z0-9]");
> +                     String[] words = s.split("[^\\p{L}\\{N}]");

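Careful with the classes here. According to the javadocs, \p{L} is the Unicode
"letter" category, so it covers both cases and non-ASCII letters (unlike the
POSIX class \p{Lower}, which only matches US-ASCII). But \{N} is not a class
at all: without the p it just adds the literal characters {, N and } to the
set, so digits are now treated as separators. Presumably \p{N} was meant.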
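
For reference, in java.util.regex \p{L} is the Unicode letter category and
\p{N} the Unicode number category. A quick sketch of splitting with them,
assuming \p{N} is what was intended in the commit (the class and method names
are made up for illustration, and the trailing + is my addition, not part of
the committed pattern):

```java
import java.util.regex.Pattern;

final class SplitSketch {
    // \p{L} = any Unicode letter, \p{N} = any Unicode number.
    // The trailing + collapses a run of separators into one split point,
    // so no empty strings appear between words.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}]+");

    static String[] words(String s) {
        return NON_WORD.split(s);
    }
}
```

Note that collapsing separators changes the array indices compared with
splitting on single characters, which matters if the index is used as the
word position.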

We need to split on the basis of Character.isLetter || isDigit, something 
like:

for(int i=0;i<s.length();i++) {
        int fullChar = s.codePointAt(i);
        // A supplementary code point occupies two chars: skip the low surrogate
        if(Character.isSupplementaryCodePoint(fullChar)) i++;
        if(Character.isLetterOrDigit(fullChar)) {
                // Add to current word, or start new word
        } else {
                // Complete last word, or ignore
        }
}
// Finish last word

Of course we can special-case non-surrogate chars as an optimisation...
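
A rough sketch of what that could look like, filling in the word handling as
well. Everything here (class name, method name, returning a List) is made up
for illustration and is not code from the spider:

```java
import java.util.ArrayList;
import java.util.List;

final class WordSplitSketch {
    // Splits s into runs of letters and digits, working on code points.
    static List<String> splitWords(String s) {
        List<String> words = new ArrayList<String>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int cp;
            if (!Character.isHighSurrogate(c)) {
                // Fast path: this char is a complete code point on its own.
                cp = c;
                i++;
            } else {
                // Slow path: combine the surrogate pair (an unpaired high
                // surrogate comes back as itself, with a char count of 1).
                cp = s.codePointAt(i);
                i += Character.charCount(cp);
            }
            if (Character.isLetterOrDigit(cp)) {
                // Add to current word, or start a new one.
                word.appendCodePoint(cp);
            } else if (word.length() > 0) {
                // Separator: complete the current word.
                words.add(word.toString());
                word.setLength(0);
            }
        }
        // Finish last word.
        if (word.length() > 0) words.add(word.toString());
        return words;
    }
}
```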

There doesn't seem to be any easy way to identify that a character does not 
require spaces and therefore should be indexed in its own right. :|
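
One rough heuristic, offered only as a sketch: treat each ideographic code
point as a one-character word. This assumes Character.isIdeographic(int),
which only exists from Java 7 onwards, and it does nothing for scripts such
as Thai or Japanese kana that are written without spaces but are not
ideographic. The names below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

final class IdeographSplitSketch {
    // Moves the pending word, if any, into the result list.
    private static void flush(StringBuilder word, List<String> words) {
        if (word.length() > 0) {
            words.add(word.toString());
            word.setLength(0);
        }
    }

    static List<String> splitWords(String s) {
        List<String> words = new ArrayList<String>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                // Ideograph: end any pending word, then index it on its own.
                flush(word, words);
                words.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp);
            } else {
                flush(word, words);
            }
        }
        flush(word, words);
        return words;
    }
}
```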