XMLSpider

Matthew Toseland Thu, 4 Dec 2008 14:47:04 +0000

On Thursday 04 December 2008 13:46, Daniel Cheng wrote:
> On Thu, Dec 4, 2008 at 9:43 PM, Daniel Cheng <j16sdiz+freenet at gmail.com> wrote:
> > On Thu, Dec 4, 2008 at 9:07 PM, Matthew Toseland
> > <toad at amphibian.dyndns.org> wrote:
> >> On Thursday 04 December 2008 09:06, j16sdiz at freenetproject.org wrote:
> >>> Author: j16sdiz
> >>> Date: 2008-12-04 09:06:29 +0000 (Thu, 04 Dec 2008)
> >>> New Revision: 24033
> >>>
> >>> Modified:
> >>>    trunk/plugins/XMLSpider/XMLSpider.java
> >>> Log:
> >>> solve bug 1714, index site with accent character
> >>>
> >>>
> >>> Modified: trunk/plugins/XMLSpider/XMLSpider.java
> >>> ===================================================================
> >>> --- trunk/plugins/XMLSpider/XMLSpider.java    2008-12-04 08:33:28 UTC (rev 24032)
> >>> +++ trunk/plugins/XMLSpider/XMLSpider.java    2008-12-04 09:06:29 UTC (rev 24033)
> >>> @@ -803,7 +803,7 @@
> >>>               MessageDigest md;
> >>>               md = MessageDigest.getInstance("MD5");
> >>>               byte[] md5hash = new byte[32];
> >>> -             md.update(text.getBytes("iso-8859-1"), 0, text.length());
> >>> +             md.update(text.getBytes("UTF-8"), 0, text.length());
> >>
> >> Good, but you should have incremented the version number of the spider.
> >
> > how?
> >
> >>>               md5hash = md.digest();
> >>>               return convertToHex(md5hash);
> >>>       }
> >>> @@ -1176,8 +1176,9 @@
> >>>                       else type = null;
> >>>                       /*
> >>>                        * determine the position of the word in the retrieved page
> >>> +                      * FIXME - replace with a real tokenizor
> >>>                        */
> >>> -                     String[] words = s.split("[^A-Za-z0-9]");
> >>> +                     String[] words = s.split("[^\\p{L}\\{N}]");
> >>
> >> According to the javadocs, \p{Lower} only works for US-ASCII characters,
> >> also you're losing the upper case characters. I dunno what \{N} is.
> >
> 
> and \p{L} is \p{Letter}, not \p{Lower}
> see http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt


Does Java actually support either of these? They don't seem to be in the regex 
docs for 1.5.
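
For reference, a minimal sketch of how this could be checked. As far as I can tell, the java.util.regex.Pattern documentation does cover Unicode general categories (it lists \p{Lu} and says \p{L} denotes the category of Unicode letters), so \p{L} and \p{N} should compile. The class and method names below are hypothetical, and the trailing + is my own addition rather than part of the commit. If I read the Pattern escaping rules correctly, the committed \{N} is parsed as the three literal characters {, N and }, so digits end up acting as separators; the second pattern is there to test that.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of XMLSpider.
class UnicodeSplitDemo {
    // \p{L} matches any Unicode letter (upper, lower, title case, ...);
    // \p{N} matches any Unicode number. The trailing + is an addition so
    // that runs of separators do not produce empty strings.
    static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}]+");

    // The pattern exactly as committed in r24033, with \{N} instead of \p{N}.
    static final Pattern AS_COMMITTED = Pattern.compile("[^\\p{L}\\{N}]");

    static String[] splitWords(String s) {
        return NON_WORD.split(s);
    }

    public static void main(String[] args) {
        // Sample mixes accented Latin, digits and Cyrillic; written with
        // escapes to avoid depending on the source file encoding.
        String sample = "\u00C9t\u00E9 2008, \u041F\u0440\u0438\u0432\u0435\u0442";
        for (String w : splitWords(sample)) {
            System.out.println(w);
        }
        // Shows how the committed pattern treats digits.
        for (String w : AS_COMMITTED.split("ab12cd")) {
            System.out.println("[" + w + "]");
        }
    }
}
```

Since \p{L} covers both upper and lower case letters, no case information should be lost by the split itself.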
> 
> > \{N} is all numbers, including strange ones like this:
> >  http://www.fileformat.info/info/unicode/char/3251/index.htm
> >
> >  see http://www.regular-expressions.info/unicode.html#prop and
> > http://unicode.org/reports/tr18/#Categories for details
> >
> >> We need to split on the basis of Character.isLetter || isDigit, something
> >> like:
> >>
> >> for(int i=0;i<s.length();i++) {
> >>        int fullChar = Character.codePointAt(s, i);
> >>        if(Character.isSupplementaryCodePoint(fullChar)) i++;
> >>        if(Character.isLetterOrDigit(fullChar)) {
> >>                // Add to current word, or start new word
> >>        } else {
> >>                // Complete last word, or ignore
> >>        }
> >> }
> >> // Finish last word
> >
> > This is just a quick hack to make French / Russian work.
> > (I have just tested Russian ..... I guess Hebrew should work too)

You're sure it works with both upper and lower case characters? And that the 
regex is valid despite using undocumented features?
> >
> >> Of course we can special-case non-surrogate chars as an optimisation...
> >
> > Surrogate chars are not the same as SupplementaryCodePoint ...
> >
> > Supplementary code points are non-BMP (such as CJK Unified Ideographs
> > Extension B)
> > The languages in Extension B (Chinese, Korean, ancient Vietnamese, etc.)
> > are not separated by spaces -- they need special tokenizers.
> >
> > Surrogate chars need normalization -- those methods are only
> > available in Java 6, and I don't think we should ship our own
> > normalization table.

I don't follow. Character in Java 5 provides methods to construct a code point 
from either one or two (16-bit) chars, thus dealing adequately with 
supplementary characters (code points above 0xFFFF). We can therefore deal 
with Chinese, and anything up to 0x10FFFF, right?
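
To make that concrete, here is a rough sketch of a code-point-based splitter using only methods that exist since Java 5 (String.codePointAt, Character.charCount, Character.isLetterOrDigit(int), StringBuilder.appendCodePoint). The class name and the tokenize method are hypothetical, and this is an illustration of the idea rather than what the spider actually does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of XMLSpider.
class CodePointTokenizer {
    // Splits s into runs of letters and digits, walking by code point so
    // that a surrogate pair is always treated as one character.
    static List<String> tokenize(String s) {
        List<String> words = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(cp);
            if (Character.isLetterOrDigit(cp)) {
                // Add to the current word, or start a new one.
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Separator: complete the word collected so far.
                words.add(current.toString());
                current.setLength(0);
            }
        }
        // Finish the last word, if any.
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // "\uD840\uDC00" is U+20000, the first code point of CJK Extension B.
        System.out.println(tokenize("caf\u00E9 \uD840\uDC00 x1").size());
    }
}
```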

However, we do need to identify that a character is a 
pictogram/ideogram/word-character/whatever you call them and therefore needs 
to be indexed in its own right.
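
One possible approach, sketched under the assumption that checking the Unicode block is good enough for Han characters: Character.UnicodeBlock.of(int) is available in Java 5 and accepts supplementary code points. The class and method names are hypothetical, the list of blocks is deliberately incomplete, and this says nothing about other scripts that are written without spaces.

```java
// Hypothetical sketch, not part of XMLSpider.
class IdeographCheck {
    // Returns true if cp lies in one of a few CJK ideograph blocks.
    // Only Han blocks known to Java 5 are listed; this is not exhaustive.
    static boolean isCjkIdeograph(int cp) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
        return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS;
    }

    public static void main(String[] args) {
        System.out.println(isCjkIdeograph(0x4E2D));  // a BMP Han character
        System.out.println(isCjkIdeograph(0x20000)); // first code point of Extension B
        System.out.println(isCjkIdeograph('a'));     // a Latin letter
    }
}
```

A tokenizer could call such a check for each code point and emit every ideograph as its own index term, while still grouping other letters and digits into words.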
> >
> >> There doesn't seem to be any easy way to identify that a character does
> >> not require spaces and therefore should be indexed in its own right. :|
> >
> > Ya... that's why I just extended the original method to take more letters.

[freenet-dev] [freenet-cvs] r24033 - trunk/plugins/XMLSpider
