On Thu, Dec 4, 2008 at 10:47 PM, Matthew Toseland
<toad at amphibian.dyndns.org> wrote:
> On Thursday 04 December 2008 13:46, Daniel Cheng wrote:
>> On Thu, Dec 4, 2008 at 9:43 PM, Daniel Cheng <j16sdiz+freenet at gmail.com> wrote:
>> > On Thu, Dec 4, 2008 at 9:07 PM, Matthew Toseland
>> > <toad at amphibian.dyndns.org> wrote:
>> >> On Thursday 04 December 2008 09:06, j16sdiz at freenetproject.org wrote:
>> >>> Author: j16sdiz
>> >>> Date: 2008-12-04 09:06:29 +0000 (Thu, 04 Dec 2008)
>> >>> New Revision: 24033
>> >>>
>> >>> Modified:
>> >>>    trunk/plugins/XMLSpider/XMLSpider.java
>> >>> Log:
>> >>> solve bug 1714, index site with accent character
>> >>>
>> >>>
>> >>> Modified: trunk/plugins/XMLSpider/XMLSpider.java
>> >>> ===================================================================
>> >>> --- trunk/plugins/XMLSpider/XMLSpider.java    2008-12-04 08:33:28 UTC (rev 24032)
>> >>> +++ trunk/plugins/XMLSpider/XMLSpider.java    2008-12-04 09:06:29 UTC (rev 24033)
>> >>> @@ -803,7 +803,7 @@
>> >>>               MessageDigest md;
>> >>>               md = MessageDigest.getInstance("MD5");
>> >>>               byte[] md5hash = new byte[32];
>> >>> -             md.update(text.getBytes("iso-8859-1"), 0, text.length());
>> >>> +             md.update(text.getBytes("UTF-8"), 0, text.length());
>> >>
>> >> Good, but you should have incremented the version number of the spider.
>> >
>> > how?
>> >
>> >>>               md5hash = md.digest();
>> >>>               return convertToHex(md5hash);
>> >>>       }
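
One thing to watch in the hunk above: text.length() counts chars, while getBytes("UTF-8") can return more bytes than that once accented characters are involved, so update(bytes, 0, text.length()) may hash only a prefix of the text. A minimal sketch of hashing the whole byte array follows; the class and method names are made up for illustration, the hex loop is a stand-in for the spider's own convertToHex (not shown in the diff), and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not the code in XMLSpider.java.
final class Md5Sketch {
    static String md5Hex(String text) {
        // Encode first, then hash every byte of the encoded form.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
        byte[] digest = md.digest(bytes);
        // Stand-in for convertToHex: two lower-case hex digits per byte.
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```

Changing this would change the hash of every non-ASCII word again, so it is the kind of change that would go together with a version bump.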
>> >>> @@ -1176,8 +1176,9 @@
>> >>>                       else type = null;
>> >>>                       /*
>> >>>                        * determine the position of the word in the retrieved page
>> >>> +                      * FIXME - replace with a real tokenizor
>> >>>                        */
>> >>> -                     String[] words = s.split("[^A-Za-z0-9]");
>> >>> +                     String[] words = s.split("[^\\p{L}\\{N}]");
>> >>
>> >> According to the javadocs, \p{Lower} only works for US-ASCII characters,
>> >> also you're losing the upper case characters. I dunno what \{N} is.
>> >
>>
>> and \p{L} is \p{Letter}, not \p{Lower}
>> see http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
>
> Does Java actually support either of these? They don't seem to be in the regex
> docs for 1.5.

The javadoc refers to Unicode TR18 Level 1 support, and says:
  "The category names are those defined in table 4-5 of the Standard
   (p. 88), both normative and informative." (Java 1.4)
  "The supported categories are those of The Unicode Standard in the
   version specified by the Character class." (Java 5)
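
To make the documented syntax concrete, here is a minimal sketch of splitting on the general categories. Note that the number category is spelled \p{N}, with the "p"; written as \{N} inside a character class, Java reads it as the literal characters '{', 'N' and '}'. The class and method names below are invented for the example and are not part of XMLSpider.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not the code in XMLSpider.java.
final class UnicodeSplitSketch {
    // Anything that is neither a letter (\p{L}) nor a number (\p{N})
    // is treated as a separator.
    private static final Pattern SEPARATOR = Pattern.compile("[^\\p{L}\\p{N}]+");

    static List<String> words(String s) {
        List<String> out = new ArrayList<String>();
        for (String w : SEPARATOR.split(s)) {
            // split() can return a leading empty string; skip it.
            if (w.length() > 0) {
                out.add(w);
            }
        }
        return out;
    }
}
```

With this pattern, accented Latin and Cyrillic words stay whole, and characters such as U+3251 count as numbers instead of separators.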

>> > \{N} is all numbers, including strange ones like this:
>> >  http://www.fileformat.info/info/unicode/char/3251/index.htm
>> >
>> >  see http://www.regular-expressions.info/unicode.html#prop and
>> > http://unicode.org/reports/tr18/#Categories for details
>> >
>> >> We need to split on the basis of Character.isLetter || isDigit, something
>> >> like:
>> >>
>> >> for(int i=0;i<s.length();i++) {
>> >>        int fullChar = s.codePointAt(i);
>> >>        if(Character.isSupplementaryCodePoint(fullChar)) i++;
>> >>        if(Character.isLetterOrDigit(fullChar)) {
>> >>                // Add to current word, or start new word
>> >>        } else {
>> >>                // Complete last word, or ignore
>> >>        }
>> >> }
>> >> // Finish last word
>> >
>> > This is just a quick hack to make French / Russian work.
>> > (I have just tested Russian ..... I guess Hebrew should work too)
>
> You're sure it works with both upper and lower case characters? And that the

Both upper- and lower-case characters work in most languages
(not Turkish, and maybe some other languages -- they need special handling)
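
To illustrate the Turkish case: assuming the index lower-cases words before storing them (nothing in this thread says that it does), the result depends on the locale used. A small sketch, with made-up class and method names; Locale.ROOT needs Java 6 and Locale.forLanguageTag needs Java 7.

```java
import java.util.Locale;

// Hypothetical helper, only to show the locale dependence.
final class CaseFoldSketch {
    // Locale-independent folding: 'I' becomes 'i'.
    static String indexKey(String word) {
        return word.toLowerCase(Locale.ROOT);
    }

    // Turkish rules: 'I' becomes dotless 'ı' (U+0131).
    static String turkishKey(String word) {
        return word.toLowerCase(Locale.forLanguageTag("tr"));
    }
}
```

So a word indexed under one rule will not be found by a search that folds case under the other; whichever rule is picked has to be used on both sides.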

> regex is valid despite using undocumented features?

The regex is documented -- see above.

>> >
>> >> Of course we can special-case non-surrogate chars as an optimisation...
>> >
>> > Surrogate chars are not the same as SupplementaryCodePoint ...
>> >
>> > Supplementary code points are non-BMP (such as CJK Unified Ideographs
>> > Extension B)
>> > The languages in Extension B (Chinese, Korean, ancient Vietnamese, etc...)
>> > are not separated by spaces -- they need special tokenizers.
>> >
>> > Surrogate chars need normalization -- those methods are only
>> > available in Java 6, and I don't think we should ship our own
>> > normalization table.
>
> I don't follow. Character in java 5 provides methods to construct a code point
> from either one or two (16-bit) char's, thus dealing adequately with
> supplementary (32-bit) characters. We can therefore deal with chinese, and
> anything up to 0x10FFFF, right?

Yes, but it is pointless at this stage -- see below.

> However, we do need to identify that a character is a
> pictogram/ideogram/word-character/whatever you call them and therefore needs
> to be indexed in its own right.

Yes ... That's why Chinese won't work anyway.
We need a real tokenizer for that.
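
For reference, the code point loop suggested above could look roughly like the sketch below. It is not a real tokenizer: emitting every ideograph as its own one-character word is only an assumption about what "indexed in its own right" would mean, Character.isIdeographic needs Java 7, and the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not a replacement for a proper CJK tokenizer.
final class CodePointTokenizer {
    static List<String> tokenize(String s) {
        List<String> words = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // Advance by one or two chars, so surrogate pairs stay together.
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                // Assumption: each ideograph is indexed as a word of its own.
                flush(current, words);
                words.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else {
                flush(current, words);
            }
        }
        flush(current, words);
        return words;
    }

    // Completes the word in progress, if there is one.
    private static void flush(StringBuilder current, List<String> words) {
        if (current.length() > 0) {
            words.add(current.toString());
            current.setLength(0);
        }
    }
}
```

This handles supplementary code points such as Extension B, but it still knows nothing about word boundaries inside a run of ideographs.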

>> >> There doesn't seem to be any easy way to identify that a character does
>> >> not require spaces and therefore should be indexed in its own right. :|
>> >
>> > Ya... that's why I just extended the original method to take more letters.
>
> _______________________________________________
> Devl mailing list
> Devl at freenetproject.org
> http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl
>
