Hi Ken,

On 5/27/07, Ken Harris <[EMAIL PROTECTED]> wrote:
> For my Python/GTK/libbeagle program, I want to support Unicode fully
> (ha), so I spent some time learning how to work with Unicode in C#
> (where 'char' is only 16 bits -- d'oh!) for my Beagle filter. I
> thought I had it all figured out...

I am definitely not an expert in these matters, but my understanding
is that Mono uses UTF-16 as its internal Unicode representation. When
passing strings to native libraries, Mono converts them to UTF-8 by
default. GTK, the widget toolkit that beagle-search uses, requires
strings to be in UTF-8.

> When I couldn't make it work, I just made a plain text file with 3
> Latin characters, 3 Georgian characters, and 3 Linear B (i.e.,
> non-BMP) characters, and saved it as UTF-8. Then I fired up "Desktop
> Search" / "beagle-search" (every app under GNOME seems to have two
> names!) and tried searching by each triple. As I feared, Latin and
> Georgian worked, but Linear B didn't. (From Python, it looks like
> U+10000 is coming out as 2 ASCII spaces.)

The big question here is: what part of the search is failing? There
are lots of candidates: analyzing the characters into words,
converting to UTF-8 to send the query over the wire, displaying the
results, etc. Also, I have no idea how Python handles Unicode data
(the last time I used it heavily -- in 2004 or so -- it didn't handle
it very well).

If you search using the command-line program beagle-query, do you find
the files?

> Does Beagle not support Unicode >3.0 yet? Is somebody working on it
> already? Do Beagle's dependencies (like Lucene or Gtk#) handle newer
> Unicode versions? (Hopefully it can be upgraded piecemeal, and not
> one-huge-change-all-at-once.)

Beagle by itself doesn't deal with character encodings at all.

As for the underlying libs: GTK requires UTF-8, and underneath it GLib
deals with the different Unicode versions. According to the ChangeLog,
GLib has supported Unicode 3.0 since version 2.0. (4.1 support was
added in October 2005 and is in 2.10.0; 5.0 was added in July 2006 and
went into 2.12.2.) So we should be fine there.

It's definitely possible that Lucene doesn't have any special handling
of these characters.

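To make the UTF-16 vs. UTF-8 point concrete, here is a minimal sketch
of how the three kinds of characters in your test file differ in size
under each encoding. It assumes Python 3, and the names (samples,
utf16_code_units, describe) are made up for illustration; it says
nothing about where in Beagle the search actually breaks.

```python
# Sample text: Latin, Georgian (in the BMP), and Linear B (outside it).
samples = {
    "Latin": "abc",
    "Georgian": "\u10d0\u10d1\u10d2",
    "Linear B": "\U00010000\U00010001\U00010002",
}


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


def describe(text):
    """Count code points, UTF-16 code units, and UTF-8 bytes in text."""
    return {
        "code_points": len(text),
        "utf16_units": len(utf16_code_units(text)),
        "utf8_bytes": len(text.encode("utf-8")),
    }


for name, text in samples.items():
    print(name, describe(text))

# A non-BMP character needs two 16-bit units (a surrogate pair).
print([hex(unit) for unit in utf16_code_units("\U00010000")])
```

Only the Linear B characters take two UTF-16 code units each, so any
code that treats one 16-bit 'char' as one character will see them
differently from the Latin and Georgian ones.
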
You might want to try running beagle-extract-content on the file to
see if the data is extracted reasonably.

Joe
_______________________________________________
Dashboard-hackers mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/dashboard-hackers
