Re: Unicode?

Joe Shaw Tue, 29 May 2007 10:33:21 -0700

Hi Ken,

On 5/27/07, Ken Harris <[EMAIL PROTECTED]> wrote:
> For my Python/GTK/libbeagle program, I want to support Unicode fully
> (ha), so I spent some time learning how to work with Unicode in C#
> (where 'char' is only 16 bits -- d'oh!) for my Beagle filter.  I
> thought I had it all figured out...


I am definitely not an expert in these matters, but my understanding
is that Mono internally uses UTF-16 as its Unicode representation, so
a non-BMP character takes two 16-bit chars (a surrogate pair).

When working with native libraries, by default Mono converts to UTF-8
when passing strings.  GTK, which is the widget toolkit that
beagle-search uses, requires that strings be in UTF-8.
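
To make the two representations concrete, here is a small sketch
(assuming Python 3; the sample characters and the code_units helper are
just illustrative picks, not anything from Beagle) that counts UTF-16
code units and UTF-8 bytes for one Latin, one Georgian, and one Linear B
character:

```python
# Illustrative samples: Latin 'a', Georgian U+10D0, Linear B U+10000 (non-BMP).
samples = {"latin": "a", "georgian": "\u10d0", "linear_b": "\U00010000"}

def code_units(ch):
    """Return (UTF-16 code unit count, UTF-8 byte count) for a one-character string."""
    utf16 = ch.encode("utf-16-be")  # big-endian, no BOM, 2 bytes per code unit
    utf8 = ch.encode("utf-8")
    return len(utf16) // 2, len(utf8)

for name, ch in samples.items():
    units, nbytes = code_units(ch)
    print(f"{name}: U+{ord(ch):04X} -> {units} UTF-16 unit(s), {nbytes} UTF-8 byte(s)")
```

Only the Linear B character needs two UTF-16 units, which is why code
that looks at one 16-bit char at a time can go wrong there while Latin
and Georgian still work.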

> When I couldn't make it work, I just made a plain text file with 3
> Latin characters, 3 Georgian characters, and 3 Linear B (i.e.,
> non-BMP) characters, and saved it as UTF-8.  Then I fired up "Desktop
> Search" / "beagle-search" (every app under GNOME seems to have two
> names!) and tried searching by each triple.  As I feared, Latin and
> Georgian worked, but Linear B didn't.  (From Python, it looks like
> U+10000 is coming out as 2 ASCII spaces.)

The big question here is: what part of the search is failing?  There
are lots of candidates: analyzing the characters into words, converting
the query to UTF-8 to send it over the wire, displaying the results,
etc.  Also, I have no idea how Python handles Unicode data (the last
time I used it heavily -- in 2004 or so -- it didn't handle it very
well).

If you search using the command-line program beagle-query, do you find
the files?
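
Before blaming any stage of the search, it may be worth ruling out the
test file itself.  A rough sketch, assuming Python 3 (the path, the
particular characters, and the write_and_check helper are all made up
for illustration): write the three triples as UTF-8, read the raw bytes
back, and confirm the Linear B code points survive the round trip.

```python
import os
import tempfile

# Illustrative triples: Latin, Georgian (BMP), Linear B (non-BMP).
LATIN = "abc"
GEORGIAN = "\u10d0\u10d1\u10d2"
LINEAR_B = "\U00010000\U00010001\U00010002"

def write_and_check(path):
    """Write the test text as UTF-8, re-read it, and report what came back."""
    text = "\n".join([LATIN, GEORGIAN, LINEAR_B]) + "\n"
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "rb") as f:
        decoded = f.read().decode("utf-8")
    # Third line holds the Linear B triple; list its code points in hex.
    linear_b_points = [hex(ord(c)) for c in decoded.splitlines()[2]]
    return decoded == text, linear_b_points

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "unicode-test.txt")
    print(write_and_check(path))
```

If this reports the code points intact, the file is fine and the
problem is somewhere between indexing and display.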

> Does Beagle not support Unicode >3.0 yet?  Is somebody working on it
> already?  Do Beagle's dependencies (like Lucene or Gtk#) handle newer
> Unicode versions?  (Hopefully it can be upgraded piecemeal, and not
> one-huge-change-all-at-once.)

Beagle by itself doesn't deal with character encodings at all.  As for
the underlying libs: GTK requires UTF-8, and underneath it GLib tracks
the different Unicode versions.  Looking at the GLib ChangeLog, it looks
like GLib has supported Unicode 3.0 since 2.0.  (4.1 support was added
in October 2005 and is in 2.10.0; 5.0 was added in July 2006 and went
into 2.12.2.)  So we should be fine there.

It's definitely possible that Lucene doesn't have any special handling
of these characters.  You might want to try running
beagle-extract-content on the file to see if the data is extracted
reasonably.
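
To illustrate the kind of failure I have in mind (this is NOT Lucene's
actual code, just a hypothetical tokenizer written in Python 3 that
classifies each UTF-16 unit on its own): a surrogate half is not a
letter by itself, so a non-BMP word would silently vanish while Latin
and Georgian come through.

```python
import unicodedata

def utf16_units(text):
    """Split text into its UTF-16 code units (as integers)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def naive_tokenize(text):
    """Hypothetical tokenizer: keeps runs of units whose category is a letter.

    Each 16-bit unit is judged in isolation, so surrogate halves
    (category 'Cs') are treated as separators and dropped.
    """
    tokens, current = [], []
    for unit in utf16_units(text):
        ch = chr(unit)
        if unicodedata.category(ch).startswith("L"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

if __name__ == "__main__":
    sample = "abc \u10d0\u10d1\u10d2 \U00010000\U00010001\U00010002"
    print(naive_tokenize(sample))
```

Whether anything like this actually happens inside Lucene is exactly
what the beagle-extract-content check should help narrow down.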

Joe