On Tue, 13 Dec 2005, tsuraan wrote:

I was wondering how PyLucene does unicode support.  What I had been
doing was running python's string.decode function to get a unicode
object that I was passing to my C++ backend as a plain array of bytes
(I don't even know if this worked, but it seems reasonable).

PyLucene is based on Java Lucene, and as such the Java layer only accepts 16-bit unicode strings. When you pass a Python 'str' to a PyLucene API, it is assumed to be encoded in utf-8 and is converted to unicode for Java accordingly. If you pass in a Python unicode string, then, depending on the size of the Python unicode char on your platform, the chars are either passed as-is or cast to 16 bits. The 32-to-16-bit cast is likely to be bogus for unicode chars that have more than 16 significant bits. This is a known bug.
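To illustrate the two conversions described above without PyLucene itself, here is a plain-Python sketch: a byte string is decoded under the utf-8 assumption, and a character above U+FFFF shows why a plain 32-to-16-bit cast loses information (in Java's 16-bit representation it needs a surrogate pair):

```python
# Plain-Python illustration of the conversions described above;
# PyLucene performs these internally.

# A byte string ('str' in Python 2) is assumed to be utf-8.
raw = b"na\xc3\xafve"            # "naive" with an i-diaeresis, utf-8 encoded
text = raw.decode("utf-8")
assert text == u"na\u00efve"

# A char above U+FFFF has more than 16 significant bits, so a simple
# 32-to-16-bit cast truncates it.  In 16-bit (Java) unicode it must be
# encoded as a surrogate pair instead:
astral = u"\U00010400"
utf16 = astral.encode("utf-16-be")
assert utf16 == b"\xd8\x01\xdc\x00"   # surrogate pair D801 DC00
```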

If I get a unicode string in a document I'm parsing, how do I search on that
string from python?  Do I just give the constructors to my Term
objects and QueryParsers unicode strings, and have them use that?  I
haven't had the courage to dive into the source yet, and I figured
this would probably be an easy question for you to answer :)

The back-and-forth conversion between Python str/unicode and Java unicode is handled automatically for you by the p2j() and j2p() functions defined in PyLucene.i. So yes, you can pass unicode strings directly to your Term and QueryParser constructors.
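The practical consequence is that an utf-8 encoded str and the corresponding unicode string denote the same term text once converted. A minimal sketch of that equivalence, using a hypothetical stand-in for the p2j() conversion (plain Python, no PyLucene required):

```python
def to_java_unicode(s):
    """Hypothetical stand-in for PyLucene's p2j() conversion:
    byte strings are assumed to be utf-8; unicode passes through."""
    if isinstance(s, bytes):
        return s.decode("utf-8")
    return s

# An utf-8 byte string and the equivalent unicode string convert to
# the same term text, so either form can be handed to the API.
assert to_java_unicode(b"\xc3\xa9cole") == to_java_unicode(u"\u00e9cole")
```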

Andi..
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
