[pylucene-dev] Re: Bug in j2p

Andi Vajda Fri, 06 Jan 2006 09:52:34 -0800


On Fri, 6 Jan 2006, tsuraan wrote:

I'm noticing another issue, this time in p2j.  I'm indexing emails for
search, and often times I get the exception that the given string is
not unicode or utf-8.  In the p2j function, JvNewStringUTF is the only
conversion attempted, and the exception is thrown if that fails to
return a string.  Would it make sense to try a JvNewStringLatin1 if
the UTF function fails?  I don't know much about encodings, but I was
just thinking that maybe a lot of the messages I'm seeing that aren't
valid UTF messages might be valid Latin1 strings (whatever those are).
Does that make sense as a legitimate solution?  I'm going to give it
a try to see if the problem goes away, but I'd like to hear what you
think about it.

The 'str' objects passed to PyLucene are expected to be 'utf-8' (or a subset
thereof such as 'ascii') encoded. If they are not, you need to make them
unicode strings in python first, before passing them on to PyLucene. Only you
have the actual knowledge what the correct encoding might be. Guessing
encodings is fraught with issues. In the email domain (and web, rss, etc...)
you also have to work around the fact that the encoding data claims to be in
may be incorrect or bogus altogether. It is therefore much safer if you take
care, in python, of converting the 'str' to 'unicode' instance using the
unicode('str', 'encoding name') constructor (for example) before passing them
to PyLucene. Java only works with Unicode, think of PyLucene's accepting of
'utf-8' encoded str objects as mere convenience.
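
As a rough illustration of the conversion described above, here is a minimal
sketch. The helper name decode_for_indexing and the policy of falling back to
a lossy 'utf-8' decode when the declared charset is wrong or unknown are
assumptions made for this example, not part of PyLucene. The declared
encoding is assumed to come from the caller, for instance from the message's
Content-Type header.

```python
def decode_for_indexing(raw, declared_encoding):
    """Turn an encoded byte string into a unicode string before indexing.

    raw.decode(enc) is equivalent to unicode(raw, enc) in Python 2.
    """
    try:
        # Trust the encoding the caller knows (or was told) applies.
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # The declared charset was incorrect or bogus. Assumed policy for
        # this sketch: decode as utf-8 and replace undecodable bytes with
        # U+FFFD rather than guess another encoding.
        return raw.decode("utf-8", "replace")


# A Latin-1 encoded body that is not valid utf-8.
body = decode_for_indexing(b"caf\xe9", "iso-8859-1")
```

The resulting unicode string can then be handed to PyLucene directly, so no
encoding has to be guessed on the Java side.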

It also looks like p2j suffers from the same stack allocation flaw you
reported yesterday. I should be fixing this shortly.


Andi..
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
