On Fri, 6 Jan 2006, tsuraan wrote:

I'm noticing another issue, this time in p2j.  I'm indexing emails for
search, and often times I get the exception that the given string is
not unicode or utf-8.  In the p2j function, JvNewStringUTF is the only
conversion attempted, and the exception is thrown if that fails to
return a string.  Would it make sense to try a JvNewStringLatin1 if
the UTF function fails?  I don't know much about encodings, but I was
just thinking that maybe a lot of the messages I'm seeing that aren't
valid UTF messages might be valid Latin1 strings (whatever those are).
Does that make sense as a legitimate solution?  I'm going to give it
a try to see if the problem goes away, but I'd like to hear what you
think about it.

The 'str' objects passed to PyLucene are expected to be 'utf-8' encoded (or a subset thereof, such as 'ascii'). If they are not, you need to turn them into unicode strings in python first, before passing them on to PyLucene. Only you know what the correct encoding might be, and guessing encodings is fraught with issues. In the email domain (and web, rss, etc...) you also have to work around the fact that the encoding the data claims to be in may be incorrect or bogus altogether.

It is therefore much safer to convert the 'str' to a 'unicode' instance yourself, in python, using the unicode(str, 'encoding name') constructor (for example) before passing it to PyLucene. Java only works with Unicode; think of PyLucene's acceptance of 'utf-8' encoded str objects as a mere convenience.
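
A minimal sketch of what that conversion could look like on the python side. The helper name to_text and its fallback order are assumptions made for illustration, not part of PyLucene: it tries the encoding the message declares (if any), then 'utf-8', then 'latin-1'. Note that 'latin-1' maps every byte to a code point, so the last step never fails and can silently produce wrong characters; it is a last resort, not a detection method.

```python
def to_text(raw, declared_encoding=None):
    """Decode a byte string into a unicode string.

    Hypothetical helper: the fallback order is an assumption, adjust it
    to whatever you know about your own data.
    """
    candidates = []
    if declared_encoding:
        # The declared charset of an email may be wrong or bogus.
        candidates.append(declared_encoding)
    candidates.append('utf-8')

    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: bytes are not valid in this encoding.
            # LookupError: python does not know this encoding name.
            continue

    # Every byte is valid latin-1, so this cannot raise, but the
    # result may be garbage if the data was in some other encoding.
    return raw.decode('latin-1')
```

The result of to_text(body, charset_from_headers) is a unicode string and can be handed to PyLucene as-is; body and charset_from_headers stand in for whatever your email parsing code provides.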

It also looks like p2j suffers from the same stack allocation flaw you reported yesterday. I should be fixing this shortly.

Andi..
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
