On Fri, 6 Jan 2006, tsuraan wrote:
> I'm noticing another issue, this time in p2j. I'm indexing emails for
> search, and often times I get the exception that the given string is
> not unicode or utf-8. In the p2j function, JvNewStringUTF is the only
> conversion attempted, and the exception is thrown if that fails to
> return a string. Would it make sense to try a JvNewStringLatin1 if
> the UTF function fails? I don't know much about encodings, but I was
> just thinking that maybe a lot of the messages I'm seeing that aren't
> valid UTF messages might be valid Latin1 strings (whatever those are).
> Does that make sense as a legitimate solution? I'm going to give it
> a try to see if the problem goes away, but I'd like to hear what you
> think about it.

The 'str' objects passed to PyLucene are expected to be 'utf-8' encoded (or
a subset thereof, such as 'ascii'). If they are not, you need to convert them
to unicode strings in Python first, before passing them on to PyLucene. Only
you know what the correct encoding might be, and guessing encodings is fraught
with issues. In the email domain (and web, rss, etc...) you also have to work
around the fact that the encoding the data claims to be in may be incorrect or
bogus altogether. It is therefore much safer to take care, in Python, of
converting each 'str' to a 'unicode' instance, using the
unicode(s, 'encoding name') constructor (for example), before passing it to
PyLucene. Java only works with Unicode; think of PyLucene's acceptance of
'utf-8' encoded str objects as a mere convenience.
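
As a rough illustration of doing the conversion on the Python side, here is
a minimal sketch. The helper name to_unicode, its parameters, and the fallback
order are all assumptions made for this example and are not part of PyLucene.
It tries the charset the message declares first, then a list of fallbacks.
Note that 'latin-1' accepts any byte sequence, so falling back to it never
fails but may silently produce wrong characters, which is exactly the risk of
guessing described above.

```python
def to_unicode(raw, declared_charset=None, fallbacks=("utf-8", "latin-1")):
    """Decode a byte string to unicode text (hypothetical helper).

    The declared charset (e.g. taken from an email header) is tried first,
    then each fallback in order. The fallback order is an assumption.
    """
    candidates = []
    if declared_charset:
        candidates.append(declared_charset)
    candidates.extend(fallbacks)

    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: the charset name itself is unknown or bogus.
            continue

    # Only reached when no candidate worked (for example if the caller
    # removed 'latin-1' from the fallbacks): decode as UTF-8 and replace
    # undecodable bytes with U+FFFD.
    return raw.decode("utf-8", "replace")
```

The unicode result, rather than the original byte string, is what would then
be handed to PyLucene. Since .decode() is used instead of the unicode()
constructor, the sketch should behave the same way on newer Python versions.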
It also looks like p2j suffers from the same stack allocation flaw you
reported yesterday. I should be fixing this shortly.
Andi..