[
https://issues.apache.org/jira/browse/LUCENE-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000435#comment-13000435
]
Robert Muir commented on LUCENE-2942:
-------------------------------------
Uwe: my plan is to actually fix toString itself (toString should be human
readable, that's its purpose!)
The existing code should be renamed to bytesToString() or hexToString() or
something of that nature, so that if you explicitly want the bytes you can
still get them.
{quote}
Internally it could quickly be implemented as calling utf8ToString() and
falling back on Exception. Or is there a faster way to detect if it's valid UTF-8?
{quote}
Not really, you cannot trust the JRE to do this correctly, e.g.
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6982052
Additionally the behavior on malformed bytes is undefined, e.g. IBM JREs use
IGNORE but Sun JREs use REPLACE... even if they actually detected it correctly :)
Don't worry, I will take care of this part.
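
For illustration only, here is a minimal sketch of a UTF-8 well-formedness check that does not go through the JRE's decoder at all. It follows the byte ranges of well-formed UTF-8 sequences from the Unicode Standard (rejecting overlong forms, surrogates, and code points above U+10FFFF). The class and method names (Utf8Check, isValidUtf8) are hypothetical and are not part of Lucene; this is not necessarily how the fix will be implemented.

```java
public final class Utf8Check {
    private Utf8Check() {}

    /**
     * Returns true if bytes[offset, offset + length) is a well-formed UTF-8
     * sequence. Hypothetical helper, not a Lucene API.
     */
    public static boolean isValidUtf8(byte[] bytes, int offset, int length) {
        int i = offset;
        final int end = offset + length;
        while (i < end) {
            int lead = bytes[i++] & 0xFF;
            if (lead < 0x80) {
                continue; // single-byte (ASCII)
            }
            int needed;             // number of continuation bytes expected
            int min = 0x80;         // allowed range for the FIRST continuation byte
            int max = 0xBF;
            if (lead >= 0xC2 && lead <= 0xDF) {
                needed = 1;
            } else if (lead >= 0xE0 && lead <= 0xEF) {
                needed = 2;
                if (lead == 0xE0) min = 0xA0;       // reject overlong 3-byte forms
                else if (lead == 0xED) max = 0x9F;  // reject UTF-16 surrogates
            } else if (lead >= 0xF0 && lead <= 0xF4) {
                needed = 3;
                if (lead == 0xF0) min = 0x90;       // reject overlong 4-byte forms
                else if (lead == 0xF4) max = 0x8F;  // reject code points > U+10FFFF
            } else {
                return false; // 0x80..0xC1 and 0xF5..0xFF are never valid lead bytes
            }
            if (end - i < needed) {
                return false; // sequence is truncated
            }
            for (int k = 0; k < needed; k++) {
                int b = bytes[i++] & 0xFF;
                if (b < min || b > max) {
                    return false;
                }
                // only the first continuation byte has a restricted range
                min = 0x80;
                max = 0xBF;
            }
        }
        return true;
    }
}
```

Because the check is a single pass with no allocation and no exception handling, it also avoids the cost of decoding and catching an exception just to learn that the bytes are not text.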
> toString() methods on term/queries/etc are wrong: assume utf-8 encoded bytes.
> -----------------------------------------------------------------------------
>
> Key: LUCENE-2942
> URL: https://issues.apache.org/jira/browse/LUCENE-2942
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 4.0
> Reporter: Robert Muir
>
> In Lucene's trunk, a Term is just a BytesRef.
> In a lot of cases this is a UTF-8 encoded string, but in some cases it's not
> (e.g. collation fields).
> The problem is that the toString methods all currently call utf8ToString().
> This is wrong, though from a practical point of view I think just printing
> the bytes won't be very helpful for debugging most cases where the bytes
> really are UTF-8 encoded.
> So I think in these cases we should use the following technique: if the bytes
> are a valid UTF-8 sequence, use BytesRef.utf8ToString(), otherwise just print
> the bytes: BytesRef.toString()
> It's no problem for performance because toString is only for debugging anyway.
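
The technique proposed in the description (decode when the bytes are valid UTF-8, otherwise print the raw bytes) could look roughly like the sketch below. It works on a plain byte[] slice rather than a BytesRef so that it stands alone. The names TermDebugString, toDebugString and toHex are hypothetical, and the bracketed hex output format is an assumption for illustration. The decoder is configured with CodingErrorAction.REPORT so that malformed input raises an exception instead of being silently ignored or replaced. Note that this version still relies on the JRE's decoder to detect malformed input, which the comment above cautions against.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class TermDebugString {
    private TermDebugString() {}

    /**
     * Returns the slice decoded as UTF-8 if it is well-formed according to
     * the JRE decoder, otherwise a hex dump of the bytes.
     * Hypothetical helper, not a Lucene API.
     */
    public static String toDebugString(byte[] bytes, int offset, int length) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the outcome explicit instead of depending on
                // a vendor-specific IGNORE/REPLACE default
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes, offset, length))
                .toString();
        } catch (CharacterCodingException e) {
            // not valid UTF-8 (e.g. a collation key): show the raw bytes
            return toHex(bytes, offset, length);
        }
    }

    /** Formats the slice as space-separated hex values in brackets. */
    static String toHex(byte[] bytes, int offset, int length) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = offset; i < offset + length; i++) {
            if (i > offset) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(bytes[i] & 0xFF));
        }
        return sb.append(']').toString();
    }
}
```

Since toString is only used for debugging, the cost of the exception on the fallback path should not matter in practice.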