[ 
https://issues.apache.org/jira/browse/LUCENENET-188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Digy updated LUCENENET-188:
---------------------------

    Attachment: TestIndexInput.patch

{quote} {color:red} 
The Java programming language, which uses UTF-16 for its internal text 
representation, supports a non-standard modification of UTF-8 for string 
serialization. This encoding is called modified UTF-8. There are two 
differences between modified and standard UTF-8. The first difference is that 
the null character (U+0000) is encoded with two bytes instead of one, 
specifically as 11000000 10000000. This ensures that there are no embedded 
nulls in the encoded string, presumably to address the concern that the 
encoded string might be truncated if it were processed in a language such as 
C, where a null byte signifies the end of a string.
{color} {quote}

This explains the difference: Java decodes the byte pair 0xC0 0x80 as the null 
character, while .NET treats it as an invalid UTF-8 sequence.
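The two-byte encoding of U+0000 is easy to observe from Java itself: DataOutputStream.writeUTF serializes strings in modified UTF-8, so a null character comes out as the bytes 0xC0 0x80 and round-trips back through readUTF. A small sketch (class name is just for illustration):

```java
import java.io.*;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        // Serialize a string containing U+0000 with Java's modified-UTF-8 writer.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF("a\u0000b");
        byte[] bytes = bos.toByteArray();

        // Layout: 2-byte length prefix (0x00 0x04), then the payload
        // 0x61 ('a'), 0xC0 0x80 (U+0000), 0x62 ('b').
        System.out.println(Integer.toHexString(bytes[3] & 0xFF)); // c0
        System.out.println(Integer.toHexString(bytes[4] & 0xFF)); // 80

        // readUTF round-trips the pair back to a real null character.
        String s = new DataInputStream(new ByteArrayInputStream(bytes)).readUTF();
        System.out.println(s.charAt(1) == '\u0000'); // true
    }
}
```

A strict UTF-8 decoder, such as System.Text.Encoding.UTF8 in .NET, rejects 0xC0 0x80 as an overlong sequence, which is exactly the mismatch this test trips over.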

DIGY

> Index/TestIndexInput/TestRead fails -  (invalid UTF8 sequence).
> ---------------------------------------------------------------
>
>                 Key: LUCENENET-188
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-188
>             Project: Lucene.Net
>          Issue Type: Bug
>         Environment: Lucene.Net 2.4.0
>            Reporter: Digy
>            Priority: Trivial
>         Attachments: IndexInput.patch, TestIndexInput.patch
>
>
> This test fails because "System.Text.Encoding.UTF8.GetString(bytes, 0, 
> length)" emits the \ufffd character for invalid UTF-8 sequences, while Java's 
> "String(bytes, 0, length, "UTF-8")" outputs \x00.
> I will attach a badly implemented patch to demonstrate the problem, but I 
> won't commit it unless a clever (and performant) solution is found.
> DIGY.
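For illustration, the byte-level logic a lenient decoder needs is small: it is an ordinary UTF-8 decoding loop that simply does not reject the overlong two-byte form, so 0xC0 0x80 falls out as U+0000. A hand-rolled sketch (class and method names are hypothetical, not from the attached patch; surrogates and 4-byte sequences are omitted for brevity):

```java
// Sketch of a lenient (modified-UTF-8-tolerant) decoder. Unlike a strict
// UTF-8 decoder, it accepts the overlong pair 0xC0 0x80 and maps it to U+0000.
public class LenientUtf8 {
    static String decode(byte[] b, int off, int len) {
        StringBuilder sb = new StringBuilder(len);
        int i = off, end = off + len;
        while (i < end) {
            int b0 = b[i++] & 0xFF;
            if (b0 < 0x80) {
                sb.append((char) b0);                       // 1-byte sequence
            } else if ((b0 & 0xE0) == 0xC0) {               // 2-byte sequence, incl. C0 80 -> U+0000
                sb.append((char) (((b0 & 0x1F) << 6) | (b[i++] & 0x3F)));
            } else if ((b0 & 0xF0) == 0xE0) {               // 3-byte sequence
                sb.append((char) (((b0 & 0x0F) << 12)
                                | ((b[i++] & 0x3F) << 6)
                                |  (b[i++] & 0x3F)));
            } else {
                sb.append('\uFFFD');                        // anything else: replacement char
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = { 0x61, (byte) 0xC0, (byte) 0x80, 0x62 }; // 'a', modified null, 'b'
        String s = decode(bytes, 0, bytes.length);
        System.out.println(s.length());        // 3
        System.out.println((int) s.charAt(1)); // 0
    }
}
```

The same loop ported to C# would give Lucene.Net the Java-compatible behavior, at the cost of bypassing the optimized Encoding.UTF8 path.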

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
