[ 
https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772133#action_12772133
 ] 

Steven Rowe commented on LUCENE-2019:
-------------------------------------

bq. Steven, the only reason I might disagree is that a Lucene Index is supposed 
to be portable across different languages other than Lucene Java.

Right, but not all Lucene indexes in the wild are accessed from more than one 
language.  The vast majority of Lucene indexes, I'd venture to guess, are 
used from a single language and a single process.

bq. in my opinion, if you are to store process-internal codepoints as abstract 
characters in terms, then you should not claim that Lucene indexes are in any 
Unicode format, because then they violate the standard.

I strongly disagree with the assumption that interchange and serialization are 
synonymous.

bq. By *not* storing them in terms, then you are free to use them as 
delimiters, or other purposes. right now U+FFFF is used as a delimiter, but who 
knows, maybe someday you might need more?

I actually agree with this argument.  What if Lucene needs more 
process-internal characters?  I don't have any way of gauging the probability 
that it will in the future (other than the last eight years of history, during 
which only one, U+FFFF, was deemed necessary).  But what does Mike M. say? 
"Design for now" or something like that?

> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2019
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2019
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in Unicode; we should not store 
> these in the index.
> Instead, they should be mapped to the replacement character (U+FFFD), so they 
> can be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF 
> process-internally: it can't be in the index or it will cause problems.
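
For illustration only, the mapping described in the issue could look roughly like the sketch below. It is not the attached LUCENE-2019.patch; the class and method names (NoncharacterMapper, isNoncharacter, mapNoncharacters) are hypothetical. It assumes "process-internal codepoints" means the Unicode noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFF).

```java
public final class NoncharacterMapper {
    // U+FFFD REPLACEMENT CHARACTER, the mapping target named in the issue.
    private static final int REPLACEMENT = 0xFFFD;

    private NoncharacterMapper() {}

    // Assumption: "process-internal" == Unicode noncharacter.
    // Covers U+FDD0..U+FDEF and U+nFFFE / U+nFFFF in every plane.
    public static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of s with every noncharacter replaced by U+FFFD.
    // Iterates by code point so supplementary-plane values are handled.
    public static String mapNoncharacters(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(isNoncharacter(cp) ? REPLACEMENT : cp);
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

With a mapping like this applied before terms are written, U+FFFF never reaches the index, so it stays available as an internal delimiter.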
