[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781420#action_12781420 ]

Robert Muir commented on LUCENE-1458:
-------------------------------------

{quote}
I realize a java String can easily contain an unpaired surrogate (eg,
your test case) since it operates in code units not code points, but,
that's not valid unicode, right?
{quote}

It is valid Unicode: it is a valid "Unicode String". That is different from a Term stored in the index, which is stored as UTF-8 and thus purports to be in a valid Unicode encoding form.

From the Unicode Standard, Chapter 3 (Conformance):

However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate directly on code units, without interpreting them as characters. See, especially, the discussion under D89.

D89: Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form.
• For example, it is perfectly reasonable to talk about an operation that takes the two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which contains an ill-formed UTF-16 code unit sequence, and concatenates them to form another Unicode string <004D D800 DF02 004D>, which contains a well-formed UTF-16 code unit sequence. The first two Unicode strings are not in UTF-16, but the resultant Unicode string is.
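
In Java terms, the D89 example looks like this (a small illustration I'm adding here, nothing Lucene-specific):

{code}
// Each half contains an ill-formed UTF-16 code unit sequence, yet their
// concatenation is well-formed: <004D D800 DF02 004D> decodes to M, U+10302, M.
public class D89Example {
    public static void main(String[] args) {
        String a = "\u004D\uD800"; // ends with an unpaired lead surrogate
        String b = "\uDF02\u004D"; // starts with an unpaired trail surrogate
        String c = a + b;          // a perfectly legal Java (Unicode) String operation
        System.out.println(c.codePointCount(0, c.length()));       // 3
        System.out.println(Integer.toHexString(c.codePointAt(1))); // 10302
    }
}
{code}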

{quote}
But how would a search application based on an east asian language
actually create such a term? In what situation would an unpaired
surrogate find its way down to TermEnum?
{quote}
I already gave an example: FuzzyQuery with, say, a prefix length of one. With the current code, even in the flex branch, this will create a lead-surrogate prefix. There is code in the Lucene core that does things like this (which I plan to fix, while also trying to preserve back compat!). This proposed change makes it impossible to preserve that back compat.

There is also probably a lot of non-Lucene east asian code that does similar things. For example, someone with data from Hong Kong almost certainly encounters supplementary characters, because they are part of Big5-HKSCS. They may not even be aware of the issue: they might take a string, call substring(0, 1), and run a prefix query (see the sketch below). Right now this works!
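
A minimal sketch of that scenario (the specific character here is just illustrative):

{code}
// substring(0, 1) on text that starts with a supplementary character yields an
// unpaired lead surrogate, yet it still works as a code-unit-level prefix.
public class SurrogatePrefixDemo {
    public static void main(String[] args) {
        // U+2070E, a CJK Extension B ideograph of the kind that shows up in Big5-HKSCS data
        String word = new StringBuilder().appendCodePoint(0x2070E).append('\u6C34').toString();

        String prefix = word.substring(0, 1); // "\uD841": a lone lead surrogate
        System.out.println(Character.isHighSurrogate(prefix.charAt(0))); // true

        // Any term beginning with the full character also begins (in UTF-16
        // code unit order) with its lead surrogate, so prefix matching still works.
        System.out.println(word.startsWith(prefix)); // true
    }
}
{code}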

This is part of the point: for most operations (such as prefix matching), supplementary characters work rather transparently in Java. If we make this change, upgrading Lucene to support Unicode 4.0 will be significantly more difficult.

bq. OK, can you shed some more light on how/when your apps do this?

Yes, see LUCENE-1606. That library uses UTF-16 intervals for its transitions, which works fine because the intervals are transparent for its matching purposes, so it has no need to be aware of supplementary characters. If we make this change, I will need to refactor/rewrite a lot of this code, most likely the underlying DFA library itself. This is working in production for me right now, on Chinese text outside the BMP, with Lucene. With this change it will no longer work, and the enumerator will most likely go into an infinite loop!
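
To make the "UTF-16 intervals for transitions" point concrete, here is an illustrative toy (this is not the LUCENE-1606/automaton API, just a sketch of the idea):

{code}
import java.util.ArrayList;
import java.util.List;

// A DFA whose transitions are labeled with inclusive UTF-16 code unit intervals.
// Matching walks the String char by char, so a supplementary character is simply
// two transitions (lead surrogate, then trail surrogate); the automaton never
// needs to know about code points.
final class CodeUnitDfa {
    static final class Transition {
        final char min, max;
        final int to;
        Transition(char min, char max, int to) { this.min = min; this.max = max; this.to = to; }
    }

    private final List<List<Transition>> transitions = new ArrayList<List<Transition>>();
    private final List<Boolean> accept = new ArrayList<Boolean>();

    int addState(boolean accepting) {
        transitions.add(new ArrayList<Transition>());
        accept.add(accepting);
        return transitions.size() - 1;
    }

    void addTransition(int from, char min, char max, int to) {
        transitions.get(from).add(new Transition(min, max, to));
    }

    boolean run(String s) {
        int state = 0;
        next:
        for (int i = 0; i < s.length(); i++) { // code units, not code points
            char c = s.charAt(i);
            for (Transition t : transitions.get(state)) {
                if (c >= t.min && c <= t.max) { state = t.to; continue next; }
            }
            return false;
        }
        return accept.get(state);
    }
}
{code}

A term enumerator driving such a DFA stays consistent as long as the terms come back in the same UTF-16 code unit order the intervals were built against; hand it terms in UTF-8 (code point) order instead and the seek/next logic can skip around or fail to terminate.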

The main difference here is semantics: before, IndexReader.terms() accepted any Unicode String as input. Now it would tighten that restriction to only strings that are interchangeable as UTF-8. Yet the input being used will not be stored as UTF-8 anywhere, and most certainly will not be interchanged! The paper I sent on UTF-16 mentions problems exactly like this, because it is very reasonable and handy to use code units for processing, since supplementary characters are so rare.
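
For reference, this is the kind of usage whose meaning changes (pre-flex API; the field name and prefix handling are just placeholders):

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

// Enumerate all terms in a field that start with a code-unit prefix. Today this
// works even when the prefix ends in a lone lead surrogate, because terms are
// seeked and compared in UTF-16 order.
final class PrefixScan {
    static int countPrefixTerms(IndexReader reader, String field, String prefix) throws IOException {
        int count = 0;
        TermEnum te = reader.terms(new Term(field, prefix)); // seek to first term >= prefix
        try {
            do {
                Term t = te.term();
                if (t == null || !t.field().equals(field) || !t.text().startsWith(prefix)) {
                    break;
                }
                count++;
            } while (te.next());
        } finally {
            te.close();
        }
        return count;
    }
}
{code}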


> Further steps towards flexible indexing
> ---------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
>     uses tii/tis files, but the tii only stores term & long offset
>     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>     offsets absolutely instead of with deltas.  Also, tis/tii
>     are structured by field, so we don't have to record field number
>     in every term.
> .
>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
>     RAM usage when loading terms dict index is significantly less
>     since we only load an array of offsets and an array of String (no
>     more TermInfo array).  It should be faster to init too.
> .
>     This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
>     from docs/positions readers.  EG there is no more TermInfo used
>     when reading the new format.
> .
>     There's nice symmetry now between reading & writing in the codec
>     chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>     This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
>     terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
>     old API on top of the new API to keep back-compat.
>     
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>     fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
>     old API on top of new one, switch all core/contrib users to the
>     new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>     (not just index-file-format flexibility).  EG if someone wanted
>     to store payload at the term-doc level instead of
>     term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
