[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781420#action_12781420 ]
Robert Muir commented on LUCENE-1458:
-------------------------------------

{quote}
I realize a java String can easily contain an unpaired surrogate (eg, your test case) since it operates in code units not code points, but, that's not valid unicode, right?
{quote}

It is valid Unicode: it is a valid "Unicode String". This is different from a Term stored in the index, which will be stored as UTF-8 and thus purports to be in a valid Unicode encoding form. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate directly on code units, without interpreting them as characters. See, especially, the discussion under D89:

{quote}
D89: Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form.

• For example, it is perfectly reasonable to talk about an operation that takes the two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which contains an ill-formed UTF-16 code unit sequence, and concatenates them to form another Unicode string <004D D800 DF02 004D>, which contains a well-formed UTF-16 code unit sequence. The first two Unicode strings are not in UTF-16, but the resultant Unicode string is.
{quote}

{quote}
But how would a search application based on an east asian language actually create such a term? In what situation would an unpaired surrogate find its way down to TermEnum?
{quote}

I gave an example already: using FuzzyQuery with, say, a prefix length of one. With the current code (even in the flex branch!), this will create a lead-surrogate prefix. There is code in the Lucene core that does things like this (which I plan to fix, while also trying to preserve back compat!); this change makes it impossible to preserve back compat.

There is also probably a lot of non-Lucene east asian code that does similar things. For example, someone with data from Hong Kong almost certainly encounters supplementary characters, because they are part of Big5-HKSCS. They may not be aware of this situation: they might take a string, substring(0, 1), and do a prefix query. Right now this works! This is part of the idea that for most operations (such as prefix), supplementary characters work rather transparently in Java. If we do this, upgrading Lucene to support Unicode 4.0 will be significantly more difficult.

bq. OK, can you shed some more light on how/when your apps do this?

Yes, see LUCENE-1606. This library uses UTF-16 intervals for transitions, which works fine because, for its matching purposes, this is transparent, so there is no need for it to be aware of supplementary characters. If we make this change, I will need to refactor/rewrite a lot of this code, most likely the underlying DFA library itself. This is working in production for me, on Chinese text outside of the BMP, with Lucene right now. With this change it will no longer work, and the enumerator will most likely go into an infinite loop!

The main difference here is semantics: before, IndexReader.terms() accepted any Unicode string as input. Now it would tighten that restriction to only strings interchangeable as UTF-8. Yet the input being used will not be stored as UTF-8 anywhere, and most certainly will not be interchanged!

The paper I sent on UTF-16 mentions problems like this, because it is very reasonable and handy to use code units for processing, since supplementary characters are so rare.
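To make the two scenarios above concrete (the D89 concatenation example and the substring/prefix case), here is a small self-contained Java sketch using only plain JDK string handling, nothing Lucene-specific; the supplementary code point U+29E3D is just an arbitrary pick for illustration:

{code}
public class SurrogateExamples {
  public static void main(String[] args) {
    // D89 example: each half is an ill-formed UTF-16 code unit sequence,
    // but their concatenation is a well-formed one.
    String first  = "\u004D\uD800";   // <004D D800>: ends with an unpaired lead surrogate
    String second = "\uDF02\u004D";   // <DF02 004D>: starts with an unpaired trail surrogate
    String joined = first + second;   // <004D D800 DF02 004D>: well-formed UTF-16
    System.out.println(joined.codePointCount(0, joined.length())); // 3 code points: M, U+10302, M

    // Prefix example: substring(0, 1) on a supplementary character yields a
    // lone lead surrogate, which today is a perfectly usable prefix term.
    // (U+29E3D is an arbitrary supplementary code point chosen for illustration.)
    String supplementary = new String(Character.toChars(0x29E3D)); // 1 code point, 2 code units
    String prefix = supplementary.substring(0, 1);                 // "\uD867": unpaired lead surrogate
    System.out.println(Character.isHighSurrogate(prefix.charAt(0))); // true
    // If terms must be interchangeable as UTF-8, this prefix is no longer a legal input.
  }
}
{code}

Nothing here ever interprets the code units as characters, which is exactly the kind of low-level processing the conformance clauses allow.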
> Further steps towards flexible indexing
> ---------------------------------------
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
> I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though back compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg calling TermPositions.nextPosition() too many times, which the new API asserts against).
>
> [Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package-private APIs on that branch, then fix the nightly build to use the tip of that branch?]
>
> There's still plenty to do before this is committable! This is a rather large change:
>
> * Switches to a new, more efficient terms dict format. This still uses tii/tis files, but the tii only stores term & long offset (not a TermInfo). At seek points, tis encodes term & freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term.
> .
> On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading the terms dict index is significantly less, since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too.
> .
> This part is basically done.
> * Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. EG there is no more TermInfo used when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
> * Introduces a new "flex" API for iterating through the fields, terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions. SegmentReader emulates the old API on top of the new API to keep back-compat.
>
> Next steps:
>
> * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions.
> * Expose new API out of IndexReader, deprecate old API but emulate old API on top of new one, switch all core/contrib users to the new API.
> * Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store payload at the term-doc level instead of term-doc-position level, you could just add a new attribute.
> * Test performance & iterate.
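As an aside for readers new to the proposed chain, here is a rough sketch of how an enumeration like the one described above might be consumed. The interfaces, method names, and return conventions are hypothetical placeholders inferred only from the FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum chain named in the description; the actual flex-branch API in the attached patches may differ:

{code}
import java.io.IOException;

// Hypothetical shapes for the four enumerators named in the issue description;
// not the actual flex-branch API.
interface PostingsEnum {
  int nextPosition() throws IOException;           // next position in the current doc (assumed)
}

interface DocsEnum {
  int nextDoc() throws IOException;                // next doc id, or -1 when exhausted (assumed)
  PostingsEnum positions() throws IOException;     // positions for the current doc (assumed)
}

interface TermsEnum {
  String next() throws IOException;                // next term, or null when exhausted (assumed)
  DocsEnum docs() throws IOException;              // docs for the current term (assumed)
}

interface FieldProducer {
  TermsEnum terms(String field) throws IOException; // terms enumerator for one field (assumed)
}

// Walking the whole chain: field -> terms -> docs -> positions.
class FlexApiSketch {
  static void dump(FieldProducer fields, String field) throws IOException {
    TermsEnum terms = fields.terms(field);
    for (String term = terms.next(); term != null; term = terms.next()) {
      DocsEnum docs = terms.docs();
      for (int doc = docs.nextDoc(); doc != -1; doc = docs.nextDoc()) {
        PostingsEnum postings = docs.positions();
        System.out.println(field + ":" + term + " doc=" + doc
            + " firstPos=" + postings.nextPosition());
      }
    }
  }
}
{code}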