[ https://issues.apache.org/jira/browse/LUCENE-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414566#comment-13414566 ]

Robert Muir commented on LUCENE-4221:
-------------------------------------

This patch disables all the offsets checks for term vectors.

I'd like a plan to start enforcing this stuff in IndexWriter for term vectors 
as well, so we can actually have these checks on at some point in the future. 
Sure, maybe it's annoying that things like ngrams violate all these rules and 
will fail if term vectors are on, but these are broken analyzers that need to 
be fixed, and we shouldn't allow bogus data in the index.

The problem with the current situation (besides CheckIndex) is that if someone 
has such bogus offsets in an older index and they try to use something like 
Highlighter, they will just trip errors from OffsetAttribute, etc. So those 
features won't really work.
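
To make the rules concrete, here is a minimal sketch of the kind of offsets check under discussion. It does not use any Lucene API: the class and method names are hypothetical, and the arrays stand in for the offsets of one field's tokens in stream order. The issue only states that startOffset > endOffset is rejected for postings; the "non-negative" and "must not go backwards" rules are assumptions about what a full check would cover.

```java
/** Hypothetical helper, not part of Lucene: validates token offsets for one field. */
final class OffsetChecks {
    private OffsetChecks() {}

    /**
     * Returns true if every token has 0 <= startOffset <= endOffset and
     * start offsets never decrease from one token to the next.
     * The ordering rule is an assumption, not something stated in the issue.
     */
    static boolean hasValidOffsets(int[] startOffsets, int[] endOffsets) {
        if (startOffsets.length != endOffsets.length) {
            return false;
        }
        int lastStart = 0;
        for (int i = 0; i < startOffsets.length; i++) {
            int start = startOffsets[i];
            int end = endOffsets[i];
            if (start < 0 || end < start) {
                return false; // negative start, or startOffset > endOffset
            }
            if (start < lastStart) {
                return false; // offsets went backwards (assumed rule)
            }
            lastStart = start;
        }
        return true;
    }

    public static void main(String[] args) {
        // Two well-formed tokens.
        System.out.println(hasValidOffsets(new int[] {0, 4}, new int[] {3, 8}));
        // startOffset > endOffset, the case described in the issue.
        System.out.println(hasValidOffsets(new int[] {5}, new int[] {2}));
    }
}
```

An index written without such a check can contain offsets for which this returns false, which is why consumers that re-validate offsets can fail on older data.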

Best idea I have so far:
# Fix LUCENE-4180 so that we can differentiate between 4.0-alpha and 4.0-beta 
indexes
# Change default term vectors merge impl to buffer one doc in RAM; if it has 
invalid offsets, clear the offsets bit and don't write them.
# Only enable bulk merge for the 4.x codec when the segment was written by 
4.0-beta+; otherwise just call super.merge
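
Step 2 could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual merge code: TermEntry and BufferedVectorMerge are hypothetical stand-ins for one buffered document's term vector data, a null offsets array models "offsets bit cleared", and validity here only means 0 <= startOffset <= endOffset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the one-doc buffering idea; not Lucene's merge code. */
final class BufferedVectorMerge {
    private BufferedVectorMerge() {}

    /** One term of a buffered document's vector; null arrays mean "no offsets". */
    static final class TermEntry {
        final String term;
        final int[] startOffsets;
        final int[] endOffsets;

        TermEntry(String term, int[] startOffsets, int[] endOffsets) {
            this.term = term;
            this.startOffsets = startOffsets;
            this.endOffsets = endOffsets;
        }

        boolean hasOffsets() {
            return startOffsets != null && endOffsets != null;
        }
    }

    /** Checks every stored offset pair in the buffered document. */
    static boolean offsetsAreValid(List<TermEntry> doc) {
        for (TermEntry e : doc) {
            if (!e.hasOffsets()) {
                continue;
            }
            if (e.startOffsets.length != e.endOffsets.length) {
                return false;
            }
            for (int i = 0; i < e.startOffsets.length; i++) {
                if (e.startOffsets[i] < 0 || e.endOffsets[i] < e.startOffsets[i]) {
                    return false;
                }
            }
        }
        return true;
    }

    /**
     * Returns what should be written for this document: the buffered entries
     * unchanged if all offsets are valid, otherwise copies with offsets dropped.
     */
    static List<TermEntry> prepareForWrite(List<TermEntry> doc) {
        if (offsetsAreValid(doc)) {
            return doc;
        }
        List<TermEntry> stripped = new ArrayList<>(doc.size());
        for (TermEntry e : doc) {
            stripped.add(new TermEntry(e.term, null, null));
        }
        return stripped;
    }

    public static void main(String[] args) {
        List<TermEntry> doc = new ArrayList<>();
        doc.add(new TermEntry("foo", new int[] {0}, new int[] {3}));
        doc.add(new TermEntry("bar", new int[] {9}, new int[] {4})); // start > end
        for (TermEntry e : prepareForWrite(doc)) {
            System.out.println(e.term + " offsets kept: " + e.hasOffsets());
        }
    }
}
```

Buffering the whole document before writing is what makes the all-or-nothing decision possible: a single bad offset anywhere in the document causes offsets to be omitted for it, so the written data is never partially invalid.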

One downside is that we must keep the one-doc buffering (part 2) even in trunk 
until 6.x to support 4.0-alpha indexes, but it's too late now.

                
> CheckIndex is overeager for term vector offsets bounds checks
> -------------------------------------------------------------
>
>                 Key: LUCENE-4221
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4221
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 4.0-ALPHA
>            Reporter: Robert Muir
>             Fix For: 4.0, 5.0
>
>         Attachments: LUCENE-4221.patch
>
>
> In some situations (like running shingles twice), you end up with a case 
> where startOffset > endOffset.
> We prevent this in IndexWriter for postings offsets, but we never do any 
> validation here for term vectors (at some point, maybe we should make a plan 
> to address this?)
> Anyway, currently CheckIndex will wrongly fail in this situation, which some 
> of our own analyzers even do (e.g. LUCENE-3920)...
> This is an overly eager validation in CheckIndex (for vectors, we cannot 
> safely do these assertions as it was/is never enforced by IndexWriter, only 
> for postings offsets).

