[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005735#comment-14005735
 ] 

Shai Erera commented on LUCENE-5693:


So the question is: if we think the common case is to have close to 0 deleted 
documents in a newly flushed segment, how important is it to not write the 
postings of those documents? For example, I believe that many applications 
include stored fields in their Documents, and probably a good portion also 
include TermVectors. Since those two are big, and we're not removing deleted 
documents from them, what's the advantage of not writing the postings of 
deleted docs, at the cost of introducing potential bugs, as Rob mentions?

And besides bugs, this does put the index into an inconsistent state. Yes, 
users should not rely on Lucene being able to retrieve deleted documents, and 
we should have the freedom to optimize Lucene internals such that deleted docs 
are never flushed, but I just wonder if, in this case, handling only postings 
while leaving the majority of the index as it is today is worth the hassle and 
potential bugs.

I'm +0.5 on making this change for postings only.

If we could make this change global to all flushed content, including DocValues, 
I'd feel better about it. And I wonder if we can't ... I understand this is 
not how the code works today, but if we computed the liveDocs before we flush 
any field, including DV, couldn't we change the code to not send deleted docs' 
fields to the DocValues API too? In that case we wouldn't need to remap ordinals, 
right? We'd still be flushing stored fields and TV of deleted docs, but perhaps 
that's acceptable?

 don't write deleted documents on flush
 --

 Key: LUCENE-5693
 URL: https://issues.apache.org/jira/browse/LUCENE-5693
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-5693.patch


 When we flush a new segment, sometimes some documents are born deleted, 
 e.g. if the app did an IW.deleteDocuments that matched some not-yet-flushed 
 documents.
 We already compute the liveDocs on flush, but then we continue (wastefully) 
 to send those known-deleted documents to all Codec parts.
 I started to implement this on LUCENE-5675 but it was too controversial.
 Also, I expect that typically the number of deleted docs is 0 or small, so not 
 writing born-deleted docs won't be much of a win for most apps.  Still, it 
 seems silly to write them, consuming IO/CPU in the process, only to consume 
 more IO/CPU later when merging re-deletes them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005736#comment-14005736
 ] 

Michael McCandless commented on LUCENE-5693:


bq. So if we want to rename the issue to, as a special case, don't write 
deleted postings on flush, and remove the TODO about changing this for things 
like DV, then I'm fine.

+1, I'll rename the issue, remove the TODO about not sending deleted docs to 
DVs on flush (make it clear this is only about not writing deleted docs to 
postings), and add a separate test case for the ToParentBJQ.explain bug.




[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005740#comment-14005740
 ] 

Michael McCandless commented on LUCENE-5693:


bq. couldn't we change the code to not send deleted docs' fields to the 
DocValues API too?

We could, but Rob is strongly against that, so I'll remove that TODO and make 
it clear this is just about not wasting IO/CPU writing deleted postings.

Also, remember that postings are the most costly part of a merge, so not 
writing deleted docs there gets us the biggest gain.




[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-22 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005751#comment-14005751
 ] 

ASF subversion and git services commented on LUCENE-5693:
-

Commit 1596783 from [~mikemccand] in branch 'dev/branches/lucene5675'
[ https://svn.apache.org/r1596783 ]

LUCENE-5693, LUCENE-5675: also decouple this bug fix (move to LUCENE-5693) in 
ToParentBJQ.explain




[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005754#comment-14005754
 ] 

Shai Erera commented on LUCENE-5693:


I don't see what's wrong w/ not sending the values of deleted docs to the DV 
API. As far as the Codecs are concerned, they get {{null}} values for 
Sorted/SortedSet/Binary/Numeric, which they should be prepared for anyway, as not 
all docs will have values even without deletes. And so there won't be any 
ord-remapping? Like, if all docs associated w/ value foo are deleted, foo 
won't be sent to the Codec in the first place, and therefore will never receive 
an ord?

I wonder how complicated it is to patch it up, so we can look at it. And 
perhaps we'd even be able to tell if there's any code that breaks.




[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-22 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005787#comment-14005787
 ] 

Simon Willnauer commented on LUCENE-5693:
-

I really wonder if this issue matters. The use case here is when you update a 
document while it's still in RAM, and to me this seems like a real corner case. I 
think we should just stick with the corner case and not complicate the code if 
possible? If it makes things cleaner I am OK with it, but please don't 
complicate things for this corner case...




[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005795#comment-14005795
 ] 

Shai Erera commented on LUCENE-5693:


Hmm, I reviewed SortedDVWriter and now I understand the ord-remapping problem that 
Rob was talking about. I was confused and thought that the Codec is the one 
that assigns the ords, so if it received a {{null}} value, it would be 
fine. But it's the SortedDVWriter which assigns the ords, and since it has already 
assigned an ord to a value of a document that was later deleted, it would need 
to remap the ords around it.

So maybe we do only need to focus on postings in this issue. I don't think we 
need to remove the TODO though ... we have plenty of TODOs in the code, and 
it's valid to consider one day; let's just keep this issue focused for now.
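The remapping problem can be seen with a toy model. Nothing below is actual Lucene code; the sorted-DV writer's ord assignment is simulated with a TreeSet of the buffered values:

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Toy model of why filtering deleted docs out of SortedDocValues at flush
 *  would force an ord-remapping pass. Illustrative only, not Lucene code. */
class OrdRemapDemo {
    /** Distinct values over live docs only; ords would be dense ranks over this set. */
    static TreeSet<String> survivingValues(String[] values, boolean[] deleted) {
        TreeSet<String> surviving = new TreeSet<>();
        for (int doc = 0; doc < values.length; doc++) {
            if (!deleted[doc]) surviving.add(values[doc]);
        }
        return surviving;
    }

    public static void main(String[] args) {
        // doc -> value as buffered by the writer; docs 1 and 3 are born deleted
        String[] values = {"apple", "banana", "cherry", "banana"};
        boolean[] deleted = {false, true, false, true};

        // ords were assigned while buffering, over ALL values:
        // apple=0, banana=1, cherry=2
        TreeSet<String> buffered = new TreeSet<>(Arrays.asList(values));
        System.out.println(buffered);   // [apple, banana, cherry]

        // "banana" lives only on deleted docs, so filtering at flush drops it;
        // cherry's ord must then be remapped from 2 to 1 to stay dense
        System.out.println(survivingValues(values, deleted));  // [apple, cherry]
    }
}
```

The already-assigned ord for "cherry" becomes stale the moment "banana" disappears, which is exactly the remapping pass the writer would have to add.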






[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005828#comment-14005828
 ] 

Michael McCandless commented on LUCENE-5693:


bq. I really wonder if this issue matters. 

I suspect it's uncommon, in most cases, for docs to be born deleted.  But it 
does happen, and it seems silly to waste IO/CPU if we can help it.

bq. I think we should just stick with the corner case and not complicate the 
code if possible?

The patch really does not complicate the code?  It adds a check against the 
liveDocs in the Docs/AndPositionsEnum passed to the codec during flush.  The 
only complexity was fixing a test that made the invalid assumption that deleted 
docs must be present in the postings.
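That liveDocs check can be sketched with a toy doc-ID enumerator. This is illustrative only; `LiveDocsFilteredEnum` and its shape are assumptions, not the real Lucene DocsEnum:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy doc-ID enumerator that skips documents whose liveDocs bit is clear,
 *  mimicking the flush-time check described above. Not a real Lucene class. */
class LiveDocsFilteredEnum {
    private final int[] docIds;     // buffered postings for one term
    private final BitSet liveDocs;  // true = live, false = born deleted
    private int pos = -1;

    LiveDocsFilteredEnum(int[] docIds, BitSet liveDocs) {
        this.docIds = docIds;
        this.liveDocs = liveDocs;
    }

    /** Returns the next live docID, or -1 when exhausted. */
    int nextDoc() {
        while (++pos < docIds.length) {
            if (liveDocs.get(docIds[pos])) {
                return docIds[pos];  // live: handed to the codec, gets written
            }
            // born-deleted doc: silently skipped, never written
        }
        return -1;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0); live.set(2); live.set(3);  // doc 1 was deleted pre-flush
        LiveDocsFilteredEnum e = new LiveDocsFilteredEnum(new int[]{0, 1, 2, 3}, live);
        List<Integer> written = new ArrayList<>();
        for (int d = e.nextDoc(); d != -1; d = e.nextDoc()) written.add(d);
        System.out.println(written);  // [0, 2, 3]
    }
}
```

The codec consuming such an enum never sees doc 1, so no extra bookkeeping is needed on the write side.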

I guess what bothers me here is the apparent precedent that deleted docs are 
in fact required to be present everywhere in a segment.  Yes, this is the case 
today, but I think it's an impl detail and should not be required, e.g. 
enforced by CheckIndex or by tests asserting that it's the case.

But I'll resolve this as WONTFIX ... looks like I'm just outvoted.




[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005837#comment-14005837
 ] 

Shai Erera commented on LUCENE-5693:


bq. But I'll resolve this as WONTFIX ... looks like I'm just outvoted.

I think you jumped to that conclusion too soon. The way I read it, there's one 
committer who is +0.5, one who was OK w/ restricting this to postings only, and 
one who said that if it complicates the code, please don't do it -- but it 
doesn't. I don't think that's called outvoted :). But it's your call...




[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005862#comment-14005862
 ] 

Robert Muir commented on LUCENE-5693:
-

{quote}
I guess what bothers me here is this apparent precedent that deleted docs are 
in fact required to be present everywhere in a segment. Yes, this is the case 
today, but I think it's an impl detail and should not be required, e.g. 
enforced by CheckIndex, tests asserting that it's the case.
{quote}

That's not the case. I am worried about *bugs, complexities, and slowdowns in 
Lucene itself*. I already mentioned my list of concerns and I think they are 
all realistic.

To me, the patch is a bit naive.

Perhaps you forgot (or didn't think about) what Sorted/SortedSetDocValuesWriter 
would have to do if it wanted to filter out deleted documents. This would slow 
down flushing a lot, which presumably matters to people who are deleting 
documents in IndexWriter's ramBuffer. Filtering out deleted documents here 
would only *hurt* the user. Better to leave this to merge.

And what about stored fields and term vectors? Why wouldn't you put a TODO 
there in your patch? Is it because it's OK to have the API and system 
inconsistency there, because it would be slower to buffer them in RAM?

I don't like these implicit exceptions to the rule. If we want to intentionally 
make a mess, there needs to be hard justification for why we are doing such a 
thing. All these little unproven optimizations, API inconsistencies, and 
exceptional cases add up. I think it would be better to only complicate things 
when it's a big win, otherwise the whole codebase will end up looking like 
IndexWriter.java.

All that being said, as I already stated on this issue, I am fine with 
filtering out the postings as an exception to the rule. I really don't like it 
one bit, but I can compromise on this piece if it really brings big benefits. 
Doing it for the rest of the codec API makes no sense at all.





[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-22 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006306#comment-14006306
 ] 

ASF subversion and git services commented on LUCENE-5693:
-

Commit 1596938 from [~mikemccand] in branch 'dev/branches/lucene5675'
[ https://svn.apache.org/r1596938 ]

LUCENE-5675, LUCENE-5693: improve javadocs, disallow term vectors, fix 
precommit issues, remove trivial diffs, add new test case




[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004821#comment-14004821
 ] 

Robert Muir commented on LUCENE-5693:
-

This only makes sense for postings, though.

How can we avoid writing deleted documents in:
* stored fields and term vectors (which we aren't flushing)
* docvalues (we would need to remap ordinals)

By writing them in some places and not in others, we open the possibility of 
extremely confusing corner cases and bugs.




[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004835#comment-14004835
 ] 

Shai Erera commented on LUCENE-5693:


Today we apply the deletes (update the bitset) when a Reader is requested. At 
that point, we have a SegmentReader at hand and we can resolve the 
delete-by-Term/Query to the actual doc IDs ... how would we do that while 
the segment is being flushed? How do we know which documents were associated 
with {{Term t}} when it was sent as a delete?

When I worked on LUCENE-5189 (NumericDocValues updates), I had the same thought 
-- why flush the original numeric value when the document has already been 
updated? But I hit the same issue: which documents were affected by the update 
Term?




[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004902#comment-14004902
 ] 

Michael McCandless commented on LUCENE-5693:


bq. how would we do that while the segment is flushed?

We do it in FreqProxTermsWriter.applyDeletes; since we know the terms to be 
deleted, and we have the BytesRefHash, it's easy.
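A rough sketch of that idea, under a big simplifying assumption: the in-RAM postings are modeled as a plain term-to-docIDs map rather than the real BytesRefHash structures, and all names here are hypothetical:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Toy model of resolving buffered delete-by-term against in-RAM postings,
 *  in the spirit of FreqProxTermsWriter.applyDeletes. Names and data
 *  structures are simplified assumptions, not Lucene internals. */
class ApplyDeletesDemo {
    /** Clears the live bit of every buffered doc containing the delete term. */
    static void applyDelete(Map<String, int[]> termToDocs, String term, BitSet liveDocs) {
        for (int doc : termToDocs.getOrDefault(term, new int[0])) {
            liveDocs.clear(doc);
        }
    }

    public static void main(String[] args) {
        // in-memory inverted index for the not-yet-flushed segment
        Map<String, int[]> termToDocs = new HashMap<>();
        termToDocs.put("id:1", new int[]{0});
        termToDocs.put("id:2", new int[]{1});
        termToDocs.put("body:foo", new int[]{0, 1, 2});

        BitSet liveDocs = new BitSet();
        liveDocs.set(0, 3);  // 3 buffered docs, all live initially

        // a buffered IW.deleteDocuments(new Term("id", "2")) resolves to doc 1
        applyDelete(termToDocs, "id:2", liveDocs);
        System.out.println(liveDocs);  // {0, 2}
    }
}
```

Because the buffered terms are already hashed in RAM, the delete term resolves to its doc IDs with a single lookup, which is why this is cheap at flush time.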




[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004906#comment-14004906
 ] 

Michael McCandless commented on LUCENE-5693:


bq. This only makes sense for postings though.

Right, postings is much easier than doc values.  But postings are also the most 
costly to merge.

bq. By writing them some places and not writing them other places, we open the 
possibility of extremely confusing corner cases and bugs.

I disagree: I think we'd just discover places that are relying on deleted-docs 
behavior, i.e. test bugs.  When I did this on LUCENE-5675 there were only a few 
places that relied on deleted docs.




[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

2014-05-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005450#comment-14005450
 ] 

Robert Muir commented on LUCENE-5693:
-

{quote}
I disagree: I think we discover places that are relying on deleted docs 
behavior, i.e. test bugs. When I did this on LUCENE-5675 there were only a few 
places that relied on deleted docs.
{quote}

That's not the complexity I'm concerned about. I'm talking about bugs in Lucene 
itself, because shit like the following happens:
* various codec APIs unable to cope with writing 0-doc segments because all the 
docs were deleted
* various codec APIs with corner-case bugs because stuff like 'maxDoc' in the 
SegmentInfo they are fed is inconsistent with what they saw
* various index/search APIs unable to cope with docid X appearing in codec API Y 
but not in codec API Z, where it's expected to exist
* slow O(n) passes through IndexWriter APIs to recalculate and reshuffle ordinals 
and stuff like that
* corner-case bugs like incorrect statistics
* additional complexity inside IndexWriter/codecs to handle this, when just 
merging away would be better

So if we want to rename the issue to, as a special case, don't write deleted 
postings on flush, and remove the TODO about changing this for things like DV, 
then I'm fine.

But otherwise, if this is intended to set a precedent for how things should work, 
then I strongly feel we should not do this. The additional complexity and 
corner cases are simply not worth it.
