Mike,

Thanks for the input; it will take me some time to digest and try everything you wrote about. I will post back the answers to your questions and the results from your suggestions once I have gone over everything. Thanks for the quick reply,
Jason

On Sat, Dec 14, 2013 at 5:13 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> It sounds like there are at least two issues.
>
> First, that it takes so long to do the delete.
>
> Unfortunately, deleting by Term is at heart a costly operation. It entails up to one disk seek per segment in your index; a custom Directory impl that makes seeking costly would slow things down, as would the OS not having enough RAM to cache the "hot" pages (if your Dir impl is using the OS). Is seeking somehow costly in your custom Dir impl?
>
> If you are deleting ~1M terms in ~30 minutes, that works out to ~2 msec per Term, which may actually be expected.
>
> How many terms are in your index? Can you run CheckIndex and post the output?
>
> You could index your ID field using MemoryPostingsFormat, which should be a good speedup, but will consume more RAM.
>
> Is it possible to delete by query instead? I.e., create a query that matches the 460K docs and pass that to IndexWriter.deleteDocuments(Query).
>
> Also, try passing fewer ids at once to Lucene, e.g. break the 460K into smaller chunks. Lucene buffers up all deleted terms from one call and then applies them, so my guess is you're using way too much intermediate memory by passing 460K in a single call.
>
> Instead of indexing everything into one index and then deleting tons of docs to "clone" to a new index, why not just index to two separate indices to begin with?
>
> The second issue is that after all that work, nothing in fact changed. For that, I think you should make a small test case that just tries to delete one document, and iterate/debug until that works. Your StringField indexing line looks correct; make sure you're passing precisely the same field name and value? Make sure you're not deleting already-deleted documents? (Your for loop seems to ignore already-deleted documents.)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
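The one-document test Mike describes might look roughly like the sketch below, assuming a throwaway RAMDirectory and the Lucene 4.6 API; the field name comes from the thread, while the value and class name are made up for illustration.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class DeleteOneDocTest {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(Version.LUCENE_46,
                            new StandardAnalyzer(Version.LUCENE_46)));

            // Index one document using the same field definition as in the thread;
            // "file-0001" is a hypothetical value for illustration.
            Document doc = new Document();
            doc.add(new StringField("FileName", "file-0001", Field.Store.YES));
            writer.addDocument(doc);
            writer.commit();

            // Delete by exactly the same field name and (untokenized) value, then commit.
            writer.deleteDocuments(new Term("FileName", "file-0001"));
            writer.commit();

            // Verify the deletion actually took effect.
            DirectoryReader reader = DirectoryReader.open(dir);
            System.out.println("numDocs after delete = " + reader.numDocs()); // expect 0
            reader.close();
            writer.close();
            dir.close();
        }
    }

If this prints 0 but the same pattern fails against the real index, the likely culprits are the ones Mike lists: a field name or value mismatch, or deletes aimed at documents that are already deleted.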
> On Sat, Dec 14, 2013 at 11:38 AM, Jason Corekin <jason.core...@gmail.com> wrote:
> > I knew that I had forgotten something. Below is the line that I use to create the field that I am trying to use to delete the entries with. I hope this avoids some confusion. Thank you very much to anyone that takes the time to read these messages.
> >
> >     doc.add(new StringField("FileName", filename, Field.Store.YES));
> >
> > On Sat, Dec 14, 2013 at 1:28 AM, Jason Corekin <jason.core...@gmail.com> wrote:
> >> Let me start by stating that I am almost certain that I am doing something wrong, and I hope that I am, because if not there is a VERY large bug in Lucene. What I am trying to do is use the method
> >>
> >>     deleteDocuments(Term... terms)
> >>
> >> out of the IndexWriter class to delete several Term object arrays, each fed to it via a separate Thread. Each array has around 460k+ Term objects in it. The issue is that after running for around 30 minutes or more the method finishes, I then have a commit run, and nothing changes with my files. To be fair, I am running a custom Directory implementation that might be causing problems, but I do not think that this is the case, as I do not even see any of my Directory methods in the stack trace. In fact, when I set breakpoints inside the delete methods of my Directory implementation, they never even get hit.
> >>
> >> To be clear, replacing the custom Directory implementation with a standard one is not an option due to the nature of the data, which is made up of terabytes of small (1k and less) files. So, if the issue is in the Directory implementation, I have to figure out how to fix it.
> >>
> >> Below are the pieces of code that I think are relevant to this issue, as well as a copy of the stack trace of the thread that was doing work when I paused the debug session. As you are likely to notice, the thread is called a DBCloner because it is being used to clone the underlying index-based database (needed to avoid storing trillions of files directly on disk). The idea is to duplicate the selected group of terms into a new database and then delete the original terms from the original database. The duplication works wonderfully, but no matter what I do, including cutting the program down to one thread, I cannot shrink the database, and the time it takes to do the deletes is drastically too long.
> >>
> >> In an attempt to be as helpful as possible, I will say this: I have been tracing this problem for a few days and have seen that
> >>
> >>     BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef)
> >>
> >> is where the majority of the execution time is spent. I have also noticed that this method returns false MUCH more often than it returns true. I have been trying to figure out how the mechanics of this process work, just in case the issue was not in my code and I might have been able to find the problem. But I have yet to find the problem in either Lucene 4.5.1 or Lucene 4.6. If anyone has any ideas as to what I might be doing wrong, I would really appreciate reading what you have to say. Thanks in advance.
> >>
> >> Jason
> >>
> >>     private void cloneDB() throws QueryNodeException {
> >>
> >>         Document doc;
> >>         ArrayList<String> fileNames;
> >>         int start = docRanges[(threadNumber * 2)];
> >>         int stop = docRanges[(threadNumber * 2) + 1];
> >>
> >>         try {
> >>             fileNames = new ArrayList<String>(docsPerThread);
> >>             for (int i = start; i < stop; i++) {
> >>                 doc = searcher.doc(i);
> >>                 try {
> >>                     adder.addDoc(doc);
> >>                     fileNames.add(doc.get("FileName"));
> >>                 } catch (TransactionExceptionRE | TransactionException | LockConflictException te) {
> >>                     adder.txnAbort();
> >>                     System.err.println(Thread.currentThread().getName()
> >>                             + ": Adding a message failed, retrying.");
> >>                 }
> >>             }
> >>             deleters[threadNumber].deleteTerms("FileName", fileNames);
> >>             deleters[threadNumber].commit();
> >>         } catch (IOException | ParseException ex) {
> >>             Logger.getLogger(DocReader.class.getName()).log(Level.SEVERE, null, ex);
> >>         }
> >>     }
> >>
> >>     public void deleteTerms(String dbField, ArrayList<String> fieldTexts) throws IOException {
> >>         Term[] terms = new Term[fieldTexts.size()];
> >>         for (int i = 0; i < fieldTexts.size(); i++) {
> >>             terms[i] = new Term(dbField, fieldTexts.get(i));
> >>         }
> >>         writer.deleteDocuments(terms);
> >>     }
> >>
> >>     public void deleteDocuments(Term... terms) throws IOException
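Mike's suggestion to pass fewer ids per call could be folded into the deleteTerms method quoted above roughly as in this sketch; the batch size is an arbitrary guess that would need tuning, and the caller still commits afterward as in the original code.

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class ChunkedDeleter {
        private final IndexWriter writer;

        public ChunkedDeleter(IndexWriter writer) {
            this.writer = writer;
        }

        // Delete by Term in small batches instead of buffering 460K+ terms in one call.
        public void deleteTerms(String dbField, List<String> fieldTexts, int batchSize)
                throws IOException {
            List<Term> batch = new ArrayList<Term>(batchSize);
            for (String text : fieldTexts) {
                batch.add(new Term(dbField, text));
                if (batch.size() == batchSize) {
                    writer.deleteDocuments(batch.toArray(new Term[batch.size()]));
                    batch.clear();
                }
            }
            // Flush whatever is left over from the last partial batch.
            if (!batch.isEmpty()) {
                writer.deleteDocuments(batch.toArray(new Term[batch.size()]));
            }
        }
    }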
> >>
> >>     Thread [DB Cloner 2] (Suspended)
> >>         owns: BufferedUpdatesStream (id=54)
> >>         owns: IndexWriter (id=49)
> >>         FST<T>.readFirstRealTargetArc(long, Arc<T>, BytesReader) line: 979
> >>         FST<T>.findTargetArc(int, Arc<T>, Arc<T>, BytesReader) line: 1220
> >>         BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef) line: 1679
> >>         BufferedUpdatesStream.applyTermDeletes(Iterable<Term>, ReadersAndUpdates, SegmentReader) line: 414
> >>         BufferedUpdatesStream.applyDeletesAndUpdates(ReaderPool, List<SegmentCommitInfo>) line: 283
> >>         IndexWriter.applyAllDeletesAndUpdates() line: 3112
> >>         IndexWriter.applyDeletesAndPurge(boolean) line: 4641
> >>         DocumentsWriter$ApplyDeletesEvent.process(IndexWriter, boolean, boolean) line: 673
> >>         IndexWriter.processEvents(Queue<Event>, boolean, boolean) line: 4665
> >>         IndexWriter.processEvents(boolean, boolean) line: 4657
> >>         IndexWriter.deleteDocuments(Term...) line: 1421
> >>         DocDeleter.deleteTerms(String, ArrayList<String>) line: 95
> >>         DBCloner.cloneDB() line: 233
> >>         DBCloner.run() line: 133
> >>         Thread.run() line: 744
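The hot frame in that trace is the per-term seekExact, which is what Mike's MemoryPostingsFormat suggestion targets. A sketch of routing the FileName field to the in-RAM postings format, assuming the Lucene 4.6 codec API and the lucene-codecs jar on the classpath; only segments written after this change benefit, so the index would need to be rebuilt or re-added with this writer.

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene46.Lucene46Codec;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    public class MemoryIdFieldWriter {
        // Build an IndexWriter that keeps the FileName field's terms fully in RAM,
        // so delete-by-Term lookups on that field avoid per-segment disk seeks.
        public static IndexWriter open(Directory dir) throws IOException {
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
                    new StandardAnalyzer(Version.LUCENE_46));
            iwc.setCodec(new Lucene46Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    // Only the ID-like field goes to the "Memory" postings format;
                    // everything else keeps the default.
                    return "FileName".equals(field)
                            ? PostingsFormat.forName("Memory")
                            : super.getPostingsFormatForField(field);
                }
            });
            return new IndexWriter(dir, iwc);
        }
    }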