Mike,

Thanks for the input; it will take me some time to digest and try everything you wrote about. I will post back the answers to your questions and the results from your suggestions once I have gone over everything. Thanks for the quick reply,
Jason

On Sat, Dec 14, 2013 at 5:13 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> It sounds like there are at least two issues.
>
> First, that it takes so long to do the delete.
>
> Unfortunately, deleting by Term is at heart a costly operation. It entails up to one disk seek per segment in your index; a custom Directory impl that makes seeking costly would slow things down, as would the OS not having enough RAM to cache the "hot" pages (if your Dir impl is using the OS). Is seeking somehow costly in your custom Dir impl?
>
> If you are deleting ~1M terms in ~30 minutes, that works out to ~2 msec per Term, which may actually be expected.
>
> How many terms are in your index? Can you run CheckIndex and post the output?
>
> You could index your ID field using MemoryPostingsFormat, which should be a good speedup, but will consume more RAM.
>
> Is it possible to delete by query instead? I.e., create a query that matches the 460K docs and pass that to IndexWriter.deleteDocuments(Query).
>
> Also, try passing fewer ids at once to Lucene, e.g. break the 460K into smaller chunks. Lucene buffers up all deleted terms from one call and then applies them, so my guess is you're using way too much intermediate memory by passing 460K in a single call.
>
> Instead of indexing everything into one index and then deleting tons of docs to "clone" to a new index, why not just index to two separate indices to begin with?
>
> The second issue is that after all that work, nothing in fact changed. For that, I think you should make a small test case that just tries to delete one document, and iterate/debug until that works. Your StringField indexing line looks correct; make sure you're passing precisely the same field name and value? Make sure you're not deleting already-deleted documents? (Your for loop seems to ignore already-deleted documents.)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
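The one-document test Mike describes might look roughly like the sketch below, assuming a throwaway RAMDirectory and the Lucene 4.6 API; the field name comes from the thread, while the value and class name are made up for illustration.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class DeleteOneDocTest {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(Version.LUCENE_46,
                            new StandardAnalyzer(Version.LUCENE_46)));

            // Index one document using the same field definition as in the thread;
            // "file-0001" is a hypothetical value for illustration.
            Document doc = new Document();
            doc.add(new StringField("FileName", "file-0001", Field.Store.YES));
            writer.addDocument(doc);
            writer.commit();

            // Delete by exactly the same field name and (untokenized) value, then commit.
            writer.deleteDocuments(new Term("FileName", "file-0001"));
            writer.commit();

            // Verify the deletion actually took effect.
            DirectoryReader reader = DirectoryReader.open(dir);
            System.out.println("numDocs after delete = " + reader.numDocs()); // expect 0
            reader.close();
            writer.close();
            dir.close();
        }
    }

If this prints 0 but the same pattern fails against the real index, the likely culprits are the ones Mike lists: a field name or value mismatch, or deletes aimed at documents that are already deleted.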
> On Sat, Dec 14, 2013 at 11:38 AM, Jason Corekin <jason.core...@gmail.com> wrote:
> > I knew that I had forgotten something. Below is the line that I use to create the field that I am trying to use to delete the entries with. I hope this avoids some confusion. Thank you very much to anyone that takes the time to read these messages.
> >
> >     doc.add(new StringField("FileName", filename, Field.Store.YES));
> >
> > On Sat, Dec 14, 2013 at 1:28 AM, Jason Corekin <jason.core...@gmail.com> wrote:
> >> Let me start by stating that I am almost certain that I am doing something wrong, and I hope that I am, because if not there is a VERY large bug in Lucene. What I am trying to do is use the method
> >>
> >>     deleteDocuments(Term... terms)
> >>
> >> out of the IndexWriter class to delete several Term object arrays, each fed to it via a separate Thread. Each array has around 460k+ Term objects in it. The issue is that after running for around 30 minutes or more the method finishes, I then have a commit run, and nothing changes with my files. To be fair, I am running a custom Directory implementation that might be causing problems, but I do not think that this is the case, as I do not even see any of my Directory methods in the stack trace. In fact, when I set breakpoints inside the delete methods of my Directory implementation, they never even get hit.
> >>
> >> To be clear, replacing the custom Directory implementation with a standard one is not an option due to the nature of the data, which is made up of terabytes of small (1k and less) files. So, if the issue is in the Directory implementation, I have to figure out how to fix it.
> >>
> >> Below are the pieces of code that I think are relevant to this issue, as well as a copy of the stack trace of the thread that was doing work when I paused the debug session. As you are likely to notice, the thread is called a DBCloner because it is being used to clone the underlying index-based database (needed to avoid storing trillions of files directly on disk). The idea is to duplicate the selected group of terms into a new database and then delete the original terms from the original database. The duplication works wonderfully, but no matter what I do, including cutting the program down to one thread, I cannot shrink the database, and the time it takes to do the deletes is drastically too long.
> >>
> >> In an attempt to be as helpful as possible, I will say this: I have been tracing this problem for a few days and have seen that
> >>
> >>     BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef)
> >>
> >> is where the majority of the execution time is spent. I have also noticed that this method returns false MUCH more often than it returns true. I have been trying to figure out how the mechanics of this process work, just in case the issue was not in my code and I might have been able to find the problem. But I have yet to find the problem in either Lucene 4.5.1 or Lucene 4.6. If anyone has any ideas as to what I might be doing wrong, I would really appreciate reading what you have to say. Thanks in advance.
> >>
> >> Jason
> >>
> >>     private void cloneDB() throws QueryNodeException {
> >>
> >>         Document doc;
> >>         ArrayList<String> fileNames;
> >>         int start = docRanges[(threadNumber * 2)];
> >>         int stop = docRanges[(threadNumber * 2) + 1];
> >>
> >>         try {
> >>             fileNames = new ArrayList<String>(docsPerThread);
> >>             for (int i = start; i < stop; i++) {
> >>                 doc = searcher.doc(i);
> >>                 try {
> >>                     adder.addDoc(doc);
> >>                     fileNames.add(doc.get("FileName"));
> >>                 } catch (TransactionExceptionRE | TransactionException | LockConflictException te) {
> >>                     adder.txnAbort();
> >>                     System.err.println(Thread.currentThread().getName()
> >>                             + ": Adding a message failed, retrying.");
> >>                 }
> >>             }
> >>             deleters[threadNumber].deleteTerms("FileName", fileNames);
> >>             deleters[threadNumber].commit();
> >>         } catch (IOException | ParseException ex) {
> >>             Logger.getLogger(DocReader.class.getName()).log(Level.SEVERE, null, ex);
> >>         }
> >>     }
> >>
> >>     public void deleteTerms(String dbField, ArrayList<String> fieldTexts) throws IOException {
> >>         Term[] terms = new Term[fieldTexts.size()];
> >>         for (int i = 0; i < fieldTexts.size(); i++) {
> >>             terms[i] = new Term(dbField, fieldTexts.get(i));
> >>         }
> >>         writer.deleteDocuments(terms);
> >>     }
> >>
> >>     public void deleteDocuments(Term... terms) throws IOException
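Mike's suggestion to pass fewer ids per call could be folded into the deleteTerms method quoted above roughly as in this sketch; the batch size is an arbitrary guess that would need tuning, and the caller still commits afterward as in the original code.

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class ChunkedDeleter {
        private final IndexWriter writer;

        public ChunkedDeleter(IndexWriter writer) {
            this.writer = writer;
        }

        // Delete by Term in small batches instead of buffering 460K+ terms in one call.
        public void deleteTerms(String dbField, List<String> fieldTexts, int batchSize)
                throws IOException {
            List<Term> batch = new ArrayList<Term>(batchSize);
            for (String text : fieldTexts) {
                batch.add(new Term(dbField, text));
                if (batch.size() == batchSize) {
                    writer.deleteDocuments(batch.toArray(new Term[batch.size()]));
                    batch.clear();
                }
            }
            // Flush whatever is left over from the last partial batch.
            if (!batch.isEmpty()) {
                writer.deleteDocuments(batch.toArray(new Term[batch.size()]));
            }
        }
    }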
> >>
> >>     Thread [DB Cloner 2] (Suspended)
> >>         owns: BufferedUpdatesStream (id=54)
> >>         owns: IndexWriter (id=49)
> >>         FST<T>.readFirstRealTargetArc(long, Arc<T>, BytesReader) line: 979
> >>         FST<T>.findTargetArc(int, Arc<T>, Arc<T>, BytesReader) line: 1220
> >>         BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef) line: 1679
> >>         BufferedUpdatesStream.applyTermDeletes(Iterable<Term>, ReadersAndUpdates, SegmentReader) line: 414
> >>         BufferedUpdatesStream.applyDeletesAndUpdates(ReaderPool, List<SegmentCommitInfo>) line: 283
> >>         IndexWriter.applyAllDeletesAndUpdates() line: 3112
> >>         IndexWriter.applyDeletesAndPurge(boolean) line: 4641
> >>         DocumentsWriter$ApplyDeletesEvent.process(IndexWriter, boolean, boolean) line: 673
> >>         IndexWriter.processEvents(Queue<Event>, boolean, boolean) line: 4665
> >>         IndexWriter.processEvents(boolean, boolean) line: 4657
> >>         IndexWriter.deleteDocuments(Term...) line: 1421
> >>         DocDeleter.deleteTerms(String, ArrayList<String>) line: 95
> >>         DBCloner.cloneDB() line: 233
> >>         DBCloner.run() line: 133
> >>         Thread.run() line: 744
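The hot frame in that trace is the per-term seekExact, which is what Mike's MemoryPostingsFormat suggestion targets. A sketch of routing the FileName field to the in-RAM postings format, assuming the Lucene 4.6 codec API and the lucene-codecs jar on the classpath; only segments written after this change benefit, so the index would need to be rebuilt or re-added with this writer.

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene46.Lucene46Codec;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    public class MemoryIdFieldWriter {
        // Build an IndexWriter that keeps the FileName field's terms fully in RAM,
        // so delete-by-Term lookups on that field avoid per-segment disk seeks.
        public static IndexWriter open(Directory dir) throws IOException {
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
                    new StandardAnalyzer(Version.LUCENE_46));
            iwc.setCodec(new Lucene46Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    // Only the ID-like field goes to the "Memory" postings format;
                    // everything else keeps the default.
                    return "FileName".equals(field)
                            ? PostingsFormat.forName("Memory")
                            : super.getPostingsFormatForField(field);
                }
            });
            return new IndexWriter(dir, iwc);
        }
    }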