You are right, but maybe it is possible to solve this problem. I can try :) I am not sure, but in NRT, using a single committer, there is a single batch thread executing the commits, so they should be sequential.

I think your case is when 2 segments are not yet merged and contain changes to the same entities. I imagine this case can happen only until the segments are merged into a single segment. So practically, if I add extra information about deletions, there is no risk of consuming too much disk, because it is a temporary state, not cumulative. In the database I have a special table "deletion_table" with columns (transactionId, entityId). For example:

document A inserted in segment 1 with transaction 5
document B inserted in segment 1 with transaction 5
document C inserted in segment 1 with transaction 6
document A deleted in segment 2 with transaction 7, and I save 7 -> A in the deletion table until the segments are merged

Now suppose segment 2 is corrupted. Searching the range [7, *] in the database, I look for transaction 7: I don't find it in the document tables, but I find 7 -> A in the deletion table. I check that entity A is not present in the document tables, so I can deduce that I have to replay the deletion of entity A in Lucene, removing that document.
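A rough sketch of that replay check (DeletionRecord and the db helper methods are hypothetical names of mine; deleteDocuments is the real Lucene call):

    // After dropping the corrupted segment, re-scan the deletion table over
    // the transaction range the lost segment covered.
    for (DeletionRecord rec : db.queryDeletionTable(fromTxId, toTxId)) {
        if (!db.documentTablesContain(rec.entityId)) {
            // the entity is gone from the DB: replay the delete in Lucene
            writer.deleteDocuments(new Term("entityId", rec.entityId));
        }
    }
    writer.commit();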
2017-03-23 16:18 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:

> If you use a single thread then, yes, segments are sequential.
>
> But if e.g. you are updating documents, then deletions (because a document
> was replaced) are recorded against different segments, so merely dropping
> the corrupted segment will mean you don't drop the deletions.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Mar 23, 2017 at 10:29 AM, Cristian Lorenzetto
> <cristian.lorenze...@gmail.com> wrote:
>
>> I deduce the transaction range not from the corrupted segment but from
>> the intact segments. The transaction id is incremental, and I imagine
>> segments are saved sequentially, so if segment 5 is missing, reading the
>> intact segment 4 I can find the maximum transaction id A, and reading
>> segment 6 I can find the minimum transaction id B. So I can deduce the
>> hole: the range is [A+1, B-1]. With a query in the database I reload the
>> corresponding rows and add the missing documents to Lucene again.
>>
>> 2017-03-23 15:17 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:
>>
>>> Lucene corruption should be rare and only due to bad hardware; if you
>>> are seeing otherwise we really should get to the root cause.
>>>
>>> Mapping documents to each segment will not be easy in general,
>>> especially if that segment is now corrupted so you can't search it.
>>>
>>> Documents lost because of power loss / OS crash while indexing can be
>>> more common, and it's for that use case that the sequence numbers /
>>> transaction log should be helpful.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Mar 23, 2017 at 10:12 AM, Cristian Lorenzetto
>>> <cristian.lorenze...@gmail.com> wrote:
>>>
>>>> Yes, exactly. Working in the past on systems that use Lucene (for
>>>> example Alfresco projects), I saw that Lucene corruption happens
>>>> sometimes, and every time rebuilding takes a lot of time... so I
>>>> thought of a way to accelerate fixing a corrupted index. In addition,
>>>> there is a rare case not described here: if Lucene throws an exception
>>>> after a database commit (for example, the disk is full), the database
>>>> and the Lucene index can become misaligned. With this system these
>>>> problems could be fixed automatically. In the database every row has a
>>>> property with the transaction id, so if I know Lucene is missing
>>>> segment 6, corresponding to the transaction range [1000, 1050], I can
>>>> reload just the corresponding rows with a database query.
>>>>
>>>> 2017-03-23 14:59 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:
>>>>
>>>>> You should be able to use the sequence numbers returned by
>>>>> IndexWriter operations to "know" which operations made it into the
>>>>> commit and which did not, and then on disaster recovery replay only
>>>>> those operations that didn't make it?
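>>>>> E.g., something like this sketch (IndexWriter operations return
>>>>> sequence numbers since Lucene 6.2):
>>>>>
>>>>>   long opSeqNo = writer.updateDocument(new Term("id", id), doc);
>>>>>   ...
>>>>>   long commitSeqNo = writer.commit();
>>>>>   // operations with opSeqNo <= commitSeqNo made it into the commit;
>>>>>   // on recovery, replay only the operations above commitSeqNo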
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Thu, Mar 23, 2017 at 5:53 AM, Cristian Lorenzetto
>>>>> <cristian.lorenze...@gmail.com> wrote:
>>>>>
>>>>>> Errata corrige / integration for the questions in my previous post.
>>>>>>
>>>>>> I studied these Lucene classes a bit, to understand:
>>>>>>
>>>>>> 1) setCommitData is designed for versioning the index, not for
>>>>>> passing a transaction log. However, if the user data is different
>>>>>> for every transaction id, it is equivalent.
>>>>>>
>>>>>> 2) NRT refreshes the searcher/reader automatically; it does not call
>>>>>> commit. I based my implementation on
>>>>>> http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage.
>>>>>> In that example a commit is executed synchronously for every CRUD
>>>>>> operation, but in general it is advised to use a batch thread,
>>>>>> because a commit is a long operation. *So it is not clear how to do
>>>>>> the commit in a near-real-time system with an index of unbounded
>>>>>> size.*
>>>>>>
>>>>>> 2.a) If the commit is synchronous, I can use the user data, because
>>>>>> it is set before a commit: every commit has different user data, and
>>>>>> I can trace the transaction changes. But in general a commit can
>>>>>> take minutes to complete, so this does not seem a real solution for
>>>>>> a near-real-time system.
>>>>>>
>>>>>> 2.b) If the commit is async, it is executed every X seconds (or
>>>>>> better, when memory is full). The commit alone cannot then be used
>>>>>> for tracing the transactions, but I can pass a transaction id
>>>>>> associated with each Lucene commit. If I add a mutex around CRUD
>>>>>> (while I load the uncommitted data), I am sure the last uncommitted
>>>>>> index state is aligned with the last transaction id X, so there is
>>>>>> no overlapping, and the CRUD block is very fast when it happens. But
>>>>>> how can I guarantee that the commit corresponds to the last commit
>>>>>> point I loaded? Maybe by introducing that mutex in a custom
>>>>>> MergePolicy?
>>>>>>
>>>>>> Is what I wrote so far correct? Is 2.b the best solution? In that
>>>>>> case, how can I guarantee that the commit is done based on the
>>>>>> uncommitted data loaded at a specific commit point?
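>>>>>> To make 2.b concrete, a minimal sketch following the 4.x/5.x API
>>>>>> from that example (in Lucene 6.2+ setCommitData becomes
>>>>>> setLiveCommitData, and in later versions the reopen thread takes the
>>>>>> IndexWriter directly):
>>>>>>
>>>>>>   TrackingIndexWriter trackingWriter = new TrackingIndexWriter(writer);
>>>>>>   SearcherManager manager =
>>>>>>       new SearcherManager(writer, true, new SearcherFactory());
>>>>>>   // reopen the NRT searcher between 25 ms and 5 s after each change,
>>>>>>   // without committing
>>>>>>   ControlledRealTimeReopenThread<IndexSearcher> reopener =
>>>>>>       new ControlledRealTimeReopenThread<>(trackingWriter, manager, 5.0, 0.025);
>>>>>>   reopener.setDaemon(true);
>>>>>>   reopener.start();
>>>>>>
>>>>>>   // in the async commit thread: tag the commit with the last transaction id
>>>>>>   writer.setCommitData(
>>>>>>       Collections.singletonMap("lastTxId", String.valueOf(lastTxId)));
>>>>>>   writer.commit();
>>>>>>
>>>>>>   // after a restart, read it back to know where to replay from
>>>>>>   Map<String, String> userData =
>>>>>>       DirectoryReader.open(dir).getIndexCommit().getUserData();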
>>>>>> 2017-03-22 15:32 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:
>>>>>>
>>>>>>> Hi, I think you forgot to CC the Lucene user's list
>>>>>>> (java-user@lucene.apache.org) in your reply? Can you resend?
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Mike McCandless
>>>>>>>
>>>>>>> http://blog.mikemccandless.com
>>>>>>>
>>>>>>> On Wed, Mar 22, 2017 at 9:02 AM, Cristian Lorenzetto
>>>>>>> <cristian.lorenze...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, I am thinking about what you told me in the previous message,
>>>>>>>> and about how to solve the corruption problem and the problem of
>>>>>>>> the commit operation being executed asynchronously.
>>>>>>>>
>>>>>>>> I am thinking of creating a simple transaction log in a file,
>>>>>>>> using an atomic long sequence for an orderable transaction id.
>>>>>>>>
>>>>>>>> When I perform a new operation:
>>>>>>>> 1) generate a new incremental transaction id
>>>>>>>> 2) save the abstract info of the operation in the transaction log,
>>>>>>>> associated with the id:
>>>>>>>>    2.a) for insert/update, a serialized version of the object to save
>>>>>>>>    2.b) for delete, the serialized query the delete applies to
>>>>>>>> 3) execute the same operation in Lucene, first adding a
>>>>>>>> transactionId property (executed in RAM)
>>>>>>>> 4) asynchronously, a commit is executed; after the commit, the
>>>>>>>> transaction log up to the last committed transaction id is deleted
>>>>>>>> (I don't know how to insert a block after the commit when using a
>>>>>>>> near-real-time reader and SearcherManager; I might introduce some
>>>>>>>> logic in the way the commit is done). The order is similar to a
>>>>>>>> queue, so it follows the transactionId order. Is there an example
>>>>>>>> of committing a specific set of uncommitted operations?
>>>>>>>> 5) I need the guarantee that after a CRUD operation the data is
>>>>>>>> available in memory for a possibly imminent search, so I think I
>>>>>>>> should execute flush/refreshReader after every CUD operation.
>>>>>>>>
>>>>>>>> If there is a failure, the transaction log will not be empty, and
>>>>>>>> after restart I can re-execute the operations that were not
>>>>>>>> executed. Maybe it could also be useful for fixing a corruption,
>>>>>>>> but is it certain that the corruption does not also touch segments
>>>>>>>> that were already fully committed in the past? Or, for a stable
>>>>>>>> solution, should I save the data in a secondary repository anyway?
>>>>>>>>
>>>>>>>> In your opinion, will this solution be sufficient? Is it a good
>>>>>>>> solution for you? Am I forgetting some aspects? A sketch of the
>>>>>>>> write path I have in mind follows below.
>>>>>>>>
>>>>>>>> PS: Another interesting aspect could be to associate with each
>>>>>>>> transaction the segment it went into. That way, if a segment is
>>>>>>>> missing, I can apply it again without rebuilding the whole index
>>>>>>>> from scratch.
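>>>>>>>> A rough sketch of that write path (TxLog, Entity, serialize and
>>>>>>>> toDocument are hypothetical helpers of mine; the Lucene calls are
>>>>>>>> real):
>>>>>>>>
>>>>>>>>   private final AtomicLong txSeq = new AtomicLong();
>>>>>>>>
>>>>>>>>   long save(Entity e) throws IOException {
>>>>>>>>       long txId = txSeq.incrementAndGet();   // 1) new incremental id
>>>>>>>>       txLog.append(txId, serialize(e));      // 2) log first, durably
>>>>>>>>       Document doc = toDocument(e);          // 3) same op in Lucene (RAM)
>>>>>>>>       doc.add(new StringField("transactionId",
>>>>>>>>               String.valueOf(txId), Field.Store.YES));
>>>>>>>>       writer.updateDocument(new Term("id", e.getId()), doc);
>>>>>>>>       searcherManager.maybeRefresh();        // 5) visible to searches soon
>>>>>>>>       return txId;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // 4) in the background commit thread:
>>>>>>>>   long lastTx = txSeq.get();    // snapshot before committing
>>>>>>>>   writer.commit();              // everything up to lastTx is now durable
>>>>>>>>   txLog.truncateUpTo(lastTx);   // safe: later operations stay in the log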
>>>>>>>> 2017-03-21 0:58 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:
>>>>>>>>
>>>>>>>>> You can use Lucene's CheckIndex tool with the -exorcise option,
>>>>>>>>> but this is quite brutal: it simply drops any segment that has
>>>>>>>>> corruption it detects.
>>>>>>>>>
>>>>>>>>> Mike McCandless
>>>>>>>>>
>>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>>
>>>>>>>>> On Mon, Mar 20, 2017 at 4:44 PM, Marco Reis <m...@marcoreis.net> wrote:
>>>>>>>>>
>>>>>>>>>> I'm afraid it's not possible to rebuild the index. It's important
>>>>>>>>>> to maintain a backup policy because of that.
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 20, 2017 at 5:12 PM Cristian Lorenzetto
>>>>>>>>>> <cristian.lorenze...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> > Can Lucene rebuild the index using its internal info, and how?
>>>>>>>>>> > Or do I have to reinsert everything some other way?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Marco Reis
>>>>>>>>>> Software Architect
>>>>>>>>>> http://marcoreis.net
>>>>>>>>>> https://github.com/masreis
>>>>>>>>>> +55 61 9 81194620