I deduce the transaction range not from the corrupted segment but from the intact segments. The transaction id is incremental, and I assume segments are written sequentially, so if segment 5 is missing, reading the intact segment 4 gives me the maximum transaction id A, and reading segment 6 gives me the minimum transaction id B; the hole is therefore the range [A+1, B-1] ... With a database query I can then reload the corresponding documents and add the missing documents to Lucene again.
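A minimal sketch of that hole computation, assuming Lucene 6.x (random-access NumericDocValues) and a hypothetical per-document long doc-values field named "txId"; deleted documents are ignored for brevity:

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

/** Sketch: derive the transaction-id hole left by a lost segment from the
 *  two intact segments around it (Lucene 6.x; the "txId" field is made up). */
final class TxHole {

    /** Highest txId in an intact segment (e.g. segment 4 -> A). */
    static long maxTxId(LeafReader segment) throws IOException {
        NumericDocValues txIds = segment.getNumericDocValues("txId");
        long max = Long.MIN_VALUE;
        for (int doc = 0; doc < segment.maxDoc(); doc++) {
            max = Math.max(max, txIds.get(doc));   // ignores deletions for brevity
        }
        return max;
    }

    /** Lowest txId in an intact segment (e.g. segment 6 -> B). */
    static long minTxId(LeafReader segment) throws IOException {
        NumericDocValues txIds = segment.getNumericDocValues("txId");
        long min = Long.MAX_VALUE;
        for (int doc = 0; doc < segment.maxDoc(); doc++) {
            min = Math.min(min, txIds.get(doc));
        }
        return min;
    }
}
```

Each element of DirectoryReader.leaves() corresponds to one segment, so segment 4 and segment 6 map to two LeafReaders; the missing range is then [maxTxId(seg4) + 1, minTxId(seg6) - 1], and a query along the lines of SELECT ... WHERE tx_id BETWEEN ? AND ? turns it back into rows to re-index.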
2017-03-23 15:17 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:

Lucene corruption should be rare and only due to bad hardware; if you are seeing otherwise we really should get to the root cause.

Mapping documents to each segment will not be easy in general, especially if that segment is now corrupted so you can't search it.

Documents lost because of power loss / OS crash while indexing can be more common, and it's for that use case that the sequence numbers / transaction log should be helpful.

Mike McCandless
http://blog.mikemccandless.com

On Thu, Mar 23, 2017 at 10:12 AM, Cristian Lorenzetto <cristian.lorenze...@gmail.com> wrote:

Yes, exactly. Working in the past on systems built on Lucene (for example, Alfresco projects), I saw index corruption happen from time to time, and each rebuild takes a long time ... so I was thinking about a way to speed up the repair of a corrupted index. There is also a rare case not described here: if Lucene throws an exception after a database commit (for example, because the disk is full), the database and the Lucene index can get out of sync. With this system those problems could be fixed automatically. In the database every row carries a transaction-id property, so if I know that segment 6 is missing in Lucene and corresponds to the transaction range [1000, 1050], I can reload just the corresponding rows with a single database query.

2017-03-23 14:59 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:

You should be able to use the sequence numbers returned by IndexWriter operations to "know" which operations made it into the commit and which did not, and then on disaster recovery replay only those operations that didn't make it?

Mike McCandless
http://blog.mikemccandless.com
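A compact sketch of that suggestion, assuming Lucene >= 6.2, where IndexWriter operations return a long sequence number and commit() returns the highest sequence number made durable; the TxLog write-ahead-log API here is hypothetical, only its shape matters:

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

/** Sketch: pair each transaction-log entry with the seqNo Lucene assigned,
 *  and after a crash replay only the entries that missed the last commit. */
final class SeqNoRecovery {

    /** Hypothetical write-ahead log; not a Lucene API. */
    interface TxLog {
        void noteSeqNo(long txId, long seqNo);   // remember Lucene's seqNo per tx
        void truncateUpTo(long seqNo);           // forget entries known durable
        Iterable<Entry> pendingEntries();        // entries not yet known durable
        interface Entry { String docId(); Document document(); }
    }

    static void indexOne(IndexWriter writer, TxLog log, long txId, Document doc)
            throws IOException {
        long seqNo = writer.updateDocument(new Term("id", doc.get("id")), doc);
        log.noteSeqNo(txId, seqNo);              // pair the tx with Lucene's seqNo
    }

    static void checkpoint(IndexWriter writer, TxLog log) throws IOException {
        long durable = writer.commit();          // ops with seqNo <= durable are safe
        log.truncateUpTo(durable);
    }

    static void recover(IndexWriter writer, TxLog log) throws IOException {
        // After a crash: replay only the operations that missed the last commit.
        // updateDocument keys the replay by id, which makes it idempotent.
        for (TxLog.Entry e : log.pendingEntries()) {
            writer.updateDocument(new Term("id", e.docId()), e.document());
        }
        writer.commit();
    }
}
```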
On Thu, Mar 23, 2017 at 5:53 AM, Cristian Lorenzetto <cristian.lorenze...@gmail.com> wrote:

Errata corrige / additions to the questions in my previous post.

I studied these Lucene classes a bit to understand:
1) setCommitData is designed for versioning the index, not for carrying a transaction log. However, if the user data differs for every transaction id, it is equivalent.
2) NRT automatically refreshes the searcher/reader; it does not call commit. I based my implementation on http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage. In that example a commit is executed synchronously for every CRUD operation, but in general it is advised to commit from a background batch thread, because commit is a long operation. *So it is not clear how to do the commit in a near-real-time system with an index of unbounded size.*
2.a) If the commit is synchronous, I can use the user data, because it is set just before a commit: every commit has different user data and I can trace the transaction changes. But in general a commit can take minutes to complete, so it does not seem a realistic option for a near-real-time system.
2.b) If the commit is asynchronous, executed every X seconds (or better, when memory fills up), the commit itself cannot be used to trace transactions, but I can attach a transaction id to each Lucene commit. If I add a mutex around CRUD operations (while loading the uncommitted data), I can be sure the last uncommitted index state is aligned with the last transaction id X, so there is no overlap, and the CRUD block is very short when it happens. But how do I guarantee that the commit corresponds to the last commit point I loaded? Maybe by introducing that mutex in a custom MergePolicy?
Is what I wrote so far correct? Is 2.b the best solution? In that case, how do I guarantee the commit is performed on the uncommitted data loaded at a specific commit point?

2017-03-22 15:32 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:

Hi, I think you forgot to CC the lucene user's list (java-user@lucene.apache.org) in your reply? Can you resend?

Thanks.

Mike McCandless
http://blog.mikemccandless.com

On Wed, Mar 22, 2017 at 9:02 AM, Cristian Lorenzetto <cristian.lorenze...@gmail.com> wrote:

Hi, I am thinking about what you told me in your previous message, and about how to solve both the corruption problem and the problem of the commit being executed asynchronously.

I am thinking of keeping a simple transaction log in a file, using an atomic long sequence as an orderable transaction id.

When I perform a new operation:
1) generate a new incremental transaction id;
2) save an abstract description of the operation in the transaction log, associated with that id:
   2.a) for insert/update, a serialized version of the object to save;
   2.b) for delete, the serialized query the delete applies to;
3) execute the same operation in Lucene (in RAM), first adding a transactionId property;
4) commit asynchronously; after the commit, the transaction log up to the last committed transaction id is deleted. (I don't know how to insert a block after commit when using a near-real-time reader and SearcherManager.) I might add some logic to the way a commit is done; the order is similar to a queue, so it follows the transaction-id order. Is there an example of committing a specific set of uncommitted operations?
5) I need the guarantee that after a CRUD operation the data is available in memory for a possibly imminent search, so I think I should flush/refresh the reader after every CUD operation.

If there is a failure the transaction log will not be empty, and I can re-execute the unapplied operations after restart. Maybe it could also be useful for fixing a corruption, but is it certain that the corruption does not also touch segments that were already fully committed in the past? Or, for a robust solution, should I save the data in a secondary repository anyway?

In your opinion, is this solution sufficient? Is it a good solution for you, or am I forgetting some aspects?

PS Another interesting aspect could be to associate a segment with a transaction. That way, if a segment is missing, I can apply it again without rebuilding the whole index from scratch.
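A sketch of how option 2.b above could look with Lucene 6.x APIs: NRT visibility comes from ControlledRealTimeReopenThread, a background thread commits periodically, and each commit is stamped via setCommitData with the highest transaction id it is known to cover. Field/key names, the index path, and the intervals are illustrative:

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.ControlledRealTimeReopenThread;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Sketch of option 2.b: NRT reopen for visibility, periodic async commits,
 *  and the covered transaction id recorded in the commit user data. */
public final class NrtWithTxCommits {

    final AtomicLong lastTxId = new AtomicLong();   // bumped by the CRUD path (not shown)

    void start(IndexWriter writer, SearcherManager manager) {
        // NRT: reopen the searcher 25 ms - 1 s after a change; no commit involved.
        ControlledRealTimeReopenThread<IndexSearcher> reopener =
            new ControlledRealTimeReopenThread<>(writer, manager, 1.0, 0.025);
        reopener.setDaemon(true);
        reopener.start();

        // Background committer: snapshot the tx id *before* committing, so the
        // stamp is a lower bound of what the commit contains; resuming the log
        // from it is safe as long as replay is idempotent.
        Thread committer = new Thread(() -> {
            try {
                while (true) {
                    long covered = lastTxId.get();
                    writer.setCommitData(Collections.singletonMap("txId", Long.toString(covered)));
                    writer.commit();
                    Thread.sleep(5_000);
                }
            } catch (IOException | InterruptedException e) { /* shut down / log */ }
        });
        committer.setDaemon(true);
        committer.start();
    }

    /** On restart: the last commit's user data says where to resume the log. */
    static long resumeAfter(Directory dir) throws IOException {
        try (DirectoryReader r = DirectoryReader.open(dir)) {
            return Long.parseLong(r.getIndexCommit().getUserData().getOrDefault("txId", "0"));
        }
    }

    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/idx"));   // placeholder path
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        new NrtWithTxCommits().start(writer, new SearcherManager(writer, null));
    }
}
```

Taking the snapshot before the commit, rather than after, is what sidesteps the mutex question in 2.b: operations that race with the commit may already be durable, but replaying them again by id is harmless.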
2017-03-21 0:58 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:

You can use Lucene's CheckIndex tool with the -exorcise option, but this is quite brutal: it simply drops any segment in which it detects corruption.

Mike McCandless
http://blog.mikemccandless.com

On Mon, Mar 20, 2017 at 4:44 PM, Marco Reis <m...@marcoreis.net> wrote:

I'm afraid it's not possible to rebuild the index. It's important to maintain a backup policy because of that.

On Mon, Mar 20, 2017 at 5:12 PM Cristian Lorenzetto <cristian.lorenze...@gmail.com> wrote:

Can Lucene rebuild the index using its internal info, and how? Or do I have to reinsert everything some other way?

--
Marco Reis
Software Architect
http://marcoreis.net
https://github.com/masreis
+55 61 9 81194620
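The CheckIndex route Mike points to above can also be run programmatically; a minimal sketch against that API (the index path is a placeholder), with the same caveat that every document in a dropped segment is lost and must be re-indexed:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Sketch: verify an index and, only if corruption is found, exorcise it.
 *  Back the index up first; exorcising drops corrupt segments for good. */
public final class ExorciseExample {
    public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
             CheckIndex checker = new CheckIndex(dir)) {
            CheckIndex.Status status = checker.checkIndex();
            if (!status.clean) {
                checker.exorciseIndex(status);   // permanently removes corrupt segments
            }
        }
    }
}
```

The command-line equivalent is `java -cp lucene-core.jar org.apache.lucene.index.CheckIndex <indexDir> -exorcise`.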