I deduce the transaction range not from the corrupted segment but from the intact segments. The transaction id is incremental, and I assume segments are written sequentially, so if segment 5 is missing, reading the intact segment 4 gives me the maximum transaction id A, and reading segment 6 gives me the minimum transaction id B; the hole is therefore the range [A+1, B-1] ... With a database query I can then reload the corresponding documents and add the missing documents to Lucene again.
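A minimal sketch of that hole computation, assuming Lucene 6.x (random-access NumericDocValues) and a hypothetical per-document long doc-values field named "txId"; deleted documents are ignored for brevity:

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

/** Sketch: derive the transaction-id hole left by a lost segment from the
 *  two intact segments around it (Lucene 6.x; the "txId" field is made up). */
final class TxHole {

    /** Highest txId in an intact segment (e.g. segment 4 -> A). */
    static long maxTxId(LeafReader segment) throws IOException {
        NumericDocValues txIds = segment.getNumericDocValues("txId");
        long max = Long.MIN_VALUE;
        for (int doc = 0; doc < segment.maxDoc(); doc++) {
            max = Math.max(max, txIds.get(doc));   // ignores deletions for brevity
        }
        return max;
    }

    /** Lowest txId in an intact segment (e.g. segment 6 -> B). */
    static long minTxId(LeafReader segment) throws IOException {
        NumericDocValues txIds = segment.getNumericDocValues("txId");
        long min = Long.MAX_VALUE;
        for (int doc = 0; doc < segment.maxDoc(); doc++) {
            min = Math.min(min, txIds.get(doc));
        }
        return min;
    }
}
```

Each element of DirectoryReader.leaves() corresponds to one segment, so segment 4 and segment 6 map to two LeafReaders; the missing range is then [maxTxId(seg4) + 1, minTxId(seg6) - 1], and a query along the lines of SELECT ... WHERE tx_id BETWEEN ? AND ? turns it back into rows to re-index.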
2017-03-23 15:17 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:

Lucene corruption should be rare and only due to bad hardware; if you are seeing otherwise we really should get to the root cause.

Mapping documents to each segment will not be easy in general, especially if that segment is now corrupted so you can't search it.

Documents lost because of power loss / OS crash while indexing can be more common, and it's for that use case that the sequence numbers / transaction log should be helpful.

Mike McCandless
http://blog.mikemccandless.com

On Thu, Mar 23, 2017 at 10:12 AM, Cristian Lorenzetto <cristian.lorenze...@gmail.com> wrote:

Yes, exactly. Working in the past on systems built on Lucene (for example, Alfresco projects), I saw index corruption happen from time to time, and each rebuild takes a long time ... so I was thinking about a way to speed up the repair of a corrupted index. There is also a rare case not described here: if Lucene throws an exception after a database commit (for example, because the disk is full), the database and the Lucene index can get out of sync. With this system those problems could be fixed automatically. In the database every row carries a transaction-id property, so if I know that segment 6 is missing in Lucene and corresponds to the transaction range [1000, 1050], I can reload just the corresponding rows with a single database query.

2017-03-23 14:59 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:

You should be able to use the sequence numbers returned by IndexWriter operations to "know" which operations made it into the commit and which did not, and then on disaster recovery replay only those operations that didn't make it?

Mike McCandless
http://blog.mikemccandless.com
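A compact sketch of that suggestion, assuming Lucene >= 6.2, where IndexWriter operations return a long sequence number and commit() returns the highest sequence number made durable; the TxLog write-ahead-log API here is hypothetical, only its shape matters:

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

/** Sketch: pair each transaction-log entry with the seqNo Lucene assigned,
 *  and after a crash replay only the entries that missed the last commit. */
final class SeqNoRecovery {

    /** Hypothetical write-ahead log; not a Lucene API. */
    interface TxLog {
        void noteSeqNo(long txId, long seqNo);   // remember Lucene's seqNo per tx
        void truncateUpTo(long seqNo);           // forget entries known durable
        Iterable<Entry> pendingEntries();        // entries not yet known durable
        interface Entry { String docId(); Document document(); }
    }

    static void indexOne(IndexWriter writer, TxLog log, long txId, Document doc)
            throws IOException {
        long seqNo = writer.updateDocument(new Term("id", doc.get("id")), doc);
        log.noteSeqNo(txId, seqNo);              // pair the tx with Lucene's seqNo
    }

    static void checkpoint(IndexWriter writer, TxLog log) throws IOException {
        long durable = writer.commit();          // ops with seqNo <= durable are safe
        log.truncateUpTo(durable);
    }

    static void recover(IndexWriter writer, TxLog log) throws IOException {
        // After a crash: replay only the operations that missed the last commit.
        // updateDocument keys the replay by id, which makes it idempotent.
        for (TxLog.Entry e : log.pendingEntries()) {
            writer.updateDocument(new Term("id", e.docId()), e.document());
        }
        writer.commit();
    }
}
```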
On Thu, Mar 23, 2017 at 5:53 AM, Cristian Lorenzetto <cristian.lorenze...@gmail.com> wrote:

Errata corrige / additions to the questions in my previous post.

I studied these Lucene classes a bit to understand:
1) setCommitData is designed for versioning the index, not for carrying a transaction log. However, if the user data differs for every transaction id, it is equivalent.
2) NRT automatically refreshes the searcher/reader; it does not call commit. I based my implementation on http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage. In that example a commit is executed synchronously for every CRUD operation, but in general it is advised to commit from a background batch thread, because commit is a long operation. *So it is not clear how to do the commit in a near-real-time system with an index of unbounded size.*
2.a) If the commit is synchronous, I can use the user data, because it is set just before a commit: every commit has different user data and I can trace the transaction changes. But in general a commit can take minutes to complete, so it does not seem a realistic option for a near-real-time system.
2.b) If the commit is asynchronous, executed every X seconds (or better, when memory fills up), the commit itself cannot be used to trace transactions, but I can attach a transaction id to each Lucene commit. If I add a mutex around CRUD operations (while loading the uncommitted data), I can be sure the last uncommitted index state is aligned with the last transaction id X, so there is no overlap, and the CRUD block is very short when it happens. But how do I guarantee that the commit corresponds to the last commit point I loaded? Maybe by introducing that mutex in a custom MergePolicy?
Is what I wrote so far correct? Is 2.b the best solution? In that case, how do I guarantee the commit is performed on the uncommitted data loaded at a specific commit point?

2017-03-22 15:32 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:

Hi, I think you forgot to CC the lucene user's list (java-user@lucene.apache.org) in your reply? Can you resend?

Thanks.

Mike McCandless
http://blog.mikemccandless.com

On Wed, Mar 22, 2017 at 9:02 AM, Cristian Lorenzetto <cristian.lorenze...@gmail.com> wrote:

Hi, I am thinking about what you told me in your previous message, and about how to solve both the corruption problem and the problem of the commit being executed asynchronously.

I am thinking of keeping a simple transaction log in a file, using an atomic long sequence as an orderable transaction id.

When I perform a new operation:
1) generate a new incremental transaction id;
2) save an abstract description of the operation in the transaction log, associated with that id:
   2.a) for insert/update, a serialized version of the object to save;
   2.b) for delete, the serialized query the delete applies to;
3) execute the same operation in Lucene (in RAM), first adding a transactionId property;
4) commit asynchronously; after the commit, the transaction log up to the last committed transaction id is deleted. (I don't know how to insert a block after commit when using a near-real-time reader and SearcherManager.) I might add some logic to the way a commit is done; the order is similar to a queue, so it follows the transaction-id order. Is there an example of committing a specific set of uncommitted operations?
5) I need the guarantee that after a CRUD operation the data is available in memory for a possibly imminent search, so I think I should flush/refresh the reader after every CUD operation.

If there is a failure the transaction log will not be empty, and I can re-execute the unapplied operations after restart. Maybe it could also be useful for fixing a corruption, but is it certain that the corruption does not also touch segments that were already fully committed in the past? Or, for a robust solution, should I save the data in a secondary repository anyway?

In your opinion, is this solution sufficient? Is it a good solution for you, or am I forgetting some aspects?

PS Another interesting aspect could be to associate a segment with a transaction. That way, if a segment is missing, I can apply it again without rebuilding the whole index from scratch.
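A sketch of how option 2.b above could look with Lucene 6.x APIs: NRT visibility comes from ControlledRealTimeReopenThread, a background thread commits periodically, and each commit is stamped via setCommitData with the highest transaction id it is known to cover. Field/key names, the index path, and the intervals are illustrative:

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.ControlledRealTimeReopenThread;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Sketch of option 2.b: NRT reopen for visibility, periodic async commits,
 *  and the covered transaction id recorded in the commit user data. */
public final class NrtWithTxCommits {

    final AtomicLong lastTxId = new AtomicLong();   // bumped by the CRUD path (not shown)

    void start(IndexWriter writer, SearcherManager manager) {
        // NRT: reopen the searcher 25 ms - 1 s after a change; no commit involved.
        ControlledRealTimeReopenThread<IndexSearcher> reopener =
            new ControlledRealTimeReopenThread<>(writer, manager, 1.0, 0.025);
        reopener.setDaemon(true);
        reopener.start();

        // Background committer: snapshot the tx id *before* committing, so the
        // stamp is a lower bound of what the commit contains; resuming the log
        // from it is safe as long as replay is idempotent.
        Thread committer = new Thread(() -> {
            try {
                while (true) {
                    long covered = lastTxId.get();
                    writer.setCommitData(Collections.singletonMap("txId", Long.toString(covered)));
                    writer.commit();
                    Thread.sleep(5_000);
                }
            } catch (IOException | InterruptedException e) { /* shut down / log */ }
        });
        committer.setDaemon(true);
        committer.start();
    }

    /** On restart: the last commit's user data says where to resume the log. */
    static long resumeAfter(Directory dir) throws IOException {
        try (DirectoryReader r = DirectoryReader.open(dir)) {
            return Long.parseLong(r.getIndexCommit().getUserData().getOrDefault("txId", "0"));
        }
    }

    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/idx"));   // placeholder path
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        new NrtWithTxCommits().start(writer, new SearcherManager(writer, null));
    }
}
```

Taking the snapshot before the commit, rather than after, is what sidesteps the mutex question in 2.b: operations that race with the commit may already be durable, but replaying them again by id is harmless.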
2017-03-21 0:58 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:

You can use Lucene's CheckIndex tool with the -exorcise option, but this is quite brutal: it simply drops any segment in which it detects corruption.

Mike McCandless
http://blog.mikemccandless.com

On Mon, Mar 20, 2017 at 4:44 PM, Marco Reis <m...@marcoreis.net> wrote:

I'm afraid it's not possible to rebuild the index. It's important to maintain a backup policy because of that.

On Mon, Mar 20, 2017 at 5:12 PM Cristian Lorenzetto <cristian.lorenze...@gmail.com> wrote:

Can Lucene rebuild the index using its internal info, and how? Or do I have to reinsert everything some other way?

--
Marco Reis
Software Architect
http://marcoreis.net
https://github.com/masreis
+55 61 9 81194620
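The CheckIndex route Mike points to above can also be run programmatically; a minimal sketch against that API (the index path is a placeholder), with the same caveat that every document in a dropped segment is lost and must be re-indexed:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Sketch: verify an index and, only if corruption is found, exorcise it.
 *  Back the index up first; exorcising drops corrupt segments for good. */
public final class ExorciseExample {
    public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
             CheckIndex checker = new CheckIndex(dir)) {
            CheckIndex.Status status = checker.checkIndex();
            if (!status.clean) {
                checker.exorciseIndex(status);   // permanently removes corrupt segments
            }
        }
    }
}
```

The command-line equivalent is `java -cp lucene-core.jar org.apache.lucene.index.CheckIndex <indexDir> -exorcise`.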