You are right, but maybe it is possible to solve this problem. I can try :) I am not sure, but in NRT, using a single committer, there is a single batch thread executing the commits, so they should be sequential.

I think your case is when 2 segments are not yet merged and contain changes to the same entities. I imagine this case can happen only until the segments are merged into a single segment. So practically, if I add extra information about deletions, there is no risk of consuming too much disk, because it is a temporary state, not cumulative. In the database I have a special table "deletion_table" with columns (transactionId, entityId). For example:

document A inserted in segment 1 with transaction 5
document B inserted in segment 1 with transaction 5
document C inserted in segment 1 with transaction 6
document A deleted in segment 2 with transaction 7, and I save 7 -> A in the deletion table until the segments are merged

Now suppose segment 2 is corrupted. Searching the range [7, *] in the database, I look for transaction 7: I don't find it in the document tables, but I find 7 -> A in the deletion table. I check that entity A is not present in the document tables, so I can deduce that I have to replay the deletion of entity A in Lucene, removing that document.
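A rough sketch of that replay check (DeletionRecord and the db helper methods are hypothetical names of mine; deleteDocuments is the real Lucene call):

    // After dropping the corrupted segment, re-scan the deletion table over
    // the transaction range the lost segment covered.
    for (DeletionRecord rec : db.queryDeletionTable(fromTxId, toTxId)) {
        if (!db.documentTablesContain(rec.entityId)) {
            // the entity is gone from the DB: replay the delete in Lucene
            writer.deleteDocuments(new Term("entityId", rec.entityId));
        }
    }
    writer.commit();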
2017-03-23 16:18 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:

> If you use a single thread then, yes, segments are sequential.
>
> But if e.g. you are updating documents, then deletions (because a document
> was replaced) are recorded against different segments, so merely dropping
> the corrupted segment will mean you don't drop the deletions.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Mar 23, 2017 at 10:29 AM, Cristian Lorenzetto
> <cristian.lorenze...@gmail.com> wrote:
>
>> I deduce the transaction range not from the corrupted segment but from
>> the intact segments. The transaction id is incremental, and I imagine
>> segments are saved sequentially, so if segment 5 is missing, reading the
>> intact segment 4 I can find the maximum transaction id A, and reading
>> segment 6 I can find the minimum transaction id B. So I can deduce the
>> hole: the range is [A+1, B-1]. With a query in the database I reload the
>> corresponding rows and add the missing documents to Lucene again.
>>
>> 2017-03-23 15:17 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:
>>
>>> Lucene corruption should be rare and only due to bad hardware; if you
>>> are seeing otherwise we really should get to the root cause.
>>>
>>> Mapping documents to each segment will not be easy in general,
>>> especially if that segment is now corrupted so you can't search it.
>>>
>>> Documents lost because of power loss / OS crash while indexing can be
>>> more common, and it's for that use case that the sequence numbers /
>>> transaction log should be helpful.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Mar 23, 2017 at 10:12 AM, Cristian Lorenzetto
>>> <cristian.lorenze...@gmail.com> wrote:
>>>
>>>> Yes, exactly. Working in the past on systems that use Lucene (for
>>>> example Alfresco projects), I saw that Lucene corruption happens
>>>> sometimes, and every time rebuilding takes a lot of time... so I
>>>> thought of a way to accelerate fixing a corrupted index. In addition,
>>>> there is a rare case not described here: if Lucene throws an exception
>>>> after a database commit (for example, the disk is full), the database
>>>> and the Lucene index can become misaligned. With this system these
>>>> problems could be fixed automatically. In the database every row has a
>>>> property with the transaction id, so if I know Lucene is missing
>>>> segment 6, corresponding to the transaction range [1000, 1050], I can
>>>> reload just the corresponding rows with a database query.
>>>>
>>>> 2017-03-23 14:59 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:
>>>>
>>>>> You should be able to use the sequence numbers returned by
>>>>> IndexWriter operations to "know" which operations made it into the
>>>>> commit and which did not, and then on disaster recovery replay only
>>>>> those operations that didn't make it?
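>>>>> E.g., something like this sketch (IndexWriter operations return
>>>>> sequence numbers since Lucene 6.2):
>>>>>
>>>>>   long opSeqNo = writer.updateDocument(new Term("id", id), doc);
>>>>>   ...
>>>>>   long commitSeqNo = writer.commit();
>>>>>   // operations with opSeqNo <= commitSeqNo made it into the commit;
>>>>>   // on recovery, replay only the operations above commitSeqNo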
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Thu, Mar 23, 2017 at 5:53 AM, Cristian Lorenzetto
>>>>> <cristian.lorenze...@gmail.com> wrote:
>>>>>
>>>>>> Errata corrige / integration for the questions in my previous post.
>>>>>>
>>>>>> I studied these Lucene classes a bit, to understand:
>>>>>>
>>>>>> 1) setCommitData is designed for versioning the index, not for
>>>>>> passing a transaction log. However, if the user data is different
>>>>>> for every transaction id, it is equivalent.
>>>>>>
>>>>>> 2) NRT refreshes the searcher/reader automatically; it does not call
>>>>>> commit. I based my implementation on
>>>>>> http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage.
>>>>>> In that example a commit is executed synchronously for every CRUD
>>>>>> operation, but in general it is advised to use a batch thread,
>>>>>> because a commit is a long operation. *So it is not clear how to do
>>>>>> the commit in a near-real-time system with an index of unbounded
>>>>>> size.*
>>>>>>
>>>>>> 2.a) If the commit is synchronous, I can use the user data, because
>>>>>> it is set before a commit: every commit has different user data, and
>>>>>> I can trace the transaction changes. But in general a commit can
>>>>>> take minutes to complete, so this does not seem a real solution for
>>>>>> a near-real-time system.
>>>>>>
>>>>>> 2.b) If the commit is async, it is executed every X seconds (or
>>>>>> better, when memory is full). The commit alone cannot then be used
>>>>>> for tracing the transactions, but I can pass a transaction id
>>>>>> associated with each Lucene commit. If I add a mutex around CRUD
>>>>>> (while I load the uncommitted data), I am sure the last uncommitted
>>>>>> index state is aligned with the last transaction id X, so there is
>>>>>> no overlapping, and the CRUD block is very fast when it happens. But
>>>>>> how can I guarantee that the commit corresponds to the last commit
>>>>>> point I loaded? Maybe by introducing that mutex in a custom
>>>>>> MergePolicy?
>>>>>>
>>>>>> Is what I wrote so far correct? Is 2.b the best solution? In that
>>>>>> case, how can I guarantee that the commit is done based on the
>>>>>> uncommitted data loaded at a specific commit point?
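>>>>>> To make 2.b concrete, a minimal sketch following the 4.x/5.x API
>>>>>> from that example (in Lucene 6.2+ setCommitData becomes
>>>>>> setLiveCommitData, and in later versions the reopen thread takes the
>>>>>> IndexWriter directly):
>>>>>>
>>>>>>   TrackingIndexWriter trackingWriter = new TrackingIndexWriter(writer);
>>>>>>   SearcherManager manager =
>>>>>>       new SearcherManager(writer, true, new SearcherFactory());
>>>>>>   // reopen the NRT searcher between 25 ms and 5 s after each change,
>>>>>>   // without committing
>>>>>>   ControlledRealTimeReopenThread<IndexSearcher> reopener =
>>>>>>       new ControlledRealTimeReopenThread<>(trackingWriter, manager, 5.0, 0.025);
>>>>>>   reopener.setDaemon(true);
>>>>>>   reopener.start();
>>>>>>
>>>>>>   // in the async commit thread: tag the commit with the last transaction id
>>>>>>   writer.setCommitData(
>>>>>>       Collections.singletonMap("lastTxId", String.valueOf(lastTxId)));
>>>>>>   writer.commit();
>>>>>>
>>>>>>   // after a restart, read it back to know where to replay from
>>>>>>   Map<String, String> userData =
>>>>>>       DirectoryReader.open(dir).getIndexCommit().getUserData();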
>>>>>> 2017-03-22 15:32 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:
>>>>>>
>>>>>>> Hi, I think you forgot to CC the Lucene user's list
>>>>>>> (java-user@lucene.apache.org) in your reply? Can you resend?
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Mike McCandless
>>>>>>>
>>>>>>> http://blog.mikemccandless.com
>>>>>>>
>>>>>>> On Wed, Mar 22, 2017 at 9:02 AM, Cristian Lorenzetto
>>>>>>> <cristian.lorenze...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, I am thinking about what you told me in the previous message,
>>>>>>>> and about how to solve the corruption problem and the problem of
>>>>>>>> the commit operation being executed asynchronously.
>>>>>>>>
>>>>>>>> I am thinking of creating a simple transaction log in a file,
>>>>>>>> using an atomic long sequence for an orderable transaction id.
>>>>>>>>
>>>>>>>> When I perform a new operation:
>>>>>>>> 1) generate a new incremental transaction id
>>>>>>>> 2) save the abstract info of the operation in the transaction log,
>>>>>>>> associated with the id:
>>>>>>>>    2.a) for insert/update, a serialized version of the object to save
>>>>>>>>    2.b) for delete, the serialized query the delete applies to
>>>>>>>> 3) execute the same operation in Lucene, first adding a
>>>>>>>> transactionId property (executed in RAM)
>>>>>>>> 4) asynchronously, a commit is executed; after the commit, the
>>>>>>>> transaction log up to the last committed transaction id is deleted
>>>>>>>> (I don't know how to insert a block after the commit when using a
>>>>>>>> near-real-time reader and SearcherManager; I might introduce some
>>>>>>>> logic in the way the commit is done). The order is similar to a
>>>>>>>> queue, so it follows the transactionId order. Is there an example
>>>>>>>> of committing a specific set of uncommitted operations?
>>>>>>>> 5) I need the guarantee that after a CRUD operation the data is
>>>>>>>> available in memory for a possibly imminent search, so I think I
>>>>>>>> should execute flush/refreshReader after every CUD operation.
>>>>>>>>
>>>>>>>> If there is a failure, the transaction log will not be empty, and
>>>>>>>> after restart I can re-execute the operations that were not
>>>>>>>> executed. Maybe it could also be useful for fixing a corruption,
>>>>>>>> but is it certain that the corruption does not also touch segments
>>>>>>>> that were already fully committed in the past? Or, for a stable
>>>>>>>> solution, should I save the data in a secondary repository anyway?
>>>>>>>>
>>>>>>>> In your opinion, will this solution be sufficient? Is it a good
>>>>>>>> solution for you? Am I forgetting some aspects? A sketch of the
>>>>>>>> write path I have in mind follows below.
>>>>>>>>
>>>>>>>> PS: Another interesting aspect could be to associate with each
>>>>>>>> transaction the segment it went into. That way, if a segment is
>>>>>>>> missing, I can apply it again without rebuilding the whole index
>>>>>>>> from scratch.
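>>>>>>>> A rough sketch of that write path (TxLog, Entity, serialize and
>>>>>>>> toDocument are hypothetical helpers of mine; the Lucene calls are
>>>>>>>> real):
>>>>>>>>
>>>>>>>>   private final AtomicLong txSeq = new AtomicLong();
>>>>>>>>
>>>>>>>>   long save(Entity e) throws IOException {
>>>>>>>>       long txId = txSeq.incrementAndGet();   // 1) new incremental id
>>>>>>>>       txLog.append(txId, serialize(e));      // 2) log first, durably
>>>>>>>>       Document doc = toDocument(e);          // 3) same op in Lucene (RAM)
>>>>>>>>       doc.add(new StringField("transactionId",
>>>>>>>>               String.valueOf(txId), Field.Store.YES));
>>>>>>>>       writer.updateDocument(new Term("id", e.getId()), doc);
>>>>>>>>       searcherManager.maybeRefresh();        // 5) visible to searches soon
>>>>>>>>       return txId;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // 4) in the background commit thread:
>>>>>>>>   long lastTx = txSeq.get();    // snapshot before committing
>>>>>>>>   writer.commit();              // everything up to lastTx is now durable
>>>>>>>>   txLog.truncateUpTo(lastTx);   // safe: later operations stay in the log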
>>>>>>>> 2017-03-21 0:58 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:
>>>>>>>>
>>>>>>>>> You can use Lucene's CheckIndex tool with the -exorcise option,
>>>>>>>>> but this is quite brutal: it simply drops any segment that has
>>>>>>>>> corruption it detects.
>>>>>>>>>
>>>>>>>>> Mike McCandless
>>>>>>>>>
>>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>>
>>>>>>>>> On Mon, Mar 20, 2017 at 4:44 PM, Marco Reis <m...@marcoreis.net> wrote:
>>>>>>>>>
>>>>>>>>>> I'm afraid it's not possible to rebuild the index. It's important
>>>>>>>>>> to maintain a backup policy because of that.
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 20, 2017 at 5:12 PM Cristian Lorenzetto
>>>>>>>>>> <cristian.lorenze...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> > Can Lucene rebuild the index using its internal info, and how?
>>>>>>>>>> > Or do I have to reinsert everything some other way?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Marco Reis
>>>>>>>>>> Software Architect
>>>>>>>>>> http://marcoreis.net
>>>>>>>>>> https://github.com/masreis
>>>>>>>>>> +55 61 9 81194620