vishal patel wrote on 10.7.2020 at 12.45:
> Thanks for your input.
>
> Walter already said that setting soft commit max time to 100 ms is a recipe
> for disaster
>>> I know that, but our application is already developed and has been running
>>> in the live environment for the last 5 years. We want to show data very
>>> quickly after an insert.
>
> you have huge JVM heaps without an explanation for the reason
>>> We gave 55 GB of RAM because of our usage: large query searches and very
>>> frequent searching and indexing.
> Here is my memory snapshot which I have taken from GC.
Yes, I can see that a lot of memory is in use, but the question is why. I
assume caches (are they too large?), perhaps uninverted indexes. DocValues
would help with the latter. Do you use them?

> I have tried a Solr upgrade from 6.1.0 to 8.5.1, but due to some issue we
> cannot do it. I have also asked here:
> https://lucene.472066.n3.nabble.com/Sorting-in-other-collection-in-Solr-8-5-1-td4459506.html#a4459562

You could also try upgrading to the latest version in the 6.x series as a
starter.

> Why can we not find the reason for the recovery in the log? Like a memory or
> CPU issue, frequent indexing or searching, or a large query hit?
> My log at the time of recovery:
> https://drive.google.com/file/d/1F8Bn7jSXspe2HRelh_vJjKy9DsTRl9h0/view

Isn't it right there on the first lines?

2020-07-09 14:42:43.943 ERROR (updateExecutor-2-thread-21007-processing-http:////11.200.212.305:8983//solr//products x:products r:core_node1 n:11.200.212.306:8983_solr s:shard1 c:products) [c:products s:shard1 r:core_node1 x:products] o.a.s.u.StreamingSolrClients error
org.apache.http.NoHttpResponseException: 11.200.212.305:8983 failed to respond

followed by a couple more error messages about the same problem and then
initiation of recovery:

2020-07-09 14:42:44.002 INFO (qtp1239731077-771611) [c:products s:shard1 r:core_node1 x:products] o.a.s.c.ZkController Put replica core=products coreNodeName=core_node3 on 11.200.212.305:8983_solr into leader-initiated recovery.

So the node in question isn't responding quickly enough to HTTP requests and
gets put into recovery. The log for the recovering node starts too late, so I
can't say anything about what happened before 14:42:43.943 that led to
recovery.
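On the DocValues point above: docValues are enabled per field in the schema, and for fields used in sorting, faceting, or grouping they keep that data in memory-mapped files on disk instead of building the uninverted FieldCache on the JVM heap. A minimal sketch for a Solr 6.x schema; the field names below are illustrative examples, not fields from this thread, and changing docValues on an existing field requires reindexing:

```xml
<!-- schema.xml: docValues="true" stores column-oriented values on disk,
     so sorting/faceting on these fields no longer uninverts the index
     onto the JVM heap. Field names here are hypothetical. -->
<field name="created_date" type="tlong"  indexed="true" stored="true"  docValues="true"/>
<field name="category"     type="string" indexed="true" stored="false" docValues="true"/>
```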
--Ere

> ________________________________
> From: Ere Maijala <ere.maij...@helsinki.fi>
> Sent: Friday, July 10, 2020 2:10 PM
> To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
> Subject: Re: Replica goes into recovery mode in Solr 6.1.0
>
> Walter already said that setting soft commit max time to 100 ms is a
> recipe for disaster. That alone can be the issue, but if you're not
> willing to try higher values, there's no way of being sure. And you have
> huge JVM heaps without an explanation for the reason. If those do not
> cause problems, you indicated that you also run some other software on
> the same server. Is it possible that the other processes hog CPU, disk
> or network and starve Solr?
>
> I must add that Solr 6.1.0 is over four years old. You could be hitting
> a bug that has been fixed for years, but even if you encounter an issue
> that's still present, you will need to upgrade to get it fixed. If you
> look at the number of fixes done in subsequent 6.x versions alone in the
> changelog (https://lucene.apache.org/solr/8_5_1/changes/Changes.html)
> you'll see that there are a lot of them. You could be hitting something
> like SOLR-10420, which has been fixed for over three years.
>
> Best,
> Ere
>
> vishal patel wrote on 10.7.2020 at 7.52:
>> I've been running Solr for a dozen years and I've never needed a heap
>> larger than 8 GB.
>>>> What is your data size? The same as ours, 1 TB? Do you search or index
>>>> frequently? An NRT model?
>>
>> My question is why the replica is going into recovery. When the replica
>> went down, I checked the GC log, but no GC pause was longer than 2 seconds.
>> Also, I cannot find any reason for the recovery in the Solr log file. I
>> want to know why the replica goes into recovery.
>>
>> Regards,
>> Vishal Patel
>> ________________________________
>> From: Walter Underwood <wun...@wunderwood.org>
>> Sent: Friday, July 10, 2020 3:03 AM
>> To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
>> Subject: Re: Replica goes into recovery mode in Solr 6.1.0
>>
>> Those are extremely large JVMs. Unless you have proven that you MUST
>> have 55 GB of heap, use a smaller heap.
>>
>> I've been running Solr for a dozen years and I've never needed a heap
>> larger than 8 GB.
>>
>> Also, there is usually no need to use one JVM per replica.
>>
>> Your configuration is using 110 GB (two JVMs) just for Java
>> where I would configure it with a single 8 GB JVM. That would
>> free up 100 GB for file caches.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Jul 8, 2020, at 10:10 PM, vishal patel <vishalpatel200...@outlook.com>
>>> wrote:
>>>
>>> Thanks for the reply.
>>>
>>> What do you mean by "Shard1 Allocated memory"?
>>>>> It means the JVM memory of one Solr node or instance.
>>>
>>> How many Solr JVMs are you running?
>>>>> In one server there are 2 Solr JVMs, of which one is a shard and the
>>>>> other is a replica.
>>>
>>> What is the heap size for your JVMs?
>>>>> 55 GB for one Solr JVM.
>>>
>>> Regards,
>>> Vishal Patel
>>>
>>> Sent from Outlook<http://aka.ms/weboutlook>
>>> ________________________________
>>> From: Walter Underwood <wun...@wunderwood.org>
>>> Sent: Wednesday, July 8, 2020 8:45 PM
>>> To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
>>> Subject: Re: Replica goes into recovery mode in Solr 6.1.0
>>>
>>> I don't understand what you mean by "Shard1 Allocated memory". I don't
>>> know of any way to dedicate system RAM to an application object like a
>>> replica.
>>>
>>> How many Solr JVMs are you running?
>>>
>>> What is the heap size for your JVMs?
>>>
>>> Setting soft commit max time to 100 ms does not magically make Solr
>>> super fast.
>>> It makes Solr do too much work, makes the work queues fill up, and makes
>>> it fail.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>> On Jul 7, 2020, at 10:55 PM, vishal patel <vishalpatel200...@outlook.com>
>>>> wrote:
>>>>
>>>> Thanks for your reply.
>>>>
>>>> One server has 320 GB of RAM in total. On it there are 2 Solr nodes: one
>>>> is shard1 and the second is the shard2 replica. Each Solr node has 55 GB
>>>> of memory allocated. shard1 has 585 GB of data and the shard2 replica
>>>> has 492 GB of data, which means almost 1 TB of data on this server. The
>>>> server also runs other applications, for which 60 GB of memory is
>>>> allocated. So 150 GB of memory is left.
>>>>
>>>> Properly formatted details:
>>>> https://drive.google.com/file/d/1K9JyvJ50Vele9pPJCiMwm25wV4A6x4eD/view
>>>>
>>>> Are you running multiple huge JVMs?
>>>>>> Not huge, but 60 GB of memory is allocated for our 11 applications.
>>>>>> 150 GB of memory is still free.
>>>>
>>>> The servers will be doing a LOT of disk IO, so look at the read and
>>>> write iops. I expect that the solr processes are blocked on disk reads
>>>> almost all the time.
>>>>>> Is there a chance of going into recovery mode if there is a lot of IO
>>>>>> read/write or blocking?
>>>>
>>>> "-Dsolr.autoSoftCommit.maxTime=100" is way too short (100 ms).
>>>>>> Our requirement is NRT, so we keep the time low.
>>>>
>>>> Regards,
>>>> Vishal Patel
>>>> ________________________________
>>>> From: Walter Underwood <wun...@wunderwood.org>
>>>> Sent: Tuesday, July 7, 2020 8:15 PM
>>>> To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
>>>> Subject: Re: Replica goes into recovery mode in Solr 6.1.0
>>>>
>>>> This isn't a support list, so nobody looks at issues. We do try to help.
>>>>
>>>> It looks like you have 1 TB of index on a system with 320 GB of RAM.
>>>> I don't know what "Shard1 Allocated memory" is, but maybe half of
>>>> that RAM is used by JVMs or some other process, I guess. Are you
>>>> running multiple huge JVMs?
>>>>
>>>> The servers will be doing a LOT of disk IO, so look at the read and
>>>> write iops. I expect that the solr processes are blocked on disk reads
>>>> almost all the time.
>>>>
>>>> "-Dsolr.autoSoftCommit.maxTime=100" is way too short (100 ms).
>>>> That is probably causing your outages.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>>
>>>>> On Jul 7, 2020, at 5:18 AM, vishal patel <vishalpatel200...@outlook.com>
>>>>> wrote:
>>>>>
>>>>> Is anyone looking at my issue? Please guide me.
>>>>>
>>>>> Regards,
>>>>> Vishal Patel
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: vishal patel <vishalpatel200...@outlook.com>
>>>>> Sent: Monday, July 6, 2020 7:11 PM
>>>>> To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
>>>>> Subject: Replica goes into recovery mode in Solr 6.1.0
>>>>>
>>>>> I am using Solr version 6.1.0 with Java 8 and G1GC in production. We
>>>>> have 2 shards and each shard has 1 replica. We have 3 collections.
>>>>> We do not use any caches and have also disabled them in solrconfig.xml.
>>>>> Search and update requests come in frequently on our live platform.
>>>>>
>>>>> * Our commit configuration in solrconfig.xml is below:
>>>>>
>>>>> <autoCommit>
>>>>>   <maxTime>600000</maxTime>
>>>>>   <maxDocs>20000</maxDocs>
>>>>>   <openSearcher>false</openSearcher>
>>>>> </autoCommit>
>>>>> <autoSoftCommit>
>>>>>   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>>>>> </autoSoftCommit>
>>>>>
>>>>> * We use Near Real Time searching, so we set the following in
>>>>>   solr.in.cmd:
>>>>>
>>>>> set SOLR_OPTS=%SOLR_OPTS% -Dsolr.autoSoftCommit.maxTime=100
>>>>>
>>>>> * Our collection details (number of documents / size in GB per core):
>>>>>
>>>>> Collection    Shard1             Shard1 Replica     Shard2             Shard2 Replica
>>>>> collection1   26913364 / 201     26913379 / 202     26913380 / 198     26913379 / 198
>>>>> collection2   13934360 / 310     13934367 / 310     13934368 / 219     13934367 / 219
>>>>> collection3   351539689 / 73.5   351540040 / 73.5   351540136 / 75.2   351539722 / 75.2
>>>>>
>>>>> * Our server configurations are below:
>>>>>
>>>>>                                            Server1             Server2
>>>>> CPU (both servers)                         Intel(R) Xeon(R) CPU E5-2650 v3
>>>>>                                            @ 2.30GHz, 2301 MHz, 10 Core(s),
>>>>>                                            20 Logical Processor(s)
>>>>> Hard disk                                  3845 GB (3.84 TB)   3485 GB (3.48 TB)
>>>>> Total memory (GB)                          320                 320
>>>>> Shard1 allocated memory (GB)               55                  -
>>>>> Shard2 Replica allocated memory (GB)       55                  -
>>>>> Shard2 allocated memory (GB)               -                   55
>>>>> Shard1 Replica allocated memory (GB)       -                   55
>>>>> Other applications' allocated memory (GB)  60                  22
>>>>> Number of other applications               11                  7
>>>>>
>>>>> Sometimes one of the replicas goes into recovery mode. Why does the
>>>>> replica go into recovery? Due to heavy searching, heavy updates/inserts,
>>>>> or long GC pause times? If it is one of them, what should we change in
>>>>> the configuration?
>>>>> Should we increase the number of shards to address the recovery issue?
>>>>>
>>>>> Regards,
>>>>> Vishal Patel
>>>>>
>>>>
>>>
>>
>>
>
> --
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland
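For reference, the soft commit interval that Walter and Ere flag in this thread lives in solrconfig.xml (overridable via the solr.autoSoftCommit.maxTime system property shown above). A sketch of a less aggressive near-real-time setup; the 10-second value is an illustrative starting point, not a value recommended anywhere in the thread itself:

```xml
<!-- solrconfig.xml: a less aggressive NRT setup. Hard commits flush the
     transaction log to stable storage without opening a new searcher;
     soft commits make new documents visible to search. Values are
     illustrative and would need tuning against real load. -->
<autoCommit>
  <maxTime>600000</maxTime>          <!-- hard commit every 10 minutes, as in the thread -->
  <openSearcher>false</openSearcher> <!-- do not reopen searchers on hard commit -->
</autoCommit>
<autoSoftCommit>
  <!-- 10000 ms instead of 100 ms: documents become searchable within
       ~10 seconds while reopening searchers two orders of magnitude
       less often -->
  <maxTime>${solr.autoSoftCommit.maxTime:10000}</maxTime>
</autoSoftCommit>
```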