[ https://issues.apache.org/jira/browse/SOLR-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089405#comment-16089405 ]

Shalin Shekhar Mangar commented on SOLR-10397:
----------------------------------------------

Thanks Dat.

bq. OverseerFailover is not guaranteed ( we should tackle this problem in 
another issue )

I've opened SOLR-11085 for improving resiliency of actions against overseer 
failures.

bq. AutoAddReplicas is triggered by NodeLost event, so when we switch 
autoAddReplicas from off to on nothing happen. I think this is ok.

I'm inclined to remove this quirk of how autoAddReplicas used to work; I don't 
think we need to keep supporting it. Please make sure that both the deprecation 
of the cluster property and this behavior change are documented in CHANGES.txt 
under the upgrade notes section.

A few things I noticed in the patch:
# Typo in AutoAddReplicasIntergrationTest (intergration instead of integration)
# same as above in HdfsAutoAddReplicasIntergrationTest
# There is a large block of code commented out in 
SharedFSAutoReplicaFailoverTest. Please remove it if it is no longer needed.
# TestPolicy.testMoveReplicasInMultipleCollections does not seem very useful as 
it stands: all it verifies is that some operation is returned. It should assert 
that only the hinted collections' replicas are moved, and that no operation is 
returned when the node that went down hosts no replicas belonging to those 
collections.
# Minor nit -- {{autoAddReplicas != null && autoAddReplicas.equals("false")}} 
can be simplified to {{!Boolean.parseBoolean(autoAddReplicas)}} (see the sketch 
below this list)
# Typo "Waitting" in the comments on the waitForState calls in 
AutoAddReplicasIntergrationTest.testSimple
# The return value of {{waitForAllActiveAndLiveReplicas}} in the tests should 
be asserted to be true; otherwise, even after a timeout, the test silently 
proceeds and passes (see the sketch below this list).
# I am seeing some thread leak failures in HdfsAutoAddReplicasIntergrationTest 
(see the note after the trace):
{code}
NOTE: reproduce with: ant test  -Dtestcase=HdfsAutoAddReplicasIntergrationTest 
-Dtests.seed=EF1C283E3B67B9EE -Dtests.locale=mk-MK -Dtests.timezone=Etc/GMT-2 
-Dtests.asserts=true -Dtests.file.encoding=UTF-8

Test ignored.

com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie 
threads that couldn't be terminated:
   1) Thread[id=685, name=ForkJoinPool.commonPool-worker-0, 
state=TIMED_WAITING, group=TGRP-HdfsAutoAddReplicasIntergrationTest]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1824)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1693)
        at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
   2) Thread[id=686, name=ForkJoinPool.commonPool-worker-7, state=WAITING, 
group=TGRP-HdfsAutoAddReplicasIntergrationTest]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1824)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1693)
        at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
   3) Thread[id=687, name=ForkJoinPool.commonPool-worker-1, state=WAITING, 
group=TGRP-HdfsAutoAddReplicasIntergrationTest]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1824)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1693)
        at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

        at __randomizedtesting.SeedInfo.seed([EF1C283E3B67B9EE]:0)
{code}
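
On the thread leaks above: the ForkJoinPool.commonPool workers are JVM-wide and never terminate on their own, so the leak checker will flag them whenever something in the test (or in the patch, e.g. a parallel stream or a default-executor CompletableFuture) touches the common pool. It's worth checking first whether the patch really needs the common pool; if the threads turn out to be unrelated to the test itself, one option is a test-level thread filter along the lines of what the HDFS tests already do with {{BadHdfsThreadsFilter}}. A rough sketch (the filter class below is just an illustration, not something in the patch):
{code}
import com.carrotsearch.randomizedtesting.ThreadFilter;

// Sketch only: exclude the JVM-wide common pool workers from the leak check.
public class ForkJoinCommonPoolThreadsFilter implements ThreadFilter {
  @Override
  public boolean reject(Thread t) {
    return t.getName().startsWith("ForkJoinPool.commonPool-worker-");
  }
}

// Then on the test class, e.g.:
// @ThreadLeakFilters(defaultFilters = true,
//     filters = {BadHdfsThreadsFilter.class, ForkJoinCommonPoolThreadsFilter.class})
{code}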
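
For points 5 and 7, a rough sketch of what I mean (the property lookup, variable names, and the {{waitForAllActiveAndLiveReplicas}} arguments are just placeholders, not the actual code in the patch):
{code}
// Point 5: the explicit null check plus equals("false") ...
String autoAddReplicas = collection.getStr("autoAddReplicas");   // placeholder lookup
boolean disabledExplicit = autoAddReplicas != null && autoAddReplicas.equals("false");

// ... can be collapsed into one call. Boolean.parseBoolean returns true only for
// the (case-insensitive) string "true", so the negated form is also true when the
// property is missing (null).
boolean disabled = !Boolean.parseBoolean(autoAddReplicas);

// Point 7: assert the result of the wait so a timeout fails the test instead of
// letting it silently continue and pass.
assertTrue("Timed out waiting for all replicas to become active and live",
    waitForAllActiveAndLiveReplicas(collectionName, timeoutSeconds));
{code}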

> Port 'autoAddReplicas' feature to the policy rules framework and make it work 
> with non-shared filesystems
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-10397
>                 URL: https://issues.apache.org/jira/browse/SOLR-10397
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Cao Manh Dat
>              Labels: autoscaling
>             Fix For: 7.0
>
>         Attachments: SOLR-10397.1.patch, SOLR-10397.2.patch, SOLR-10397.patch
>
>
> Currently 'autoAddReplicas=true' can be specified in the Collection Create 
> API to automatically add replicas when a replica becomes unavailable. I 
> propose to move this feature to the autoscaling cluster policy rules design.
> This will include the following:
> * Trigger support for ‘nodeLost’ event type
> * Modification of existing implementation of ‘autoAddReplicas’ to 
> automatically create the appropriate ‘nodeLost’ trigger.
> * Any such auto-created trigger must be marked internally such that setting 
> ‘autoAddReplicas=false’ via the Modify Collection API should delete or 
> disable corresponding trigger.
> * Support for non-HDFS filesystems while retaining the optimization afforded 
> by HDFS i.e. the replaced replica can point to the existing data dir of the 
> old replica.
> * Deprecate/remove the feature of enabling/disabling ‘autoAddReplicas’ across 
> the entire cluster using cluster properties in favor of using the 
> suspend-trigger/resume-trigger APIs.
> This will retain backward compatibility for the most part and keep a common 
> use-case easy to enable as well as make it available to more people (i.e. 
> people who don't use HDFS).
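
For the last point in the description, a rough sketch of what suspending the auto-created trigger could look like from SolrJ (the trigger name, the payload shape, and the v2 path are assumptions here; the final autoscaling API may differ):
{code}
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.request.V2Request;

public class SuspendAutoAddReplicasSketch {
  // Sketch only: suspend the nodeLost trigger that autoAddReplicas=true would
  // auto-create, instead of flipping the old cluster-wide property.
  static void suspendTrigger(SolrClient solrClient) throws Exception {
    new V2Request.Builder("/cluster/autoscaling")
        .withMethod(SolrRequest.METHOD.POST)
        .withPayload("{ \"suspend-trigger\": { \"name\": \".auto_add_replicas\" } }")
        .build()
        .process(solrClient);   // resume-trigger would work the same way
  }
}
{code}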


