[ https://issues.apache.org/jira/browse/SOLR-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998005#comment-13998005 ]

Shalin Shekhar Mangar commented on SOLR-5681:
---------------------------------------------

Some comments on the latest patch:

# The new createCollection method in CollectionAdminRequest is not required. In 
fact, we should clean up the existing methods which hard-code the "implicit" 
router. I opened SOLR-6073 for that.
# There are some unrelated changes in CollectionHandler.handleRequestStatus()
# The added synchronisation in CoreAdminHandler.addTask is required. In fact, 
the missing synchronisation is a bug in the async work you did earlier and it 
should be fixed in trunk/branch_4x ASAP. We're probably too late for it to make 
it into 4.8.1, but we should still try.
# The DistributedMap.size() method needlessly fetches all children. It can be 
implemented more efficiently using the parent node's Stat, e.g.:
{code}
Stat stat = new Stat();
zookeeper.getData(dir, null, stat, true);
return stat.getNumChildren();
{code}
# The 'excludeList' param in DistributedQueue.peekTopN should be named 
'excludeSet'.
# DistributedQueue.peekTopN has the following code. It checks topN.isEmpty but 
it should actually check orderedChildren.isEmpty instead; otherwise the method 
will return null even if children were found in the second pass after waiting 
(a corrected sketch follows the snippet).
{code}
if (waitedEnough) {
  if (topN.isEmpty()) return null;
}
{code}
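A minimal sketch of the corrected check (orderedChildren being the child map fetched on the current pass, as in the patch):
{code}
if (waitedEnough) {
  // topN has not been populated yet at this point, so test whether the
  // second pass actually found any children before giving up.
  if (orderedChildren.isEmpty()) return null;
}
{code}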
# DistributedQueue.peekTopN also has the following. Here the counter should be 
incremented only after topN.add(queueEvent); otherwise it either returns fewer 
nodes than requested and available, or it waits longer than necessary. For 
example, suppose the children are (1,2,3,4,5), n=2 and excludeList=(1,2): then 
an extra await is invoked; whereas if excludeList=(1,3), only 2 is returned. In 
fact, I think we should remove counter and just use topN.size() in the if 
condition (see the sketch after the snippet). Also, is there any chance that 
headNode may be null?
{code}
for (String headNode : orderedChildren.values()) {
  if (headNode != null && counter++ < n) {
    try {
      String id = dir + "/" + headNode;
      if (excludeList != null && excludeList.contains(id)) continue;
      QueueEvent queueEvent = new QueueEvent(id,
          zookeeper.getData(dir + "/" + headNode, null, null, true), null);
      topN.add(queueEvent);
    } catch (KeeperException.NoNodeException e) {
      // Another client removed the node first, try next
    }
  } else {
    if (topN.size() >= 1) {
      return topN;
    }
  }
}
if (topN.size() >= 1) {
  return topN;
} else {
  childWatcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
  waitedEnough = wait != Long.MAX_VALUE;
  continue;
}
{code}
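To make the suggestion concrete, here is a rough sketch of the loop without the counter, relying on topN.size() and only counting successful adds (identifiers such as orderedChildren, excludeSet, QueueEvent, childWatcher and dir are taken from the patch; the exact surrounding code is assumed, and this body still sits inside the existing retry loop):
{code}
for (String headNode : orderedChildren.values()) {
  if (headNode == null) continue;   // defensive; unclear whether this can actually happen
  if (topN.size() >= n) break;      // stop as soon as n events have been collected
  try {
    String id = dir + "/" + headNode;
    if (excludeSet != null && excludeSet.contains(id)) continue;  // skipping an excluded id must not "use up" a slot
    topN.add(new QueueEvent(id, zookeeper.getData(id, null, null, true), null));
  } catch (KeeperException.NoNodeException e) {
    // Another client removed the node first, try the next child
  }
}
if (!topN.isEmpty()) {
  return topN;
}
childWatcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
waitedEnough = wait != Long.MAX_VALUE;
{code}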
# The DistributedQueue.peekTopN method catches and swallows the 
InterruptedException. We should just declare that it throws 
InterruptedException and let the caller deal with it.
# Remove the e.printStackTrace() calls in DistributedQueue.getLastElementId()
# Do not swallow InterruptedException in DistributedQueue.getLastElementId()
# The overseerCollectionProcessor.shutdown() call in Overseer.close() is not 
required because ccThread.close() already does that.
# There are formatting errors in the success, error, time and storeFailureDetails 
methods in Overseer.Stats.
# If maxParallelThreads is supposed to be a constant then it should be renamed 
accordingly as MAX_PARALLEL_THREADS.
# maxParallelThreads=10 is not actually used when creating the 
ThreadPoolExecutor; instead, the pool is initialised with 5-100 threads! See the 
sketch below.
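If the intent is for the constant to actually bound parallelism, the construction could look roughly like this (only an illustration under that assumption; the thread-factory name and queue choice are mine, not the patch's):
{code}
private static final int MAX_PARALLEL_THREADS = 10;

// Fixed-size pool capped at the constant; extra tasks queue up instead of
// the pool growing to 100 threads.
private final ExecutorService tpe = new ThreadPoolExecutor(
    MAX_PARALLEL_THREADS, MAX_PARALLEL_THREADS,
    0L, TimeUnit.MILLISECONDS,
    new LinkedBlockingQueue<Runnable>(),
    new DefaultSolrThreadFactory("overseerCollectionProcessor"));
{code}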
# Use this.processedZKTasks = Collections.synchronizedSet(new 
HashSet<String>()); to remove the unchecked cast warning in OCP constructor.
# Instead of passing a shardHandler to OCP constructor, why not just pass a 
shardHandlerFactory?
# Remove the e.printStackTrace in catch clauses in OCP.run()
# Do not swallow InterruptedException in OCP.run()
# In OCP.cleanUpWorkQueue, the synchronization on a ConcurrentHashMap is not 
required.
# What is the reason behind cleaning the work queue twice and sleeping for 20ms 
in this code:
{code}
cleanUpWorkQueue();

while (runningTasks.size() > maxParallelThreads) {
  Thread.sleep(20);
}

cleanUpWorkQueue();
{code}
# There are unrelated changes in OCP.prioritizeOverseerNodes
# There are formatting problems in the run(), checkExclusivity and 
cleanUpWorkQueue methods in OCP.
# We should check for asyncId != null in the if (completedMap.contains(asyncId) 
|| failureMap.contains(asyncId)) condition to avoid two unnecessary calls to ZK 
(see the sketch below).
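For example, something along these lines, which only touches the two ZK-backed maps when an async id was actually supplied (a sketch; the variable name and surrounding handling are assumed):
{code}
boolean alreadyTracked = asyncId != null
    && (completedMap.contains(asyncId) || failureMap.contains(asyncId));
{code}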
# The KeeperException.NodeExistsException thrown from markTaskAsRunning is 
ignored. Why would that happen? If it happens, why is it okay to ignore it? 
Shouldn't we fail loudly or at least log a warning?

> Make the OverseerCollectionProcessor multi-threaded
> ---------------------------------------------------
>
>                 Key: SOLR-5681
>                 URL: https://issues.apache.org/jira/browse/SOLR-5681
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>            Reporter: Anshum Gupta
>            Assignee: Anshum Gupta
>         Attachments: SOLR-5681-2.patch, SOLR-5681-2.patch, SOLR-5681.patch, 
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, 
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, 
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, 
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, 
> SOLR-5681.patch, SOLR-5681.patch
>
>
> Right now, the OverseerCollectionProcessor is single-threaded, i.e. submitting 
> anything long-running would block the processing of other mutually exclusive 
> tasks.
> When OCP tasks become optionally async (SOLR-5477), it'd be good to have 
> truly non-blocking behavior by multi-threading the OCP itself.
> For example, a ShardSplit call on Collection1 would block the thread and 
> thereby not process a create-collection task (which would stay queued in 
> zk), even though both tasks are mutually exclusive.
> Here are a few of the challenges:
> * Mutual exclusivity: Only let mutually exclusive tasks run in parallel. An 
> easy way to handle that is to only let 1 task per collection run at a time.
> * ZK Distributed Queue to feed tasks: The OCP consumes tasks from a queue. 
> The task from the workQueue is only removed on completion so that in case of 
> a failure, the new Overseer can re-consume the same task and retry. A queue 
> is not the right data structure in the first place to look ahead i.e. get the 
> 2nd task from the queue when the 1st one is in process. Also, deleting tasks 
> which are not at the head of a queue is not really an 'intuitive' thing.
> Proposed solutions for task management:
> * Task funnel and peekAfter(): The parent thread is responsible for getting 
> and passing the request to a new thread (or one from the pool). The parent 
> method uses a peekAfter(last element) instead of a peek(). The peekAfter 
> returns the task after the 'last element'. Maintain this request information 
> and use it for deleting/cleaning up the workQueue.
> * Another (almost duplicate) queue: While offering tasks to workQueue, also 
> offer them to a new queue (call it volatileWorkQueue?). The difference is, as 
> soon as a task from this is picked up for processing by the thread, it's 
> removed from the queue. At the end, the cleanup is done from the workQueue.


