[ https://issues.apache.org/jira/browse/SOLR-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998005#comment-13998005 ]
Shalin Shekhar Mangar commented on SOLR-5681:
---------------------------------------------

Some comments on the latest patch:
# The new createCollection method in CollectionAdminRequest is not required. In fact we should clean up the existing methods which hard-code the "implicit" router. I opened SOLR-6073 for it.
# There are some unrelated changes in CollectionHandler.handleRequestStatus().
# The added synchronisation in CoreAdminHandler.addTask is required. In fact, the missing synchronisation is a bug from the async work you did earlier and it should be fixed in trunk/branch_4x asap. We're probably too late for it to make it into 4.8.1 but we should still try for it.
# The DistributedMap.size() method needlessly fetches all children. It can be implemented more efficiently using:
{code}
Stat stat = new Stat();
zookeeper.getData(dir, null, stat, true);
stat.getNumChildren();
{code}
# The 'excludeList' param in DistributedQueue.peekTopN should be named 'excludeSet'.
# DistributedQueue.peekTopN has the following code. It checks topN.isEmpty but it should actually check orderedChildren.isEmpty instead. Otherwise the method will return null even if children were found in the second pass after waiting.
{code}
if (waitedEnough) {
  if (topN.isEmpty()) return null;
}
{code}
# DistributedQueue.peekTopN has the following. Here the counter should be incremented only after topN.add(queueEvent), otherwise it either returns fewer nodes than are requested and available, or it waits longer than required. For example, suppose the children are (1,2,3,4,5) and n=2: with excludeList=(1,2) an extra await is invoked, and with excludeList=(1,3) only 2 is returned. In fact I think we should remove counter and just use topN.size() in the if condition (a small standalone sketch of that selection rule appears after the quoted issue description below). Also, is there any chance that headNode may be null?
{code}
for (String headNode : orderedChildren.values()) {
  if (headNode != null && counter++ < n) {
    try {
      String id = dir + "/" + headNode;
      if (excludeList != null && excludeList.contains(id)) continue;
      QueueEvent queueEvent = new QueueEvent(id,
          zookeeper.getData(dir + "/" + headNode, null, null, true), null);
      topN.add(queueEvent);
    } catch (KeeperException.NoNodeException e) {
      // Another client removed the node first, try next
    }
  } else {
    if (topN.size() >= 1) {
      return topN;
    }
  }
}
if (topN.size() >= 1) {
  return topN;
} else {
  childWatcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
  waitedEnough = wait != Long.MAX_VALUE;
  continue;
}
{code}
# The DistributedQueue.peekTopN method catches and swallows the InterruptedException. We should just declare that it throws InterruptedException and let the caller deal with it.
# Remove the e.printStackTrace() calls in DistributedQueue.getLastElementId().
# Do not swallow InterruptedException in DistributedQueue.getLastElementId().
# overseerCollectionProcessor.shutdown(); in Overseer.close() is not required because that is done by ccThread.close() already.
# There are formatting errors in the success, error, time and storeFailureDetails methods in Overseer.Stats.
# If maxParallelThreads is supposed to be a constant then it should be renamed accordingly as MAX_PARALLEL_THREADS.
# The maxParallelThreads=10 is not actually used while creating the ThreadPoolExecutor. Instead it is initialised with 5-100 threads!
# Use this.processedZKTasks = Collections.synchronizedSet(new HashSet<String>()); to remove the unchecked cast warning in the OCP constructor.
# Instead of passing a shardHandler to the OCP constructor, why not just pass a shardHandlerFactory?
# Remove the e.printStackTrace() calls in catch clauses in OCP.run().
# Do not swallow InterruptedException in OCP.run().
# In OCP.cleanUpWorkQueue, the synchronization on a ConcurrentHashMap is not required.
# What is the reason behind cleaning the work queue twice and sleeping for 20ms in this code:
{code}
cleanUpWorkQueue();
while (runningTasks.size() > maxParallelThreads) {
  Thread.sleep(20);
}
cleanUpWorkQueue();
{code}
# There are unrelated changes in OCP.prioritizeOverseerNodes.
# There are formatting problems in the run(), checkExclusivity and cleanUpWorkQueue methods in OCP.
# We should check for asyncId != null in if (completedMap.contains(asyncId) || failureMap.contains(asyncId)) to avoid two unnecessary calls to ZK.
# The KeeperException.NodeExistsException thrown from markTaskAsRunning is ignored. Why would that happen? If it happens, why is it okay to ignore it? Shouldn't we fail loudly or log a warning?

> Make the OverseerCollectionProcessor multi-threaded
> ---------------------------------------------------
>
>                 Key: SOLR-5681
>                 URL: https://issues.apache.org/jira/browse/SOLR-5681
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>            Reporter: Anshum Gupta
>            Assignee: Anshum Gupta
>         Attachments: SOLR-5681-2.patch, SOLR-5681-2.patch, SOLR-5681.patch,
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch,
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch,
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch,
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch,
> SOLR-5681.patch, SOLR-5681.patch
>
>
> Right now, the OverseerCollectionProcessor is single-threaded, i.e. submitting anything long-running would have it block the processing of other, mutually exclusive tasks.
> When OCP tasks become optionally async (SOLR-5477), it'd be good to have truly non-blocking behavior by multi-threading the OCP itself.
> For example, a ShardSplit call on Collection1 would block the thread and thereby not process a create-collection task (which would stay queued in zk), even though both tasks are mutually exclusive.
> Here are a few of the challenges:
> * Mutual exclusivity: Only let mutually exclusive tasks run in parallel. An easy way to handle that is to only let 1 task per collection run at a time.
> * ZK Distributed Queue to feed tasks: The OCP consumes tasks from a queue. The task is only removed from the workQueue on completion so that, in case of a failure, the new Overseer can re-consume the same task and retry. A queue is not the right data structure in the first place to look ahead, i.e. get the 2nd task from the queue while the 1st one is in process. Also, deleting tasks which are not at the head of a queue is not really an 'intuitive' thing.
> Proposed solutions for task management:
> * Task funnel and peekAfter(): The parent thread is responsible for getting the request and passing it to a new thread (or one from the pool). The parent method uses a peekAfter(last element) instead of a peek(). The peekAfter returns the task after the 'last element'. Maintain this request information and use it for deleting/cleaning up the workQueue.
> * Another (almost duplicate) queue: While offering tasks to the workQueue, also offer them to a new queue (call it volatileWorkQueue?). The difference is, as soon as a task from this queue is picked up for processing by a thread, it is removed from the queue. At the end, the cleanup is done from the workQueue.
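To make the counter vs. topN.size() point concrete, here is a small, self-contained Java sketch of the selection rule suggested in point 7 above, using plain collections instead of ZooKeeper and hypothetical names (PeekTopNSketch, peekTopN): only entries that are actually accepted count towards n, so excluded ids never consume a slot. This is an illustration of the idea, not the DistributedQueue implementation.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical, self-contained illustration of the rule argued for in point 7:
// only accepted entries count towards n, so excluded ids never consume a slot.
// This is not the actual DistributedQueue code.
public class PeekTopNSketch {

  static List<String> peekTopN(SortedMap<Long, String> orderedChildren,
                               Set<String> excludeSet, int n) {
    List<String> topN = new ArrayList<>();
    for (String child : orderedChildren.values()) {
      if (topN.size() >= n) break;                                     // stop once n entries are collected
      if (excludeSet != null && excludeSet.contains(child)) continue;  // excluded ids do not use a slot
      topN.add(child);
    }
    return topN;
  }

  public static void main(String[] args) {
    SortedMap<Long, String> children = new TreeMap<>();
    for (long i = 1; i <= 5; i++) {
      children.put(i, String.valueOf(i));
    }
    // Reviewer's example: children (1,2,3,4,5), n=2.
    System.out.println(peekTopN(children, new HashSet<>(Arrays.asList("1", "3")), 2)); // [2, 4]
    System.out.println(peekTopN(children, new HashSet<>(Arrays.asList("1", "2")), 2)); // [3, 4]
  }
}
{code}
With the example from the comment, the main method prints [2, 4] for exclude set (1,3) and [3, 4] for exclude set (1,2), i.e. no available entries are dropped and no extra wait is triggered.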