Re: [PR] feat: preliminary support for distributed task scheduler [incubator-hugegraph]

via GitHub Sun, 15 Oct 2023 05:29:03 -0700


VGalaxies commented on code in PR #2319:
URL: https://github.com/apache/incubator-hugegraph/pull/2319#discussion_r1359868973



##########
hugegraph-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/DistributedTaskScheduler.java:
##########
@@ -0,0 +1,666 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership. The ASF
+ * licenses this file to You under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hugegraph.task;
+
+import java.util.Collection;
+import java.util.Iterator;
+import java.util.concurrent.Callable;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.ScheduledFuture;
+import java.util.concurrent.ScheduledThreadPoolExecutor;
+import java.util.concurrent.ThreadPoolExecutor;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.TimeoutException;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.locks.Lock;
+
+import org.apache.hugegraph.HugeException;
+import org.apache.hugegraph.HugeGraph;
+import org.apache.hugegraph.HugeGraphParams;
+import org.apache.hugegraph.backend.id.Id;
+import org.apache.hugegraph.backend.page.PageInfo;
+import org.apache.hugegraph.backend.query.QueryResults;
+import org.apache.hugegraph.concurrent.LockGroup;
+import org.apache.hugegraph.concurrent.LockManager;
+import org.apache.hugegraph.config.CoreOptions;
+import org.apache.hugegraph.exception.ConnectionException;
+import org.apache.hugegraph.exception.NotFoundException;
+import org.apache.hugegraph.meta.MetaManager;
+import org.apache.hugegraph.meta.lock.LockResult;
+import org.apache.hugegraph.structure.HugeVertex;
+import org.apache.hugegraph.util.E;
+import org.apache.hugegraph.util.LockUtil;
+import org.apache.hugegraph.util.Log;
+import org.apache.tinkerpop.gremlin.structure.Vertex;
+import org.slf4j.Logger;
+
+public class DistributedTaskScheduler extends TaskAndResultScheduler {
+    protected static final int SCHEDULE_PERIOD = 10;
+    private static final Logger LOG = Log.logger(DistributedTaskScheduler.class);
+    private final ExecutorService taskDbExecutor;
+    private final ExecutorService schemaTaskExecutor;
+    private final ExecutorService olapTaskExecutor;
+    private final ExecutorService ephemeralTaskExecutor;
+    private final ExecutorService gremlinTaskExecutor;
+    private final ScheduledThreadPoolExecutor schedulerExecutor;
+    private final ScheduledFuture<?> cronFuture;
+
+    private final String lockGroupName;
+
+    /**
+     * the status of scheduler
+     */
+    private final AtomicBoolean closed = new AtomicBoolean(true);
+
+    private final ConcurrentHashMap<Id, HugeTask<?>> runningTasks = new ConcurrentHashMap<>();
+
+    public DistributedTaskScheduler(HugeGraphParams graph,
+                                    ScheduledThreadPoolExecutor schedulerExecutor,
+                                    ExecutorService taskDbExecutor,
+                                    ExecutorService schemaTaskExecutor,
+                                    ExecutorService olapTaskExecutor,
+                                    ExecutorService gremlinTaskExecutor,
+                                    ExecutorService ephemeralTaskExecutor,
+                                    ExecutorService serverInfoDbExecutor) {
+        super(graph, serverInfoDbExecutor);
+
+        this.taskDbExecutor = taskDbExecutor;
+        this.schemaTaskExecutor = schemaTaskExecutor;
+        this.olapTaskExecutor = olapTaskExecutor;
+        this.gremlinTaskExecutor = gremlinTaskExecutor;
+        this.ephemeralTaskExecutor = ephemeralTaskExecutor;
+
+        this.schedulerExecutor = schedulerExecutor;
+
+        lockGroupName = String.format("%s_%s_distributed", graphSpace, graph);
+        LockManager.instance().create(lockGroupName);
+
+        this.closed.set(false);
+
+        this.cronFuture = this.schedulerExecutor.scheduleWithFixedDelay(
+            () -> {
+                // TODO: uncomment later - graph space
+                // LockUtil.lock(this.graph().spaceGraphName(), LockUtil.GRAPH_LOCK);
+                LockUtil.lock("", LockUtil.GRAPH_LOCK);
+                try {
+                    // TODO: query tasks with super-administrator privileges
+                    // TaskManager.useAdmin();
+                    this.cronSchedule();
+                } catch (Throwable t) {
+                    LOG.info("cronScheduler exception ", t);
+                } finally {
+                    // TODO: uncomment later - graph space
+                    LockUtil.unlock("", LockUtil.GRAPH_LOCK);
+                    // LockUtil.unlock(this.graph().spaceGraphName(), LockUtil.GRAPH_LOCK);
+                }
+            },
+            10L, SCHEDULE_PERIOD,
+            TimeUnit.SECONDS);
+    }
+
+    private static boolean sleep(long ms) {
+        try {
+            Thread.sleep(ms);
+            return true;
+        } catch (InterruptedException ignored) {
+            // Ignore InterruptedException
+            return false;
+        }
+    }
+
+    public void cronSchedule() {

Review Comment:
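   > I have an idea here that we could discuss. This code appears to query all tasks at a fixed interval, filter out the tasks that are not locked, and then drive the state of those unlocked tasks. If I understand correctly, every node queries all tasks once and the filtering happens in the upper layer. It looks like this logic could be pushed down to the store: each batch the store layer returns to a server node would be locked first and then returned to the server, and tasks that are already locked would be filtered out at the store layer.
   
   My understanding is that filtering out the tasks that are not locked (that is, tasks no node is executing) serves three main purposes:
   
   1. Move tasks that are in the RUNNING state but are not being executed by any node to FAILED
   2. Move tasks that are in the CANCELLING state but are not being executed by any node to CANCELLED
   3. Delete tasks that are in the DELETING state and are not being executed by any node
   
   If the store layer filtered out tasks that are already locked, the server layer could no longer process those tasks as needed (for example, initializing environment variables for tasks in the RUNNING state, which should be locked at that point).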
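
The three transitions listed above can be sketched in Java roughly as follows. This is only an illustration, not the PR's code: the class `OrphanTaskSketch`, the enums `Status` and `Action`, and the method `decide` are all hypothetical names, and whether a node holds the task lock is reduced to a plain boolean.

```java
public class OrphanTaskSketch {

    // Hypothetical subset of the task states named in the comment above
    enum Status { RUNNING, CANCELLING, DELETING, FAILED, CANCELLED }

    // Hypothetical outcome of one scheduler pass over a single task
    enum Action { MARK_FAILED, MARK_CANCELLED, DELETE, KEEP }

    // Decide what one scheduler pass does with a task.
    // lockedByNode is an assumption standing in for "some node holds the task lock".
    static Action decide(Status status, boolean lockedByNode) {
        if (lockedByNode) {
            // A node is still executing the task; the server still sees it
            // (so it can do other per-task work) but does not change its state.
            return Action.KEEP;
        }
        switch (status) {
            case RUNNING:
                return Action.MARK_FAILED;     // purpose 1
            case CANCELLING:
                return Action.MARK_CANCELLED;  // purpose 2
            case DELETING:
                return Action.DELETE;          // purpose 3
            default:
                return Action.KEEP;            // terminal states are left alone
        }
    }

    public static void main(String[] args) {
        for (Status s : Status.values()) {
            System.out.println(s + " unlocked -> " + decide(s, false));
            System.out.println(s + " locked   -> " + decide(s, true));
        }
    }
}
```

In this sketch the locked tasks still reach `decide`, which mirrors the point of the reply: if the store dropped locked tasks before returning a batch, the server would never see them at all.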


