[iotdb] branch rel/1.0 updated: add faq (#8586)

haonan Thu, 22 Dec 2022 03:04:39 -0800

This is an automated email from the ASF dual-hosted git repository.

haonan pushed a commit to branch rel/1.0
in repository https://gitbox.apache.org/repos/asf/iotdb.git



The following commit(s) were added to refs/heads/rel/1.0 by this push:
     new 93d0ccb7a4 add faq (#8586)
93d0ccb7a4 is described below

commit 93d0ccb7a43eb90e624462e1e3f0c54b8428fa29
Author: Beyyes <[email protected]>
AuthorDate: Thu Dec 22 19:04:28 2022 +0800

    add faq (#8586)
---
 docs/UserGuide/FAQ/FAQ-for-cluster-setup.md    | 99 ++++++++++++++++++++++++++
 docs/zh/UserGuide/FAQ/FAQ-for-cluster-setup.md | 99 ++++++++++++++++++++++++++
 2 files changed, 198 insertions(+)

diff --git a/docs/UserGuide/FAQ/FAQ-for-cluster-setup.md b/docs/UserGuide/FAQ/FAQ-for-cluster-setup.md
new file mode 100644
index 0000000000..b28c062c73
--- /dev/null
+++ b/docs/UserGuide/FAQ/FAQ-for-cluster-setup.md
@@ -0,0 +1,99 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+<!-- TOC -->
+
+# FAQ for Cluster Setup
+
+## 1. Cluster Startup and Shutdown
+
+### 1). ConfigNode fails to start for the first time. How do I find the cause?
+
+- Make sure that the `data/confignode` directory is empty when starting the ConfigNode for the first time.
+- Make sure that the <IP+Port> used by the ConfigNode is not occupied and does not conflict with the <IP+Port> of any other ConfigNode.
+- Make sure that `cn_target_confignode_list` is configured correctly and points to a live ConfigNode. If this is the first ConfigNode started in the cluster, `cn_target_confignode_list` must point to the node itself.
+- Make sure that the configuration (consensus protocol, replica number, etc.) of the ConfigNode being started is consistent with that of the ConfigNode cluster referenced by `cn_target_confignode_list`.
+
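The port check in the list above can be scripted before starting a node. The following is a minimal sketch, not part of IoTDB: the helper name `is_port_free` is hypothetical, and the host and port you pass must come from your own ConfigNode or DataNode configuration. It only tests whether a local TCP port can be bound; it does not detect conflicts with nodes on other machines.

```python
import socket


def is_port_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            # bind() raises OSError when another process already holds the port.
            probe.bind((host, port))
        except OSError:
            return False
    return True
```

Run the check on the machine that will host the node, once for every port the node is configured to use.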
+### 2). The ConfigNode started successfully, but why doesn't it appear in the results of `show cluster`?
+
+- Check whether `cn_target_confignode_list` points to the correct address. If `cn_target_confignode_list` points to the node itself, a new ConfigNode cluster is started instead.
+
+### 3). DataNode fails to start for the first time. How do I find the cause?
+
+- Make sure that the `data/datanode` directory is empty when starting the DataNode for the first time. If the startup result is “Reject DataNode restart.”, the `data/datanode` directory was probably not cleared.
+- Make sure that the <IP+Port> used by the DataNode is not occupied and does not conflict with the <IP+Port> of any other DataNode.
+- Make sure that `dn_target_confignode_list` points to a live ConfigNode.
+
+### 4). Removing a DataNode fails. How do I find the cause?
+
+- Check whether the parameter passed to remove-datanode.sh is correct. Only rpcIp:rpcPort or dataNodeId is accepted.
+- The removal can be executed only when the number of available DataNodes in the cluster is greater than max(schema_replication_factor, data_replication_factor).
+- Removing a DataNode migrates its data to other live DataNodes. Data migration is performed Region by Region; if any Region fails to migrate, the DataNode being removed stays in the `Removing` status.
+- If a DataNode is in the `Removing` status, its Regions are also in the `Removing` or `Unknown` status, both of which are unavailable statuses. In addition, the DataNode being removed does not receive new write requests from clients.
+Users can run the command `set system status to running` to change the status of the DataNode from `Removing` to `Running`.
+To make Regions in the `Removing` status available again, use the command `migrate region from datanodeId1 to datanodeId2`, which migrates the Regions to other live DataNodes.
+In addition, IoTDB will provide a `remove-datanode.sh -f` command in the next version, which removes a DataNode forcibly (Regions that fail to migrate are discarded).
+
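The removal precondition above is simple arithmetic, and a worked example may help. This is an illustrative sketch only: the function `can_remove_datanode` is hypothetical and is not an IoTDB API. It restates the rule from the FAQ using the two replication-factor names that appear there.

```python
def can_remove_datanode(available_datanodes: int,
                        schema_replication_factor: int,
                        data_replication_factor: int) -> bool:
    """Apply the FAQ rule: available DataNodes must exceed the larger replication factor."""
    required = max(schema_replication_factor, data_replication_factor)
    return available_datanodes > required


# Three DataNodes, schema factor 1, data factor 3: 3 > 3 is false, so removal is refused.
print(can_remove_datanode(3, 1, 3))
# Four DataNodes with the same factors: 4 > 3 holds, so removal may proceed.
print(can_remove_datanode(4, 1, 3))
```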
+### 5). Can a DataNode that is down be removed?
+
+- A down DataNode can be removed only when the replication factors of both schema and data are greater than 1.  
+In addition, IoTDB will provide a forced-removal option, `remove-datanode.sh -f`, in the next version.
+
+### 6). What should I pay attention to when upgrading from 0.13 to 1.0?
+
+- The file structure differs between 0.13 and 1.0, so the data directory of 0.13 cannot be copied to 1.0 and used directly.
+If you want to load data from 0.13 into 1.0, you can use the LOAD function.
+- The default RPC address of 0.13 is `0.0.0.0`, but the default RPC address of 1.0 is `127.0.0.1`.
+
+
+## 2. Cluster Restart
+
+### 1). How to restart any ConfigNode in the cluster?
+- First step: stop the process with stop-confignode.sh or by killing the PID of the ConfigNode.
+- Second step: execute start-confignode.sh to restart the ConfigNode.
+
+### 2). How to restart any DataNode in the cluster?
+- First step: stop the process with stop-datanode.sh or by killing the PID of the DataNode.
+- Second step: execute start-datanode.sh to restart the DataNode.
+
+### 3). Is it possible to restart a ConfigNode using its old data directory after it has been removed?
+- No. The result will be "Reject ConfigNode restart. Because there are no corresponding ConfigNode(whose nodeId=xx) in the cluster".
+
+### 4). Is it possible to restart a DataNode using its old data directory after it has been removed?
+- No. The result will be "Reject DataNode restart. Because there are no corresponding DataNode(whose nodeId=xx) in the cluster. Possible solutions are as follows:...".
+
+### 5). Can we execute start-confignode.sh/start-datanode.sh successfully after deleting the data directory of a given ConfigNode/DataNode without killing its process?
+- No. The result will be "The port is already occupied".
+
+## 3. Cluster Maintenance
+
+### 1). `show cluster` fails and error logs like "please check server status" are shown. How do I find the cause?
+- Make sure that more than half of the ConfigNodes are alive.
+- Make sure that the DataNode the client is connected to is alive.
+
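The "more than half" condition above is a strict majority. As a small sketch (the helper `confignode_majority_alive` is hypothetical and is not part of IoTDB), the check can be written as:

```python
def confignode_majority_alive(alive: int, total: int) -> bool:
    """Return True when strictly more than half of the ConfigNodes are alive."""
    if total <= 0 or alive < 0 or alive > total:
        raise ValueError("expected 0 <= alive <= total and total > 0")
    # Integer comparison avoids rounding: 2 of 3 passes, 1 of 2 does not.
    return 2 * alive > total
```

With three ConfigNodes the cluster tolerates one failure; with two ConfigNodes it tolerates none, because one live node out of two is exactly half rather than more than half.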
+### 2). How to fix a DataNode whose disk files are corrupted?
+- Use remove-datanode.sh. It migrates the data on the DataNode being removed to other live DataNodes.
+- IoTDB will release Node-Fix tools in the next version.
+
+### 3). How to decrease the memory usage of ConfigNode/DataNode?
+- Adjust the MAX_HEAP_SIZE and MAX_DIRECT_MEMORY_SIZE options in conf/confignode-env.sh and conf/datanode-env.sh.
+
+
diff --git a/docs/zh/UserGuide/FAQ/FAQ-for-cluster-setup.md b/docs/zh/UserGuide/FAQ/FAQ-for-cluster-setup.md
new file mode 100644
index 0000000000..066741a147
--- /dev/null
+++ b/docs/zh/UserGuide/FAQ/FAQ-for-cluster-setup.md
@@ -0,0 +1,99 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+<!-- TOC -->
+
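+# FAQ for Distributed Deployment
+
+## 1. Cluster Startup and Shutdown
+
+### 1. ConfigNode fails to start for the first time. How do I find the cause?
+
+- Make sure that the data/confignode directory has been cleared when the ConfigNode starts for the first time
+- Make sure that the <IP+Port> used by this ConfigNode is not occupied and does not conflict with the <IP+Port> used by any ConfigNode that is already running
+- Make sure that cn_target_confignode_list of this ConfigNode is configured correctly (it points to a live ConfigNode; if this ConfigNode is the first ConfigNode started, the value points to itself)
+- Make sure that the configuration items (consensus protocol, replica number, etc.) of this ConfigNode are consistent with the ConfigNode cluster referenced by cn_target_confignode_list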
+
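+### 2. The ConfigNode started successfully for the first time, but why is it missing from the results of show cluster?
+
+- Check whether cn_target_confignode_list points to the correct address. If cn_target_confignode_list points to the node itself, a new ConfigNode cluster is started
+
+### 3. DataNode fails to start for the first time. How do I find the cause?
+
+- Make sure that the data/datanode directory has been cleared when the DataNode starts for the first time. If the startup result is “Reject DataNode restart.”, the data/datanode directory was probably not cleared at startup
+- Make sure that the <IP+Port> used by this DataNode is not occupied and does not conflict with the <IP+Port> used by any DataNode that is already running
+- Make sure that dn_target_confignode_list of this DataNode points to a live ConfigNode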
+
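+### 4. Removing a DataNode fails. How do I troubleshoot it?
+
+- Check whether the parameter of the remove-datanode script is correct, that is, whether a correct ip:port or a correct dataNodeId was passed in
+- The removal is allowed only when the number of available nodes in the cluster > max(schema replica number, data replica number)
+- Removing a DataNode migrates the data on that DataNode to other live DataNodes. Data is migrated at Region granularity; if a Region fails to migrate, the DataNode being removed stays in the Removing status
+- Note: for a node in the Removing status, the Regions on it are also in the Removing or Unknown status, that is, unavailable. A node in the Removing status does not accept client requests either. 
+To make a node in the Removing status available again, users can run the set system status to running command to set the node to the Running status; 
+To make a Region that failed to migrate available again, use the migrate region from datanodeId1 to datanodeId2 command to migrate the unavailable Region to another live node. 
+In addition, IoTDB will later provide the remove-datanode.sh -f command to remove a node forcibly (Regions that fail to migrate are discarded directly)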
+
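+### 5. Can a DataNode that is down be removed?
+
+- It can be removed when the current cluster replica number is greater than 1. If the cluster replica number equals 1, removal is not supported. A forced-removal command will be released in the next version
+
+### 6. What should I pay attention to when upgrading from 0.13 to 1.0?
+
+- The file directory structure of 0.13 differs from that of 1.0, so the data directory of 0.13 cannot be copied directly into a 1.0 cluster. To import 0.13 data into 1.0, use the LOAD function
+- The default RPC address of 0.13 is 0.0.0.0, while the default RPC address of 1.0 is 127.0.0.1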
+
+
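+## 2. Cluster Restart
+
+### 1. How to restart a ConfigNode in the cluster?
+- Step 1: stop the ConfigNode process with stop-confignode.sh or by killing the process
+- Step 2: run start-confignode.sh to start the ConfigNode process, which completes the restart
+- The next version of IoTDB will provide a one-click restart operation
+
+### 2. How to restart a DataNode in the cluster?
+- Step 1: stop the DataNode process with stop-datanode.sh or by killing the process
+- Step 2: run start-datanode.sh to start the DataNode process, which completes the restart
+- The next version of IoTDB will provide a one-click restart operation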
+
+### 3. After a ConfigNode has been removed (remove-confignode), can it be restarted using its data directory?
+- No. An error is reported: Reject ConfigNode restart. Because there are no corresponding ConfigNode(whose nodeId=xx) in the cluster.
+
+### 4. After a DataNode has been removed (remove-datanode), can it be restarted using its data directory?
+- It cannot restart normally. The startup result is “Reject DataNode restart. Because there are no corresponding DataNode(whose nodeId=xx) in the cluster. Possible solutions are as follows:...”
+
+### 5. A user sees that a ConfigNode/DataNode has changed to the Unknown status, deletes its data directory without killing the corresponding process, and then runs start-confignode.sh/start-datanode.sh. Will this succeed?
+- No. Startup fails with an error that the port is already occupied
+
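+## 3. Cluster Maintenance
+
+### 1. Show cluster fails and displays “please check server status”. How do I troubleshoot it?
+- Make sure that more than half of the nodes in the ConfigNode cluster are alive
+- Make sure that the DataNode the client is connected to is alive
+
+### 2. The disk files of a DataNode are corrupted. How do I repair this node?
+- Currently this can only be done with remove-datanode. While it runs, remove-datanode migrates the data on this DataNode to other live DataNodes (provided that the replica number configured for the cluster is greater than 1)
+- The next version of IoTDB will provide a one-click node repair function
+
+### 3. How to reduce the memory used by ConfigNode and DataNode?
+- In conf/confignode-env.sh and conf/datanode-env.sh, adjust options such as MAX_HEAP_SIZE and MAX_DIRECT_MEMORY_SIZE to change the maximum on-heap and off-heap memory used by ConfigNode and DataNode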
+
+
