dosubot[bot] commented on issue #2607:
URL: https://github.com/apache/incubator-hugegraph/issues/2607#issuecomment-2259900760

   To efficiently import roughly 1 billion incremental records per day into HugeGraph and avoid the "rest server is too busy to write" error, consider the following tuning suggestions:
   
   1. **Use bulk loading**: use HBase's bulk load feature instead of writing to HBase directly. This means generating HFiles and then loading them into HBase. The snippet below generates the HFiles and writes them to HDFS (the actual load happens in step 6):
   
       ```java
       JavaPairRDD<ImmutableBytesWritable, KeyValue> repartitionedRdd =
           buildAndSerRdd.repartitionAndSortWithinPartitions(partitioner);
       Configuration conf = sinkToHBase.getHBaseConfiguration().get();
       Job job = Job.getInstance(conf);
       HFileOutputFormat2.configureIncrementalLoadMap(job, tableDescriptor);
       conf.set("hbase.mapreduce.hfileoutputformat.table.name",
                tableDescriptor.getTableName().getNameAsString());
       String path = getHFilePath(job.getConfiguration());
       repartitionedRdd.saveAsNewAPIHadoopFile(path,
                                               ImmutableBytesWritable.class,
                                               KeyValue.class,
                                               HFileOutputFormat2.class,
                                               conf);
       LOG.info("Saved HFiles to: '{}'", path);
       flushPermission(conf, path);
       ```
   
   2. **Repartition and sort the data**: before generating HFiles, repartition and sort the data so that it is evenly distributed and ordered by row key. This reduces the load on the HBase region servers (a sketch of a region-aligned partitioner follows the snippet):
   
       ```java
       JavaPairRDD<ImmutableBytesWritable, KeyValue> repartitionedRdd =
           buildAndSerRdd.repartitionAndSortWithinPartitions(partitioner);
       ```
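
       In the loader this `partitioner` comes from the HBase sink. As a rough sketch (the class below is hypothetical, not the loader's actual code), a partitioner aligned with the table's region boundaries can be built from the region start keys returned by `RegionLocator#getStartKeys()`:

       ```java
       import java.util.Arrays;

       import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
       import org.apache.hadoop.hbase.util.Bytes;
       import org.apache.spark.Partitioner;

       // Hypothetical sketch: route each row key to the Spark partition that
       // corresponds to the HBase region which will eventually host it, so the
       // HFiles produced by one partition belong to exactly one region.
       public class RegionPartitioner extends Partitioner {

           private final byte[][] startKeys;

           public RegionPartitioner(byte[][] startKeys) {
               // Region start keys, e.g. RegionLocator#getStartKeys() fetched
               // once on the driver
               this.startKeys = startKeys;
           }

           @Override
           public int numPartitions() {
               return this.startKeys.length;
           }

           @Override
           public int getPartition(Object key) {
               byte[] rowKey = ((ImmutableBytesWritable) key).copyBytes();
               // Find the region whose [startKey, nextStartKey) range holds rowKey
               int idx = Arrays.binarySearch(this.startKeys, rowKey,
                                             Bytes.BYTES_COMPARATOR);
               return idx >= 0 ? idx : -(idx + 1) - 1;
           }
       }
       ```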
   
   3. **Tune the HBase configuration**: adjust HBase so it can absorb the write volume. Key parameters include (see the `hbase-site.xml` sketch below):
       - `hbase.regionserver.handler.count`: increase the number of RPC handlers to allow more concurrent operations.
       - `hbase.regionserver.global.memstore.upperLimit`: raise the memstore limit so that more writes can be buffered in memory.
       - `hbase.hstore.blockingStoreFiles`: raise the number of store files a store may accumulate before further writes are blocked while compaction catches up.
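
       These are server-side settings, normally configured in `hbase-site.xml` on the region servers and followed by a rolling restart. A minimal sketch with purely illustrative values (tune them to your hardware; they are not values mandated by HugeGraph):

       ```xml
       <!-- hbase-site.xml on each region server; values below are examples only -->
       <property>
           <name>hbase.regionserver.handler.count</name>
           <value>100</value>
       </property>
       <property>
           <name>hbase.regionserver.global.memstore.upperLimit</name>
           <value>0.45</value>
       </property>
       <property>
           <name>hbase.hstore.blockingStoreFiles</name>
           <value>100</value>
       </property>
       ```

       Note that on HBase 2.x, `hbase.regionserver.global.memstore.upperLimit` has been replaced by `hbase.regionserver.global.memstore.size`.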
   
   4. **Monitor and scale the HBase cluster**: make sure the cluster is sized appropriately for the load. Monitor cluster performance and add region servers when necessary.
   
   5. **Optimize the HFile generation path**: make sure the HFile output path is prepared and cleaned before use.
   
       ```java
       public String getHFilePath(Configuration conf) throws IOException {
           FileSystem fs = FileSystem.get(conf);
           String fileID = fileID();
           String pathStr = fs.getWorkingDirectory().toString() +
                            "/hfile-gen/" + fileID + "/";
           Path hfileGenPath = new Path(pathStr);
           if (fs.exists(hfileGenPath)) {
               // Clean up any leftover HFiles from a previous run
               LOG.info("Delete the path where the hfile is generated, path {}",
                        pathStr);
               fs.delete(hfileGenPath, true);
           }
           return pathStr;
       }
       ```
   
   6. **Bulk-load the HFiles**: finally, load the generated HFiles into HBase using the bulk load feature (a sketch of the underlying HBase call follows the snippet).
   
       ```java
       public void loadFiles(String path) {
           try {
               // BulkLoad HFile to HBase
               sinkToHBase.loadHfiles(path, getTableName());
           } catch (Exception e) {
               LOG.error(" Failed to load hfiles", e);
           }
       }
       ```
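
       Here `sinkToHBase.loadHfiles` is the loader's own wrapper. As a rough sketch (assuming HBase 2.x; everything except the HBase API names is illustrative), the underlying call boils down to the bulk-load tool shipped with HBase:

       ```java
       import java.io.IOException;

       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.hbase.TableName;
       import org.apache.hadoop.hbase.tool.BulkLoadHFiles;

       // Hypothetical helper: hand the generated HFiles over to the region
       // servers, which move them into the table's store files.
       public final class HFileBulkLoader {

           private HFileBulkLoader() {
           }

           public static void bulkLoad(Configuration conf, String hfileDir,
                                       String tableName) throws IOException {
               BulkLoadHFiles loader = BulkLoadHFiles.create(conf);
               loader.bulkLoad(TableName.valueOf(tableName), new Path(hfileDir));
           }
       }
       ```

       Because bulk load bypasses the normal write path (WAL and memstore), it puts far less pressure on the HugeGraph REST server and the region servers than row-by-row writes.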
   
   By following these practices, you can import large volumes of data into HBase efficiently and minimize the risk of hitting the "rest server is too busy to write" error<sup>[[1]](https://github.com/apache/incubator-hugegraph-toolchain/blob/master/hugegraph-loader/src/main/java/org/apache/hugegraph/loader/direct/loader/HBaseDirectLoader.java)</sup>.
   
   
   

