dlmarion commented on code in PR #5044:
URL: https://github.com/apache/accumulo/pull/5044#discussion_r1834358686


##########
server/manager/src/main/java/org/apache/accumulo/manager/tableOps/bulkVer2/LoadFiles.java:
##########
@@ -338,11 +338,15 @@ private long loadFiles(TableId tableId, Path bulkDir, LoadMappingIterator loadMa
         TabletsMetadata.builder(manager.getContext()).forTable(tableId).overlapping(startRow, null)
             .checkConsistency().fetch(PREV_ROW, LOCATION, LOADED, TIME).build()) {
 
+      // The tablet iterator and load mapping iterator are both iterating over data that is sorted
+      // in the same way. The two iterators are each independently advanced to find common points in
+      // the sorted data.

Review Comment:
   `BulkSerialize` writes a file based on a sorted map, but when the file is read back in, the code assumes its contents are still sorted. I'm just wondering if there is a test or something to validate the assumption that both inputs are sorted the same way. Is there a possibility that an admin could "fix" the bulk import file under some failure condition, causing it to no longer be sorted? Should we re-sort after reading it back in?



##########
test/src/main/java/org/apache/accumulo/test/functional/BulkNewIT.java:
##########
@@ -829,6 +831,55 @@ public void testAvailability() throws Exception {
     }
   }
 
+  @Test
+  public void testManyTabletAndFiles() throws Exception {
+    try (AccumuloClient c = 
Accumulo.newClient().from(getClientProps()).build()) {
+      String dir = getDir("/testBulkFile-");
+      FileSystem fs = getCluster().getFileSystem();
+      fs.mkdirs(new Path(dir));
+
+      TreeSet<Text> splits = IntStream.range(1, 
9000).mapToObj(BulkNewIT::row).map(Text::new)
+          .collect(Collectors.toCollection(TreeSet::new));
+      c.tableOperations().addSplits(tableName, splits);
+
+      var executor = Executors.newFixedThreadPool(16);
+      var futures = new ArrayList<Future<?>>();
+
+      var loadPlanBuilder = LoadPlan.builder();
+      var rowsExpected = new HashSet<>();
+      for (int i = 2; i < 8999; i++) {
+        int data = i;
+        String filename = "f" + data + ".";
+        loadPlanBuilder.loadFileTo(filename + RFile.EXTENSION, 
RangeType.TABLE, row(data - 1),
+            row(data));
+        var future = executor.submit(() -> {
+          writeData(dir + "/" + filename, aconf, data, data);

Review Comment:
   Does the order of the lines in the file matter? I wonder if it would be good to write them in reverse (8999 -> 2), or to put the numbers in a collection and randomize them.
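   A small sketch of the randomization idea, assuming the loop would iterate a shuffled range instead of ascending `i`. This is illustrative code, not from the PR; `ShuffledOrder` and `shuffledRange` are invented names, and the seed is surfaced so a failing order could be reproduced:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: build the file indices [start, end) in a shuffled
// order so the test does not implicitly depend on ascending write order.
public class ShuffledOrder {

  public static List<Integer> shuffledRange(int startInclusive, int endExclusive, long seed) {
    List<Integer> order = new ArrayList<>();
    for (int i = startInclusive; i < endExclusive; i++) {
      order.add(i);
    }
    // Seeded shuffle keeps a failing permutation reproducible.
    Collections.shuffle(order, new Random(seed));
    return order;
  }

  public static void main(String[] args) {
    long seed = System.nanoTime();
    System.out.println("shuffle seed: " + seed);
    for (int data : shuffledRange(2, 8999, seed)) {
      // Per-file work would go here, e.g. submitting the write task for
      // "f" + data + "." + RFile.EXTENSION as in the test above.
    }
  }
}
```

   Logging the seed on each run means a surprising failure caused by a particular write order can be replayed exactly.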



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
