[GitHub] [accumulo] cshannon commented on a diff in pull request #3424: Improve performance of BulkImport.estimateSizes()

via GitHub Fri, 26 May 2023 10:19:15 -0700


cshannon commented on code in PR #3424:
URL: https://github.com/apache/accumulo/pull/3424#discussion_r1207101071



##########
core/src/main/java/org/apache/accumulo/core/clientImpl/bulk/BulkImport.java:
##########
@@ -278,33 +278,34 @@ public static Map<KeyExtent,Long> 
estimateSizes(AccumuloConfiguration acuConf,
 
     Text row = new Text();
 
-    FileSKVIterator index =
-        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, 
ns, ns.getConf(), cs)
-            
.withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build();
+    List<KeyExtent> sortedKeyExtents = new ArrayList<>(counts.keySet());
 
-    try {
+    try (FileSKVIterator index =
+        FileOperations.getInstance().newIndexReaderBuilder().forFile(dataFile, 
ns, ns.getConf(), cs)
+            
.withTableConfiguration(acuConf).withFileLenCache(fileLenCache).build()) {
       while (index.hasTop()) {
         Key key = index.getTopKey();
         totalIndexEntries++;
         key.getRow(row);
 
-        // TODO this could use a binary search
-        for (Entry<KeyExtent,MLong> entry : counts.entrySet()) {
-          if (entry.getKey().contains(row)) {
-            entry.getValue().l++;
+        int start = 0, end = sortedKeyExtents.size() - 1;

Review Comment:
   The biggest problem here is you are defeating the entire purpose by creating 
a new ArrayList. Creating a new collection copies all the entries so at that 
point you are actually just iterating over the entire collection anyways, and 
then now have to iterate again later. So this change actually makes things 
slower and use more memory than before.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [accumulo] cshannon commented on a diff in pull request #3424: Improve performance of BulkImport.estimateSizes()

Reply via email to