[PR] prototype of bulk import v2 distributed file examination [accumulo]

via GitHub Tue, 17 Sep 2024 15:23:02 -0700


keith-turner opened a new pull request, #4898:
URL: https://github.com/apache/accumulo/pull/4898


   This is a prototype of a few new APIs that allow distributing the 
examination of files for bulk import v2.  Any feedback would be appreciated.
   
   For a given bulk import directory with N files this would support a use case 
like the following.
   
    1. For each file, a task is spun up on a remote server that calls the new 
LoadPlan.compute() API to determine which tablets the file overlaps. The task 
then calls the new LoadPlan.toJson() method to serialize the load plan and 
sends it to a central place.
    2. All the load plans from the remote servers are deserialized by calling 
the new LoadPlan.fromJson() method and merged into a single load plan that is 
used to do the bulk import.
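
   The two steps above can be sketched in plain Java. This is only an 
illustration of the idea, not the prototype's code: the class LoadPlanSketch, 
its Destination type, and the compute() and merge() signatures are hypothetical 
stand-ins that use no Accumulo classes. The sketch also simplifies by looking 
only at a file's first and last row, so it may cover tablets that hold no data 
from the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for the workflow; none of these names are Accumulo APIs.
class LoadPlanSketch {

  /** One plan entry: the file should be loaded into tablets covering (prevEndRow, endRow]. */
  static final class Destination {
    final String file;
    final String prevEndRow; // exclusive lower bound; null means start of table
    final String endRow;     // inclusive upper bound; null means end of table

    Destination(String file, String prevEndRow, String endRow) {
      this.file = file;
      this.prevEndRow = prevEndRow;
      this.endRow = endRow;
    }

    @Override
    public String toString() {
      return file + ":(" + prevEndRow + "," + endRow + "]";
    }
  }

  /** Step 1 (runs per file, e.g. on a remote server): map a file's row span onto table splits. */
  static Destination compute(String file, String firstRow, String lastRow,
      TreeSet<String> splits) {
    // A tablet holds rows in (prevEndRow, endRow], so the first overlapped tablet
    // starts at the greatest split strictly below firstRow ...
    String prevEndRow = splits.lower(firstRow);
    // ... and the last overlapped tablet ends at the least split >= lastRow.
    String endRow = splits.ceiling(lastRow);
    return new Destination(file, prevEndRow, endRow);
  }

  /** Step 2 (runs centrally): merge the per-file plans into one plan. */
  static List<Destination> merge(List<List<Destination>> perFilePlans) {
    List<Destination> merged = new ArrayList<>();
    for (List<Destination> plan : perFilePlans) {
      merged.addAll(plan);
    }
    return merged;
  }

  public static void main(String[] args) {
    // Splits g, m, t give tablets (-inf,g], (g,m], (m,t], (t,+inf).
    TreeSet<String> splits = new TreeSet<>(Arrays.asList("g", "m", "t"));
    List<Destination> plan1 = List.of(compute("f1.rf", "h", "p", splits));
    List<Destination> plan2 = List.of(compute("f2.rf", "u", "z", splits));
    System.out.println(merge(List.of(plan1, plan2)));
  }
}
```

   In the real workflow the per-file plan would be serialized with 
LoadPlan.toJson() between step 1 and step 2; the sketch skips serialization 
and passes the objects directly.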
   
   Another use case these new APIs could support is running this new code in 
the map reduce job that generates bulk import data.
   
     1. After a reducer produces an rfile, it could call the new 
LoadPlan.compute(), then call LoadPlan.toJson() and save the result to a file.  
So after the map reduce job completes, each rfile would have a corresponding 
file with a load plan for that rfile.
     2. Another process that runs after the map reduce job can read all the 
load plans from those files using the new LoadPlan.fromJson() method and merge 
them.  Then the merged LoadPlan can be used to do the bulk import.
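
   The file handling in this use case might look roughly like the following. 
It is a sketch under assumptions: PlanFilesSketch, savePlan(), loadPlans(), the 
".plan.json" suffix, and the separate plan directory are all hypothetical, and 
each plan is treated as an opaque string that stands in for the output of 
LoadPlan.toJson(). The placeholder strings are not the real JSON format.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; it uses only java.nio and no Accumulo classes.
class PlanFilesSketch {

  // Assumed naming convention: <rfile name>.plan.json
  static final String SUFFIX = ".plan.json";

  /** Reducer side: write the serialized plan for one rfile to its own file. */
  static Path savePlan(Path planDir, String rfileName, String planJson) throws IOException {
    Files.createDirectories(planDir);
    return Files.writeString(planDir.resolve(rfileName + SUFFIX), planJson);
  }

  /** Post-job side: read every saved plan back as a string. */
  static List<String> loadPlans(Path planDir) throws IOException {
    List<String> plans = new ArrayList<>();
    try (DirectoryStream<Path> files = Files.newDirectoryStream(planDir, "*" + SUFFIX)) {
      for (Path file : files) {
        plans.add(Files.readString(file));
      }
    }
    // Directory iteration order is unspecified, so sort for a stable result.
    Collections.sort(plans);
    return plans;
  }

  public static void main(String[] args) throws IOException {
    Path planDir = Files.createTempDirectory("plans");
    // Placeholder payloads; a real job would store LoadPlan.toJson() output here.
    savePlan(planDir, "f1.rf", "placeholder plan for f1.rf");
    savePlan(planDir, "f2.rf", "placeholder plan for f2.rf");
    System.out.println(loadPlans(planDir).size() + " plans loaded");
  }
}
```

   Each string returned by loadPlans() would then be passed to 
LoadPlan.fromJson() and the resulting plans merged before starting the import.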
   
   BulkNewIT.testComputeLoadPlan() simulates this map reduce use case by going 
through the same steps in code that a map reduce job would.  It tests the new 
APIs and shows what using them would look like.
   
   Both of these use cases avoid analyzing all of the files on the single 
machine that does the bulk import.  Bulk import v1 distributed this analysis by 
asking random tservers to do it, which could cause unexpected load on those 
tservers.  Bulk v1 would also interleave analyzing files and adding them to 
tablets.  This could lead to odd situations where a file had already been added 
to some tablets when analysis failed, leaving the file partially imported.  
Bulk v2 does all analysis before any files are added to tablets, but it lacks 
this distributed analysis capability.  This is an initial attempt to offer that 
functionality in bulk v2.


