keith-turner opened a new pull request, #4898:
URL: https://github.com/apache/accumulo/pull/4898
This is a prototype of a few new APIs that allow distributing the examination
of files for bulk import v2. Any feedback would be appreciated.
For a given bulk import directory with N files this would support a use case
like the following.
1. For each file a task is spun up on a remote server that calls the new
LoadPlan.compute() API to determine what tablets the file overlaps. Then the
new LoadPlan.toJson() method is called to serialize the load plan and send it
to a central place.
2. All the load plans from the remote servers are deserialized by calling the
new LoadPlan.fromJson() method and merged into a single load plan that is used
to do the bulk import.
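
As a rough illustration of the analysis in step 1, the sketch below determines which tablets a file overlaps, given the table's split points and the file's first and last row. It is not the Accumulo implementation and does not use the LoadPlan API: the class and method names are hypothetical, rows are modeled as plain strings, extents are rendered as strings, and the file's first and last row are assumed to be already known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical, stdlib-only model of "which tablets does this file overlap".
class TabletOverlapSketch {

  // A tablet covers rows in (prevEndRow, endRow]. The first tablet has no
  // lower bound and the last tablet has no upper bound.
  static List<String> overlapping(SortedSet<String> splits, String firstRow, String lastRow) {
    List<String> extents = new ArrayList<>();

    // The greatest split strictly less than firstRow is the previous end row
    // of the first tablet the file touches.
    SortedSet<String> before = splits.headSet(firstRow);
    String prev = before.isEmpty() ? "-inf" : before.last();

    // Walk the splits at or after firstRow until one covers lastRow.
    for (String split : splits.tailSet(firstRow)) {
      extents.add("(" + prev + "," + split + "]");
      if (split.compareTo(lastRow) >= 0) {
        return extents; // this tablet contains lastRow
      }
      prev = split;
    }

    // lastRow sorts after every split, so the file reaches the last tablet.
    extents.add("(" + prev + ",+inf)");
    return extents;
  }
}
```

With splits g, m and t, a file spanning rows h through p yields the extents (g,m] and (m,t]. Each remote task only needs the split points and its own file, which is why this step can be distributed.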
Another use case these new APIs could support is running this new code in
the map reduce job that generates bulk import data.
1. After each reducer produces an rfile, it could call the new
LoadPlan.compute(), then call LoadPlan.toJson() and save the result to a file.
After the map reduce job completes, each rfile would have a corresponding file
with a load plan for that file.
2. Another process that runs after the map reduce job can read all the
load plan files, deserialize them using the new LoadPlan.fromJson() method,
and merge them. Then the merged LoadPlan can be used to do the bulk import.
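
The two steps above can be modeled with a small, stdlib-only sketch. It is an assumption-laden stand-in, not Accumulo code: the `.plan` suffix, the class and method names, and the one-extent-per-line text format are all hypothetical, with the text format taking the place of what LoadPlan.toJson() and LoadPlan.fromJson() would produce and consume.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of per-rfile "sidecar" plan files and their merge.
class SidecarPlanSketch {

  // Reducer side: save the plan for one rfile next to it,
  // e.g. "f1.rf" -> "f1.rf.plan" (naming is an assumption).
  static Path writeSidecar(Path dir, String rfileName, List<String> extents) throws IOException {
    Path sidecar = dir.resolve(rfileName + ".plan");
    return Files.write(sidecar, extents, StandardCharsets.UTF_8);
  }

  // Post-job side: read every sidecar and merge into one plan that maps
  // each rfile name to the extents it should be loaded into.
  static Map<String, List<String>> mergeSidecars(Path dir) throws IOException {
    Map<String, List<String>> merged = new TreeMap<>();
    try (DirectoryStream<Path> plans = Files.newDirectoryStream(dir, "*.plan")) {
      for (Path plan : plans) {
        String name = plan.getFileName().toString();
        String rfileName = name.substring(0, name.length() - ".plan".length());
        merged.put(rfileName, Files.readAllLines(plan, StandardCharsets.UTF_8));
      }
    }
    return merged;
  }
}
```

Because each sidecar describes exactly one rfile, merging is a simple union with no conflicts to resolve, and the process doing the bulk import never has to open the rfiles themselves.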
BulkNewIT.testComputeLoadPlan() simulates this map reduce use case by going
through the steps in code that a map reduce job would. This tests the new APIs
and shows what using them would look like.
Both of these use cases avoid analyzing all of the files on the single
machine doing the bulk import. Bulk import v1 had distributed analysis and
would ask random tservers to do the file analysis. This could cause unexpected
load on those tservers. Bulk v1 would also interleave analyzing files and
adding them to tablets. This could lead to odd situations where a file is
added to some tablets and then analysis fails, leaving the file partially
imported. Bulk v2 does all analysis before any files are added to tablets, but
it lacks this distributed analysis capability. This is an initial attempt to
offer that functionality in bulk v2.