[
https://issues.apache.org/jira/browse/ACCUMULO-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366690#comment-16366690
]
Keith Turner commented on ACCUMULO-4813:
----------------------------------------
[~m-hogue] it would be additive. I think it would be nice to deprecate the
current bulk import process in favor of this. This could be done by adding
new APIs to do bulk import with a mapping file and deprecating the current APIs.
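A rough sketch of what such additive APIs might look like is below. The names and signatures are illustrative assumptions only, not the actual Accumulo API; the only real signature is the existing importDirectory method that would be deprecated.

{code:java}
import org.apache.hadoop.fs.Path;

// Illustrative only: one possible shape for an additive, mapping-file based
// bulk import API alongside the existing (to-be-deprecated) entry point.
public interface TableOperationsSketch {

  // existing bulk import entry point, kept but deprecated
  @Deprecated
  void importDirectory(String tableName, String dir, String failureDir, boolean setTime)
      throws Exception;

  // hypothetical new entry point: the caller supplies a precomputed mapping
  // file describing which row ranges each bulk file should be loaded into
  void importDirectory(String tableName, String dir, Path mappingFile, boolean setTime)
      throws Exception;
}
{code}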
> Accepting mapping file for bulk import
> --------------------------------------
>
> Key: ACCUMULO-4813
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4813
> Project: Accumulo
> Issue Type: Sub-task
> Reporter: Keith Turner
> Priority: Major
> Fix For: 2.0.0
>
>
> During bulk import, inspecting files to determine where they go is expensive
> and slow. In order to spread the cost, Accumulo has an internal mechanism to
> spread the work of inspecting files to random tablet servers. Because this
> internal process takes time and consumes resources on the cluster, users want
> control over it. The best way to give this control may be to externalize it
> by allowing bulk imports to have a mapping file. This mapping file would
> specify the ranges where files should be loaded. If Accumulo provided an API to
> help produce this file, then that work could be done in MapReduce or Spark.
> This would give users all the control they want over when and where this
> computation is done. This would naturally fit in the process used to create
> the bulk files.
> To make bulk import fast, this mapping file should have the following
> properties (a rough sketch of this structure follows the list).
> * Key in file is a range
> * Value in file is a list of files
> * Ranges are non overlapping
> * File is sorted by range/key
> * Has a mapping for every non-empty file in the bulk import directory.
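> A minimal sketch of what such a mapping could look like in memory is below; the class and field names are assumptions for illustration, not an existing Accumulo type.
>
> {code:java}
> import java.util.List;
> import java.util.SortedMap;
> import java.util.TreeMap;
> import org.apache.accumulo.core.data.Range;
>
> // Illustrative in-memory form of the proposed mapping file: sorted,
> // non-overlapping row ranges, each mapped to the bulk files that have data in
> // that range. The writer must ensure every non-empty file in the bulk import
> // directory appears in at least one entry.
> class LoadMappingSketch {
>   final SortedMap<Range, List<String>> rangeToFiles = new TreeMap<>();
> }
> {code}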
> If Accumulo provides APIs to do the following operations, then producing the
> file could be written as a map/reduce job (a sketch follows the list).
> * For a given rfile produce a list of row ranges where the file should be
> loaded. These row ranges would be based on tablets.
> * Merge (row range, list of files) pairs
> * Serialize (row range, list of files) pairs
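> One possible shape for those three operations is sketched below; every name here is a hypothetical placeholder, not an existing Accumulo API.
>
> {code:java}
> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.Collection;
> import java.util.List;
> import java.util.SortedMap;
> import java.util.SortedSet;
> import org.apache.accumulo.core.data.Range;
> import org.apache.hadoop.io.Text;
>
> // Hypothetical helper API: a map/reduce job could call computeLoadRanges in
> // the map phase, merge in the reduce phase, and serialize when writing output.
> interface BulkMappingToolsSketch {
>
>   // For one rfile, compute the tablet-based row ranges it should be loaded into.
>   List<Range> computeLoadRanges(String rfilePath, SortedSet<Text> tableSplits);
>
>   // Merge several partial (range -> files) mappings into one sorted mapping.
>   SortedMap<Range, List<String>> merge(Collection<SortedMap<Range, List<String>>> partials);
>
>   // Serialize the merged mapping into the file handed to bulk import.
>   void serialize(SortedMap<Range, List<String>> mapping, OutputStream out) throws IOException;
> }
> {code}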
> With a mapping file, the bulk import algorithm could be written as follows.
> This could all be executed in the master with no need to run inspection tasks
> on random tablet servers.
> * Sanity check the mapping file
> ** Ensure it is in sorted order
> ** Ensure ranges are non-overlapping
> ** Ensure each file in the bulk import directory has at least one entry in the mapping file
> ** Ensure all splits in the mapping file exist in the table.
> * Since the mapping file is sorted, do a merged read of the mapping file and
> the metadata table, looping over the following operations for each tablet
> until all files are loaded.
> ** Read the loaded files for the tablet
> ** Read the files to load for the range
> ** For any files not loaded, send an async load message to the tablet server
> The above algorithm can just keep scanning the metadata table and sending
> async load messages until the bulk import is complete (a rough sketch of this
> driver loop follows). Since the load messages are async, the bulk load of a
> large number of files could potentially be very fast.
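> A very rough sketch of that master-side driver loop is below; the helper types (MappingFileReader, MetadataScanner, TabletEntry, TabletServerClient) are assumptions made up for illustration, not Accumulo classes. Only the overall flow from the description above is intended.
>
> {code:java}
> import java.util.List;
> import java.util.Set;
>
> // Sketch only: both the mapping file and the metadata table are sorted by row
> // range, so they can be walked together, sending async load messages for any
> // file a tablet has not yet loaded, and re-scanning until everything is loaded.
> class BulkLoadDriverSketch {
>
>   interface MappingFileReader { List<String> filesForRange(String tabletExtent); }
>   interface TabletServerClient { void sendAsyncLoad(String file); }
>   interface TabletEntry { String extent(); Set<String> loadedFiles(); TabletServerClient server(); }
>   interface MetadataScanner { Iterable<TabletEntry> scan(); }
>
>   void driveBulkLoad(MappingFileReader mapping, MetadataScanner metadata) {
>     boolean allLoaded = false;
>     while (!allLoaded) {
>       allLoaded = true;
>       for (TabletEntry tablet : metadata.scan()) {          // merged read per tablet
>         Set<String> loaded = tablet.loadedFiles();          // files already loaded
>         List<String> toLoad = mapping.filesForRange(tablet.extent());
>         for (String file : toLoad) {
>           if (!loaded.contains(file)) {
>             allLoaded = false;
>             tablet.server().sendAsyncLoad(file);            // async; re-check on next pass
>           }
>         }
>       }
>     }
>   }
> }
> {code}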
> The bulk load operation can easily handle the case of tablets splitting
> during the operation by matching a single range in the file to multiple
> tablets. However, attempting to handle merges would be much trickier. It
> would probably be simplest to fail the operation if a merge is detected. The
> nice thing is that this can be done in a very clean way. Once the bulk
> import operation has the table lock, merges cannot happen. So after getting
> the table lock, the bulk import operation can ensure all splits in the
> mapping file exist in the table. If that condition is not met, the operation
> can abort before doing any work, since it indicates a merge happened between
> generating the mapping file and doing the bulk import.
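> An illustrative version of that merge-detection check is below; the helper name and the way the two split sets are obtained are assumptions, not existing Accumulo code.
>
> {code:java}
> import java.util.SortedSet;
> import org.apache.hadoop.io.Text;
>
> // Sketch of the merge-detection check: with the table lock held, every split
> // (range boundary) referenced by the mapping file must still exist in the
> // table, otherwise a merge happened after the mapping file was generated.
> class MergeCheckSketch {
>   static void ensureNoMergeSinceMapping(SortedSet<Text> mappingSplits, SortedSet<Text> tableSplits) {
>     for (Text split : mappingSplits) {
>       if (!tableSplits.contains(split)) {
>         throw new IllegalStateException("Split " + split
>             + " from the mapping file no longer exists in the table;"
>             + " a merge likely occurred, failing the bulk import");
>       }
>     }
>   }
> }
> {code}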
> Hopefully the mapping file plus the algorithm that sends async load messages
> can dramatically speed up bulk import operations. This may lessen the need
> for other things like prioritizing bulk import. To measure this, it would be
> very useful to create a bulk import performance test that can create many
> files with very little data and measure the time it takes to load them.
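> A minimal sketch of such a timing harness is below; it times the current importDirectory call, and createTinyRFile is a hypothetical helper standing in for whatever mechanism is used to produce the small test files.
>
> {code:java}
> import java.util.concurrent.TimeUnit;
> import org.apache.accumulo.core.client.Connector;
>
> // Sketch of the suggested performance test: create many tiny bulk files, then
> // measure how long a single bulk import of all of them takes.
> class BulkImportPerfTestSketch {
>
>   long timeBulkImport(Connector conn, String table, String bulkDir, String failDir, int numFiles)
>       throws Exception {
>     for (int i = 0; i < numFiles; i++) {
>       createTinyRFile(bulkDir + "/bulk_" + i + ".rf");   // hypothetical helper, a few entries per file
>     }
>     long start = System.nanoTime();
>     conn.tableOperations().importDirectory(table, bulkDir, failDir, false);
>     return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
>   }
>
>   void createTinyRFile(String path) throws Exception {
>     // placeholder: write an rfile containing very little data at the given path
>   }
> }
> {code}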