[ 
https://issues.apache.org/jira/browse/ACCUMULO-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366690#comment-16366690
 ] 

Keith Turner commented on ACCUMULO-4813:
----------------------------------------

[~m-hogue] it would be additive. I think it would be nice to deprecate the 
current bulk import process in favor of this.  This could new done by adding 
new APIs to do bulk import with a mapping file and deprecating the current APIs.

> Accepting mapping file for bulk import
> --------------------------------------
>
>                 Key: ACCUMULO-4813
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4813
>             Project: Accumulo
>          Issue Type: Sub-task
>            Reporter: Keith Turner
>            Priority: Major
>             Fix For: 2.0.0
>
>
> During bulk import, inspecting files to determine where they go is expensive 
> and slow.  In order to spread the cost, Accumulo has an internal mechanism to 
> spread the work of inspecting files to random tablet servers.  Because this 
> internal process takes time and consumes resources on the cluster, users want 
> control over it.  The best way to give this control may be to externalize it 
> by allowing bulk imports to have a mapping file.  This mapping file would 
> specify the ranges where files should be loaded.  If Accumulo provided API to 
> help produce this file, then that work could be done in Map Reduce or Spark.  
> This would give users all the control they want over when and where this 
> computation is done.  This would naturally fit in the process used to create 
> the bulk files. 
> To make bulk import fast this mapping file should have the following 
> properties.
>  * Key in file is a range
>  * Value in file is a list of files
>  * Ranges are non overlapping
>  * File is sorted by range/key
>  * Has a mapping for every non-empty file in the bulk import directory.
> If Accumulo provides APIs to do the following operation, then producing the 
> file could written as a map/reduce job.
>  * For a given rfile produce a list of row ranges where the file should be 
> loaded.  These row ranges would be based on tablets.
>  * Merge row range,list of file pairs
>  * Serialize row range,list of files pairs
> With a mapping file, the bulk import algorithm could be written as follows.  
> This could all be executed in the master with no need to run inspection task 
> on random tablet servers.
>  * Sanity check file
>  ** Ensure in sorted order
>  ** Ensure ranges are non-overlapping
>  ** Ensure each file in directory has at least one entry in file
>  ** Ensure all splits in the file exist in the table.
>  * Since file is sorted can do a merged read of file and metadata table, 
> looping over the following operations for each tablet until all files are 
> loaded.
>  ** Read the loaded files for the tablet
>  ** Read the files to load for the range
>  ** For any files not loaded, send an async load message to the tablet server
> The above algorithm can just keep scanning the metadata table and sending 
> async load messages until the bulk import is complete.  Since the load 
> messages are async, the bulk load could of a large number of files could 
> potentially be very fast.
> The bulk load operation can easily handle the case of tablets splitting 
> during the operation by matching a single range in the file to multiple 
> tablets.  However attempting to handle merges would be a lot more tricky.  It 
> would probably be simplest to fail the operation if a merge is detected.  The 
> nice thing is that this can be done in a very clean way.   Once the bulk 
> import operation has the table lock, merges can not happen.  So after getting 
> the table lock the bulk import operation can ensure all splits in the file 
> exist in the table. The operation can abort if the condition is not met 
> before doing any work.  If this condition is not met, it indicates a merge 
> happened between generating the mapping file an doing the bulk import.
> Hopefully the mapping file plus the algorithm that sends async load messages 
> can dramatically speed up bulk import operations.  This may lessen the need 
> for other things like prioritizing bulk import.  To measure this, it would be 
> very useful create a bulk import performance test that can create many files 
> with very little data and measure the time it takes load them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to