[ https://issues.apache.org/jira/browse/ACCUMULO-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361089#comment-16361089 ]
Keith Turner edited comment on ACCUMULO-4813 at 2/12/18 5:06 PM: ----------------------------------------------------------------- Conceptually the contents of this special mapping file would be something like. {code:java} SortedMap<RowRange, List<FileName>> {code} The file would need a special file extension, not sure what it would be. Maybe .lm for load mapping? was (Author: kturner): Conceptually the contents of this special mapping file would something like. {code:java} SortedMap<RowRange, List<FileName>> {code} The file would need a special file extension, not sure what it would be. Maybe .lm for load mapping? > Accepting mapping file for bulk import > -------------------------------------- > > Key: ACCUMULO-4813 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4813 > Project: Accumulo > Issue Type: Sub-task > Reporter: Keith Turner > Priority: Major > Fix For: 2.0.0 > > > During bulk import, inspecting files to determine where they go is expensive > and slow. In order to spread the cost, Accumulo has an internal mechanism to > spread the work of inspecting files to random tablet servers. Because this > internal process takes time and consumes resources on the cluster, users want > control over it. The best way to give this control may be to externalize it > by allowing bulk imports to have a mapping file. This mapping file would > specify the ranges where files should be loaded. If Accumulo provided API to > help produce this file, then that work could be done in Map Reduce or Spark. > This would give users all the control they want over when and where this > computation is done. This would naturally fit in the process used to create > the bulk files. > To make bulk import fast this mapping file should have the following > properties. > * Key in file is a range > * Value in file is a list of files > * Ranges are non overlapping > * File is sorted by range/key > * Has a mapping for every non-empty file in the bulk import directory. > If Accumulo provides APIs to do the following operation, then producing the > file could written as a map/reduce job. > * For a given file produce a list of ranges > * Merge range,list of file pairs > * Serialize range,list of files pairs > With a mapping file, the bulk import algorithm could be written as follows. > This could all be executed in the master with no need to run inspection task > on random tablet servers. > * Sanity check file > ** Ensure in sorted order > ** Ensure ranges are non-overlapping > ** Ensure each file in directory has at least one entry in file > ** Ensure all splits in the file exist in the table. > * Since file is sorted can do a merged read of file and metadata table, > looping over the following operations for each tablet until all files are > loaded. > ** Read the loaded files for the tablet > ** Read the files to load for the range > ** For any files not loaded, send an async load message to the tablet server > The above algorithm can just keep scanning the metadata table and sending > async load messages until the bulk import is complete. Since the load > messages are async, the bulk load could of a large number of files could > potentially be very fast. > The bulk load operation can easily handle the case of tablets splitting > during the operation by matching a single range in the file to multiple > tablets. However attempting to handle merges would be a lot more tricky. It > would probably be simplest to fail the operation if a merge is detected. The > nice thing is that this can be done in a very clean way. Once the bulk > import operation has the table lock, merges can not happen. So after getting > the table lock the bulk import operation can ensure all splits in the file > exist in the table. The operation can abort if the condition is not met > before doing any work. If this condition is not met, it indicates a merge > happened between generating the mapping file an doing the bulk import. > Hopefully the mapping file plus the algorithm that sends async load messages > can dramatically speed up bulk import operations. This may lessen the need > for other things like prioritizing bulk import. To measure this, it would be > very useful create a bulk import performance test that can create many files > with very little data and measure the time it takes load them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)