Pankit Thapar created HIVE-8038:
-----------------------------------

             Summary: Decouple ORC files split calculation logic from 
Filesystem's get file location implementation
                 Key: HIVE-8038
                 URL: https://issues.apache.org/jira/browse/HIVE-8038
             Project: Hive
          Issue Type: Improvement
          Components: File Formats
    Affects Versions: 0.13.1
            Reporter: Pankit Thapar
             Fix For: 0.14.0


What is the Current Logic
======================
1.get the file blocks from FileSystem.getFileBlockLocations() which returns an 
array of BlockLocation
2.In SplitGenerator.createSplit(), check if split only spans one block or 
multiple blocks.
3.If split spans just one block, then using the array index (index = 
offset/blockSize), get the corresponding host having the blockLocation
4.If the split spans multiple blocks, then get all hosts that have at least 80% 
of the max of total data in split hosted by any host.
5.add the split to a list of splits

Issue with Current Logic
=====================
Dependency on FileSystem API’s logic for block location calculations. It 
returns an array and we need to rely on FileSystem to  
make all blocks of same size if we want to directly access a block from the 
array.
 
What is the Fix
=============
1a.get the file blocks from FileSystem.getFileBlockLocations() which returns an 
array of BlockLocation
1b.convert the array into a tree map <offset, BlockLocation> and return it 
through getLocationsWithOffSet()
2.In SplitGenerator.createSplit(), check if split only spans one block or 
multiple blocks.
3.If split spans just one block, then using Tree.floorEntry(key), get the 
highest entry smaller than offset for the split and get the corresponding host.
4a.If the split spans multiple blocks, get a submap, which contains all entries 
containing blockLocations from the offset to offset + length
4b.get all hosts that have at least 80% of the max of total data in split 
hosted by any host.
5.add the split to a list of splits

What are the major changes in logic
==============================
1. store BlockLocations in a Map instead of an array
2. Call SHIMS.getLocationsWithOffSet() instead of getLocations()
3. one block case is checked by "if(offset + length <= start.getOffset() + 
start.getLength())"  instead of "if((offset % blockSize) + length <= blockSize)"

What is the affect on Complexity (Big O)
=================================

1. We add a O(n) loop to build a TreeMap from an array but its a one time cost 
and would not be called for each split
2. In case of one block case, we can get the block in O(logn) worst case which 
was O(1) before
3. Getting the submap is O(logn)
4. In case of multiple block case, building the list of hosts is O(m) which was 
O(n) & m < n as previously we were iterating 
   over all the block locations but now we are only iterating only blocks that 
belong to that range go offsets that we need. 

What are the benefits of the change
==============================
1. With this fix, we do not depend on the blockLocations returned by FileSystem 
to figure out the block corresponding to the offset and blockSize
2. Also, it is not necessary that block lengths is same for all blocks for all 
FileSystems
3. Previously we were using blockSize for one block case and block.length for 
multiple block case, which is not the case now. We figure out the block
   depending upon the actual length and offset of the block



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to