[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-8038:
--------------------------
      Resolution: Fixed
    Release Note: HIVE-8038: Decouple ORC files split calculation logic from fixed-size block assumptions. (Pankit Thapar via Gopal V)
          Status: Resolved  (was: Patch Available)

Committed to trunk, thanks [~pankit]

> Decouple ORC files split calculation logic from Filesystem's get file location implementation
> ---------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8038
>                 URL: https://issues.apache.org/jira/browse/HIVE-8038
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 0.13.1
>            Reporter: Pankit Thapar
>            Assignee: Pankit Thapar
>             Fix For: 0.14.0
>
>         Attachments: HIVE-8038.2.patch, HIVE-8038.3.patch, HIVE-8038.patch
>
>
> What is the Current Logic
> =========================
> 1. Get the file blocks from FileSystem.getFileBlockLocations(), which returns an array of BlockLocation.
> 2. In SplitGenerator.createSplit(), check whether the split spans one block or multiple blocks.
> 3. If the split spans just one block, use the array index (index = offset/blockSize) to get the hosts of the corresponding BlockLocation.
> 4. If the split spans multiple blocks, get all hosts that hold at least 80% of the maximum amount of the split's data hosted by any single host.
> 5. Add the split to a list of splits.
>
> Issue with Current Logic
> ========================
> Dependency on the FileSystem API’s logic for block location calculations. It returns an array, and we have to rely on the FileSystem to make all blocks the same size if we want to index directly into that array.
>
> What is the Fix
> ===============
> 1a. Get the file blocks from FileSystem.getFileBlockLocations(), which returns an array of BlockLocation.
> 1b. Convert the array into a TreeMap and return it through getLocationsWithOffSet().
> 2. In SplitGenerator.createSplit(), check whether the split spans one block or multiple blocks.
> 3. If the split spans just one block, use TreeMap.floorEntry(key) to get the entry with the greatest key less than or equal to the split's offset, and get the corresponding hosts.
> 4a. If the split spans multiple blocks, get a submap containing all the BlockLocation entries from offset to offset + length.
> 4b. Get all hosts that hold at least 80% of the maximum amount of the split's data hosted by any single host.
> 5. Add the split to a list of splits.
>
> What are the major changes in logic
> ===================================
> 1. Store BlockLocations in a Map instead of an array.
> 2. Call SHIMS.getLocationsWithOffSet() instead of getLocations().
> 3. The one-block case is checked by "if(offset + length <= start.getOffset() + start.getLength())" instead of "if((offset % blockSize) + length <= blockSize)".
>
> What is the effect on Complexity (Big O)
> ========================================
> 1. We add an O(n) loop to build a TreeMap from the array (each insertion costs O(log n)), but it is a one-time cost and is not repeated for each split.
> 2. In the one-block case, we get the block in O(log n) worst case, which was O(1) before.
> 3. Getting the submap is O(log n).
> 4. In the multiple-block case, building the list of hosts is O(m), which was O(n) before, with m <= n: previously we iterated over all the block locations, but now we iterate only over the blocks that fall within the range of offsets we need.
>
> What are the benefits of the change
> ===================================
> 1. With this fix, we do not depend on the layout of the blockLocations array returned by FileSystem to find the block corresponding to a given offset and blockSize.
> 2. Block lengths also need not be the same for all blocks on all FileSystems.
> 3. Previously we used blockSize for the one-block case and block.length for the multiple-block case, which is no longer so. We now determine the block from the actual offset and length of each block.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
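The lookup described in the fix can be sketched in plain Java. This is only an illustration of the idea and not the code from the attached patches: the class SplitHostsSketch, its Block type (a stand-in for Hadoop's BlockLocation), and the methods byOffset() and hostsFor() are hypothetical names, and the map is assumed to be keyed by each block's start offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Sketch only: offset-keyed block lookup with blocks of unequal size. */
final class SplitHostsSketch {

  /** Hypothetical stand-in for BlockLocation: start offset, length, replica hosts. */
  static final class Block {
    final long offset;
    final long length;
    final String[] hosts;

    Block(long offset, long length, String... hosts) {
      this.offset = offset;
      this.length = length;
      this.hosts = hosts;
    }
  }

  /** Step 1b: index the blocks by start offset (assumed key). Done once per file. */
  static TreeMap<Long, Block> byOffset(Block... blocks) {
    TreeMap<Long, Block> map = new TreeMap<>();
    for (Block b : blocks) {
      map.put(b.offset, b);
    }
    return map;
  }

  /** Steps 3, 4a and 4b: pick the hosts for the split [offset, offset + length). */
  static List<String> hostsFor(TreeMap<Long, Block> blocks, long offset, long length) {
    // Greatest start offset <= the split's offset, i.e. the block the split starts in.
    Map.Entry<Long, Block> startEntry = blocks.floorEntry(offset);
    if (startEntry == null) {
      return new ArrayList<>(); // the split starts before the first known block
    }
    Block start = startEntry.getValue();
    long end = offset + length;

    // One-block case: the split ends inside the block it starts in.
    if (end <= start.offset + start.length) {
      return new ArrayList<>(Arrays.asList(start.hosts));
    }

    // Multi-block case: visit only the blocks that start before the split ends.
    NavigableMap<Long, Block> range = blocks.subMap(startEntry.getKey(), true, end, false);
    Map<String, Long> bytesPerHost = new TreeMap<>(); // sorted for stable output
    long maxBytes = 0;
    for (Block b : range.values()) {
      long overlap = Math.min(end, b.offset + b.length) - Math.max(offset, b.offset);
      if (overlap <= 0) {
        continue;
      }
      for (String host : b.hosts) {
        long total = bytesPerHost.merge(host, overlap, Long::sum);
        maxBytes = Math.max(maxBytes, total);
      }
    }

    // Keep hosts holding at least 80% of what the best host holds.
    long threshold = (long) (maxBytes * 0.8);
    List<String> hosts = new ArrayList<>();
    for (Map.Entry<String, Long> e : bytesPerHost.entrySet()) {
      if (e.getValue() >= threshold) {
        hosts.add(e.getKey());
      }
    }
    return hosts;
  }

  public static void main(String[] args) {
    TreeMap<Long, Block> blocks = byOffset(
        new Block(0, 100, "a", "b"),
        new Block(100, 50, "b", "c"),
        new Block(150, 200, "c", "d"));
    System.out.println(hostsFor(blocks, 160, 20));  // [c, d]
    System.out.println(hostsFor(blocks, 50, 150));  // [b, c]
  }
}
```

With these unequal block sizes, an array index computed as offset/blockSize (160/100 = 1, assuming a nominal blockSize of 100) would select the block covering [100, 150), whereas floorEntry(160) finds the block that actually contains offset 160.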
[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-8038:
--------------------------
    Attachment: HIVE-8038.3.patch

+1 - Patch looks good.

For commit - .3.patch, removed a white space change & a javadoc.

{code}
 context.splits.add(new OrcSplit(file.getPath(), offset, length,
-hosts, fileMetaInfo, isOriginal, hasBase, deltas));
+hosts, fileMetaInfo, isOriginal, hasBase, deltas));
 }
...
- * @return TreeMap
+ * @return TreeMap
{code}
[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-8038:
--------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-8038:
--------------------------
    Status: Patch Available  (was: Open)

Resubmitting to run tests again.
[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-8038:
--------------------------
    Assignee: Pankit Thapar
[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pankit Thapar updated HIVE-8038:
--------------------------------
    Attachment: HIVE-8038.2.patch

Attached the fixed patch as per CR: https://reviews.apache.org/r/25521/
[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pankit Thapar updated HIVE-8038:
--------------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pankit Thapar updated HIVE-8038:
--------------------------------
    Attachment: HIVE-8038.patch