[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-18 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-8038:
--
  Resolution: Fixed
Release Note: HIVE-8038: Decouple ORC files split calculation logic from 
fixed-size block assumptions. (Pankit Thapar via Gopal V)
  Status: Resolved  (was: Patch Available)

Committed to trunk, thanks [~pankit]

> Decouple ORC files split calculation logic from Filesystem's get file 
> location implementation
> -
>
> Key: HIVE-8038
> URL: https://issues.apache.org/jira/browse/HIVE-8038
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 0.13.1
>Reporter: Pankit Thapar
>Assignee: Pankit Thapar
> Fix For: 0.14.0
>
> Attachments: HIVE-8038.2.patch, HIVE-8038.3.patch, HIVE-8038.patch
>
>
> What is the Current Logic
> ==
> 1. Get the file blocks from FileSystem.getFileBlockLocations(), which returns
> an array of BlockLocation.
> 2. In SplitGenerator.createSplit(), check whether the split spans one block or
> multiple blocks.
> 3. If the split spans just one block, use the array index (index =
> offset / blockSize) to get the hosts of the corresponding BlockLocation.
> 4. If the split spans multiple blocks, get all hosts that hold at least
> 80% of the largest amount of split data held by any single host.
> 5. Add the split to the list of splits.
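
The 80% rule in step 4 can be illustrated with a minimal sketch. It is not the Hive implementation: the class name, the method name, and the per-host byte map are hypothetical and only show how such a threshold could be applied once the bytes of the split stored on each host are known.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the "80% of the max" host selection rule.
final class HostSelectionSketch {
    // bytesPerHost maps a host name to the number of split bytes it stores.
    // fraction is the threshold relative to the best host (0.8 in the description).
    static List<String> preferredHosts(Map<String, Long> bytesPerHost, double fraction) {
        long max = 0;
        for (long bytes : bytesPerHost.values()) {
            max = Math.max(max, bytes);
        }
        double threshold = max * fraction;
        List<String> hosts = new ArrayList<>();
        for (Map.Entry<String, Long> e : bytesPerHost.entrySet()) {
            if (e.getValue() >= threshold) {
                hosts.add(e.getKey());
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        Map<String, Long> bytesPerHost = new LinkedHashMap<>();
        bytesPerHost.put("host1", 100L);
        bytesPerHost.put("host2", 85L);
        bytesPerHost.put("host3", 40L);
        System.out.println(preferredHosts(bytesPerHost, 0.8)); // [host1, host2]
    }
}
```

Here host3 is dropped because it holds less than 80 of the 100 bytes held by the best host.
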
> Issue with Current Logic
> =
> The lookup depends on how the FileSystem implementation lays out block
> locations. getFileBlockLocations() returns an array, and indexing directly
> into that array by offset / blockSize is only correct if the FileSystem
> makes all blocks the same size.
>
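
A small worked example may make the failure mode concrete. The sketch below is an assumption-laden illustration, not Hive code: the block layout (three blocks of 100, 50 and 100 bytes) and both method names are made up. It compares the index computed from a nominal blockSize with the index found from each block's actual start and length.

```java
// Hypothetical demo: why index = offset / blockSize breaks with uneven blocks.
final class ArrayIndexLookupDemo {
    // Stand-in for the old lookup, which assumes every block has blockSize bytes.
    static int indexByBlockSize(long offset, long blockSize) {
        return (int) (offset / blockSize);
    }

    // Linear scan that uses each block's real start offset and length.
    static int indexByActualExtent(long[] starts, long[] lengths, long offset) {
        for (int i = 0; i < starts.length; i++) {
            if (offset >= starts[i] && offset < starts[i] + lengths[i]) {
                return i;
            }
        }
        return -1; // offset is past the end of the file
    }

    public static void main(String[] args) {
        long[] starts = {0, 100, 150};   // assumed layout: the second block is short
        long[] lengths = {100, 50, 100};
        long offset = 160;
        System.out.println(indexByBlockSize(offset, 100));                 // 1
        System.out.println(indexByActualExtent(starts, lengths, offset));  // 2
    }
}
```

With uneven blocks the arithmetic index points at the second block, while offset 160 actually lies in the third.
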
> What is the Fix
> =
> 1a. Get the file blocks from FileSystem.getFileBlockLocations(), which returns
> an array of BlockLocation.
> 1b. Convert the array into a tree map keyed by block offset and return it
> through getLocationsWithOffSet().
> 2. In SplitGenerator.createSplit(), check whether the split spans one block or
> multiple blocks.
> 3. If the split spans just one block, use TreeMap.floorEntry(offset) to get the
> entry with the greatest key less than or equal to the split's offset, and
> take the corresponding hosts.
> 4a. If the split spans multiple blocks, get a submap containing the
> BlockLocation entries from offset to offset + length.
> 4b. Get all hosts that hold at least 80% of the largest amount of split data
> held by any single host.
> 5. Add the split to the list of splits.
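
The tree-map lookup in steps 1b, 3 and 4a can be sketched with java.util.TreeMap alone. This is a minimal sketch under stated assumptions, not the patch itself: Block is a hypothetical stand-in for Hadoop's BlockLocation, the method names are invented, and starting the submap at floorKey(offset) (so that the block containing the split's first byte is included) is an assumption about the intended range.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of offset-keyed block lookup.
final class BlockLookupSketch {
    // Stand-in for BlockLocation: start offset, length, and hosting nodes.
    static final class Block {
        final long offset;
        final long length;
        final String[] hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Step 1b: one O(n) pass that builds a map keyed by block start offset.
    static TreeMap<Long, Block> byOffset(Block[] blocks) {
        TreeMap<Long, Block> map = new TreeMap<>();
        for (Block b : blocks) {
            map.put(b.offset, b);
        }
        return map;
    }

    // Step 3: floorEntry returns the entry with the greatest key <= offset.
    static Block blockContaining(TreeMap<Long, Block> map, long offset) {
        Map.Entry<Long, Block> e = map.floorEntry(offset);
        return e == null ? null : e.getValue();
    }

    // Step 4a: view of the blocks that overlap [offset, offset + length).
    static NavigableMap<Long, Block> blocksOverlapping(
            TreeMap<Long, Block> map, long offset, long length) {
        Long first = map.floorKey(offset);
        if (first == null) {
            first = offset; // no block starts at or before offset
        }
        return map.subMap(first, true, offset + length, false);
    }
}
```

For blocks starting at 0, 100 and 150, blockContaining(map, 160) returns the block starting at 150, and blocksOverlapping(map, 120, 60) returns the two blocks starting at 100 and 150.
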
> What are the major changes in logic
> ==
> 1. Store BlockLocations in a Map instead of an array.
> 2. Call SHIMS.getLocationsWithOffSet() instead of getLocations().
> 3. The one-block case is checked by "if(offset + length <= start.getOffset() +
> start.getLength())" instead of "if((offset % blockSize) + length <=
> blockSize)".
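
The two one-block checks quoted in item 3 can be compared side by side. The sketch below is illustrative only: the class and method names are hypothetical, and the sample layout (blocks [0,100), [100,150), [150,250) with a nominal blockSize of 100) is an assumption chosen to show a case where the two checks disagree.

```java
// Hypothetical comparison of the old and new single-block checks.
final class SingleBlockCheck {
    // Old check: assumes every block is exactly blockSize bytes long.
    static boolean oldCheck(long offset, long length, long blockSize) {
        return (offset % blockSize) + length <= blockSize;
    }

    // New check: uses the actual start and length of the block holding offset.
    static boolean newCheck(long offset, long length, long blockStart, long blockLength) {
        return offset + length <= blockStart + blockLength;
    }

    public static void main(String[] args) {
        // A split covering [120, 170) starts in the short block [100, 150)
        // and ends in the next block, so it really spans two blocks.
        System.out.println(oldCheck(120, 50, 100));      // true
        System.out.println(newCheck(120, 50, 100, 50));  // false
    }
}
```

The old check reports a single block because it only looks at the nominal size; the new check sees that the split ends after the short block does.
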
> What is the effect on Complexity (Big O)
> =
> 1. We add an O(n) loop to build a TreeMap from the array, but it is a one-time
> cost and is not repeated for each split.
> 2. In the one-block case, we find the block in O(log n) worst case,
> which was O(1) before.
> 3. Getting the submap is O(log n).
> 4. In the multiple-block case, building the list of hosts is O(m), which
> was O(n) before, with m <= n: previously we iterated over all the block
> locations, but now we iterate only over the blocks in the range of offsets
> that we need.
> What are the benefits of the change
> ==
> 1. With this fix, we no longer depend on the position of a block in the array
> returned by FileSystem, or on blockSize, to find the block for a given offset.
> 2. Block lengths are not necessarily the same for all blocks on all
> FileSystems, and the new logic does not assume they are.
> 3. Previously we used blockSize for the one-block case and block.length for
> the multiple-block case. Now both cases use the actual offset and length of
> each block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-16 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-8038:
--
Attachment: HIVE-8038.3.patch

+1 - Patch looks good.

For the commit, .3.patch removes a whitespace-only change and a javadoc change:

{code}
  context.splits.add(new OrcSplit(file.getPath(), offset, length,
 -hosts, fileMetaInfo, isOriginal, hasBase, deltas));
 +hosts, fileMetaInfo, isOriginal, hasBase, deltas));
}
...
-   * @return TreeMap
+  * @return TreeMap
{code}

> Decouple ORC files split calculation logic from Filesystem's get file 
> location implementation
> -
>
> Key: HIVE-8038
> URL: https://issues.apache.org/jira/browse/HIVE-8038
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 0.13.1
>Reporter: Pankit Thapar
>Assignee: Pankit Thapar
> Fix For: 0.14.0
>
> Attachments: HIVE-8038.2.patch, HIVE-8038.3.patch, HIVE-8038.patch


[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-15 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-8038:
--
Status: Open  (was: Patch Available)

> Decouple ORC files split calculation logic from Filesystem's get file 
> location implementation
> -
>
> Key: HIVE-8038
> URL: https://issues.apache.org/jira/browse/HIVE-8038
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 0.13.1
>Reporter: Pankit Thapar
>Assignee: Pankit Thapar
> Fix For: 0.14.0
>
> Attachments: HIVE-8038.2.patch, HIVE-8038.patch


[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-15 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-8038:
--
Status: Patch Available  (was: Open)

Resubmitting to run tests again.

> Decouple ORC files split calculation logic from Filesystem's get file 
> location implementation
> -
>
> Key: HIVE-8038
> URL: https://issues.apache.org/jira/browse/HIVE-8038
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 0.13.1
>Reporter: Pankit Thapar
>Assignee: Pankit Thapar
> Fix For: 0.14.0
>
> Attachments: HIVE-8038.2.patch, HIVE-8038.patch


[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-15 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-8038:
--
Assignee: Pankit Thapar

> Decouple ORC files split calculation logic from Filesystem's get file 
> location implementation
> -
>
> Key: HIVE-8038
> URL: https://issues.apache.org/jira/browse/HIVE-8038
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 0.13.1
>Reporter: Pankit Thapar
>Assignee: Pankit Thapar
> Fix For: 0.14.0
>
> Attachments: HIVE-8038.2.patch, HIVE-8038.patch


[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-15 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-8038:

Attachment: HIVE-8038.2.patch

Attached the fixed patch as per CR : https://reviews.apache.org/r/25521/

> Decouple ORC files split calculation logic from Filesystem's get file 
> location implementation
> -
>
> Key: HIVE-8038
> URL: https://issues.apache.org/jira/browse/HIVE-8038
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 0.13.1
>Reporter: Pankit Thapar
> Fix For: 0.14.0
>
> Attachments: HIVE-8038.2.patch, HIVE-8038.patch


[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-10 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-8038:

Status: Patch Available  (was: Open)

> Decouple ORC files split calculation logic from Filesystem's get file 
> location implementation
> -
>
> Key: HIVE-8038
> URL: https://issues.apache.org/jira/browse/HIVE-8038
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 0.13.1
>Reporter: Pankit Thapar
> Fix For: 0.14.0
>
> Attachments: HIVE-8038.patch


[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-10 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-8038:

Attachment: HIVE-8038.patch

> Decouple ORC files split calculation logic from Filesystem's get file 
> location implementation
> -
>
> Key: HIVE-8038
> URL: https://issues.apache.org/jira/browse/HIVE-8038
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 0.13.1
>Reporter: Pankit Thapar
> Fix For: 0.14.0
>
> Attachments: HIVE-8038.patch