[ https://issues.apache.org/jira/browse/HIVE-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757667#action_12757667 ]

He Yongqiang commented on HIVE-819:
-----------------------------------

>>1) in RCFile.c:307 it seems decompress() can be called multiple times, and the 
>>function doesn't check whether the data is already decompressed and, if so, 
>>return early. This may not cause a problem in this diff, since the current 
>>callers check whether the data is decompressed before calling decompress(), but 
>>it is a public function and nothing prevents future callers from calling it 
>>twice. So it may be better to implement this check inside the decompress() 
>>function itself.

In RCFile, the only path into LazyDecompressionCallbackImpl's decompress() is 
through BytesRefWritable. If we check whether the data is already decompressed 
inside BytesRefWritable, do we still need to add the same check in 
LazyDecompressionCallbackImpl? 
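To illustrate the idea being discussed, a minimal sketch of an idempotence guard inside decompress() itself (the class and field names here are illustrative, not Hive's actual API; the codec call is a stand-in):

```java
// Hypothetical sketch: a lazily decompressed block whose decompress()
// is safe to call more than once, regardless of whether callers check first.
public class LazyDecompressGuardDemo {

    static class LazyBlock {
        private final byte[] compressed;
        private byte[] uncompressed;
        private boolean decompressed = false;

        LazyBlock(byte[] compressed) {
            this.compressed = compressed;
        }

        // Guarded decompress: only the first call does the work;
        // later calls return the cached result.
        byte[] decompress() {
            if (!decompressed) {
                // Stand-in for the real codec invocation.
                uncompressed = compressed.clone();
                decompressed = true;
            }
            return uncompressed;
        }
    }

    public static void main(String[] args) {
        LazyBlock block = new LazyBlock(new byte[] {1, 2, 3});
        byte[] first = block.decompress();
        byte[] second = block.decompress(); // no second decompression
        if (first != second) {
            throw new AssertionError("decompress ran more than once");
        }
        System.out.println("guard ok");
    }
}
```

With a guard like this inside the callback, the check in BytesRefWritable becomes a redundant (if cheap) fast path rather than a correctness requirement.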

>>2) Also, in the same decompress() function, it seems it doesn't work correctly 
>>when the column is not compressed. Can you double-check it?
From my tests, it works correctly for uncompressed data.

>>3)
Added tests:
{noformat}
DROP TABLE rcfileTableLazyDecompress;
CREATE table rcfileTableLazyDecompress (key STRING, value STRING) STORED AS 
RCFile;

FROM src
INSERT OVERWRITE TABLE rcfileTableLazyDecompress SELECT src.key, src.value 
LIMIT 10;

SELECT key, value FROM rcfileTableLazyDecompress where key > 238;

SELECT key, value FROM rcfileTableLazyDecompress where key > 238 and key < 400;

SELECT key, count(1) FROM rcfileTableLazyDecompress where key > 238 group by 
key;

set mapred.output.compress=true;
set hive.exec.compress.output=true;

FROM src
INSERT OVERWRITE TABLE rcfileTableLazyDecompress SELECT src.key, src.value 
LIMIT 10;

SELECT key, value FROM rcfileTableLazyDecompress where key > 238;

SELECT key, value FROM rcfileTableLazyDecompress where key > 238 and key < 400;

SELECT key, count(1) FROM rcfileTableLazyDecompress where key > 238 group by 
key;

set mapred.output.compress=false;
set hive.exec.compress.output=false;

DROP TABLE rcfileTableLazyDecompress;
{noformat}

Ning, thanks for your suggestions! Did I miss tests for any of your comments? 
As for the check to avoid calling decompress() multiple times, what do you think 
about moving the check from BytesRefWritable to LazyDecompressionCallbackImpl? 
There would still be some minor check duplication.

> Add lazy decompress ability to RCFile
> -------------------------------------
>
>                 Key: HIVE-819
>                 URL: https://issues.apache.org/jira/browse/HIVE-819
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>             Fix For: 0.5.0
>
>         Attachments: hive-819-2009-9-12.patch
>
>
> This is especially useful for filter scanning. 
> For example, for the query 'select a, b, c from table_rc_lazydecompress where 
> a>1;' we only need to decompress the block data of columns b and c when some 
> row's column 'a' in that block satisfies the filter condition.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
