[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
He Yongqiang updated HIVE-352:
------------------------------

    Attachment: 4-22 performace2.txt
                hive-352-2009-4-22-2.patch

Following Zheng's suggestions, hive-352-2009-4-22-2.patch makes several improvements over hive-352-2009-4-22.patch:
1) Each row of test data is now randomly generated as string bytes (the previous patch produced binary bytes).
2) A correctness parameter was added to the performance test (in PerformTestRCFileAndSeqFile) so we can verify that what we read is what we wrote.

4-22 performace2.txt adds more detailed test results for three configurations:

1. Local file system, using bulk decompression in RCFile->ValueBuffer->readFields(), like:
{noformat}
bufferRef.write(valueIn, columnBlockPlainDataLength);
{noformat}
2. Local file system, not using bulk decompression in RCFile->ValueBuffer->readFields(), like:
{noformat}
while (deflateFilter.available() > 0)
  bufferRef.write(valueIn, 1);
{noformat}
3. DistributedFileSystem with bulk decompression; the tests were still run on my local machine.

Here are the brief results (for more detail, please see the attached 4-22 performace2.txt):

1. (LocalFileSystem) Bulk decompression in RCFile->ValueBuffer->readFields(), with noise added between the two RCFile reads and after the write to avoid disk-cache effects.

||column number||RCFile size||RCFile read 1 column||RCFile read 2 columns||RCFile read all columns||SequenceFile size||SequenceFile read all||
|10|11501112|259|181|498|13046020|7002|
|25|28725817|233|269|1082|32246409|16539|
|40|45940679|261|301|1698|51436799|25415|

2. (LocalFileSystem) No bulk decompression in RCFile->ValueBuffer->readFields(), with noise added between the two RCFile reads and after the write to avoid disk-cache effects.

||column number||RCFile size||RCFile read 1 column||RCFile read 2 columns||RCFile read all columns||SequenceFile size||SequenceFile read all||
|10|11501112|1804|3262|15956|13046020|6927|
|25|28725817|1761|3310|39492|32246409|15983|
|40|45940679|1843|3386|63759|51436799|25256|
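The only functional difference between configurations 1 and 2 is whether decompressed bytes are consumed in one bulk call or one byte per call. The following standalone sketch is my own illustration of that contrast using plain java.util.zip streams, not Hive's RCFile or Hadoop's codec classes; the class name, buffer sizes, and the readFully helper are invented for the example:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class BulkVsByteRead {

    // Read exactly 'total' decompressed bytes, at most 'chunk' bytes per read() call.
    static byte[] readFully(InputStream in, int total, int chunk) throws IOException {
        byte[] out = new byte[total];
        int off = 0;
        while (off < total) {
            int n = in.read(out, off, Math.min(chunk, total - off));
            if (n < 0) break;
            off += n;
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // String-like random bytes, as in the patched performance test.
        byte[] plain = new byte[1 << 20];
        Random r = new Random(42);
        for (int i = 0; i < plain.length; i++) {
            plain[i] = (byte) ('a' + r.nextInt(26));
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DeflaterOutputStream dos = new DeflaterOutputStream(bos);
        dos.write(plain);
        dos.close();
        byte[] compressed = bos.toByteArray();

        // Bulk: one large read per call, analogous to
        // bufferRef.write(valueIn, columnBlockPlainDataLength).
        long t0 = System.nanoTime();
        byte[] bulk = readFully(
            new InflaterInputStream(new ByteArrayInputStream(compressed)),
            plain.length, 64 * 1024);
        long bulkNs = System.nanoTime() - t0;

        // Byte-at-a-time: analogous to the loop writing 1 byte per call.
        long t1 = System.nanoTime();
        byte[] single = readFully(
            new InflaterInputStream(new ByteArrayInputStream(compressed)),
            plain.length, 1);
        long singleNs = System.nanoTime() - t1;

        System.out.println("outputs match: " + Arrays.equals(bulk, single)
            + ", bulk ns: " + bulkNs + ", per-byte ns: " + singleNs);
    }
}
```

Both paths decompress the same bytes; the per-byte path just pays the per-call overhead once per byte, which is consistent with the large read-all-columns gap between tables 1 and 2.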
3. (DistributedFileSystem) Bulk decompression in RCFile->ValueBuffer->readFields(), with noise added between the two RCFile reads and after the write to avoid disk-cache effects.

||column number||RCFile size||RCFile read 1 column||RCFile read 2 columns||RCFile read all columns||SequenceFile size||SequenceFile read all||
|10|11501112|2381|3516|9898|13046020|18053|
|25|28725817|3754|5254|22521|32246409|43258|
|40|45940679|5597|8225|40304|51436799|69278|

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22
> progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch,
> hive-352-2009-4-17.patch, hive-352-2009-4-19.patch,
> hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch,
> HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive currently does a great job on raw row-oriented storage. In this issue, we will
> enhance Hive to support column-based storage.
> Actually, we have already done some work on column-based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.