[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701837#action_12701837 ]
He Yongqiang commented on HIVE-352:
-----------------------------------

Thanks, Zheng.

>> 0. Did you try that with hadoop 0.17.0? "ant -Dhadoop.version=0.17.0 test" etc.

Yes.

>> 1. Can you add your tests to ant, or post the testing scripts so that everybody can easily reproduce the test results that you have got?

I will do that with the next patch.

>> 2. For DistributedFileSystem, how big is the cluster? Is the file (the file size is small, so it's clearly a single block) local?

The cluster has six nodes. The file is not local: the test was run on my local machine and used HDFS.

>> 3. It seems SequenceFile's compression is not as good as RCFile's, although the data is the same and also random. What is the exact record format in SequenceFile? Did you put delimiters, or did you put the lengths of the Strings?

Yes, it stores the lengths of the Strings. However, RCFile also stores the string lengths.

>> The approach of storing compressed data at creation and doing bulk decompression at reading is not practical, because it is very easy to run out of memory.

Yes, I encountered an OutOfMemory error, so I added a check in RCFile.Writer's append:

{noformat}
if ((columnBufferSize + (this.bufferedRecords * this.columnNumber * 2) > COLUMNS_BUFFER_SIZE)
    || (this.bufferedRecords >= this.RECORD_INTERVAL)) {
  flushRecords();
}
{noformat}

>> We've done BULK, and it showed great performance (1.6s to read and decompress a 40MB local file), but I suspect the compression ratio will be lower than NONBULK. Can you compare the compression ratio of BULK and NONBULK, given different buffer sizes and column numbers?

BULK and NONBULK refer to decompression modes, so they apply only to reads. They have nothing to do with writes, so I expect they will not affect the compression ratio.
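To make the flush condition above concrete, here is a minimal, self-contained sketch of the buffered-append logic. The class and field names (ColumnBufferSketch, flushCount) and the threshold values are assumptions chosen for illustration, not the actual RCFile.Writer implementation; only the shape of the flush condition follows the quoted snippet.

```java
// Hypothetical sketch of RCFile.Writer-style column buffering.
// Thresholds are illustrative assumptions, not Hive's real defaults.
public class ColumnBufferSketch {
    static final int COLUMNS_BUFFER_SIZE = 4 * 1024 * 1024; // byte budget for buffered columns
    static final int RECORD_INTERVAL = 1000;                // max records per row group

    final int columnNumber;  // columns per record
    int columnBufferSize;    // bytes currently buffered across all columns
    int bufferedRecords;     // records appended since the last flush
    int flushCount;          // row groups written so far

    public ColumnBufferSketch(int columnNumber) {
        this.columnNumber = columnNumber;
    }

    // Append one record whose serialized column values total recordBytes.
    public void append(int recordBytes) {
        columnBufferSize += recordBytes;
        bufferedRecords++;
        // Flush when either the byte budget (including an approximated
        // 2-byte-per-value length overhead) or the record-count budget is
        // exhausted. Bounding the buffer this way is what avoids the
        // OutOfMemory error mentioned above: data is compressed and
        // written out one row group at a time.
        if ((columnBufferSize + bufferedRecords * columnNumber * 2 > COLUMNS_BUFFER_SIZE)
                || (bufferedRecords >= RECORD_INTERVAL)) {
            flushRecords();
        }
    }

    // Stand-in for compressing and writing the buffered row group.
    void flushRecords() {
        flushCount++;
        columnBufferSize = 0;
        bufferedRecords = 0;
    }
}
```

With 8 columns and 100-byte records, the byte budget is far from exhausted after 1000 records, so the record-interval condition is the one that triggers the first flush.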
> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
> Column-based storage has been proven a better storage layout for OLAP. Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I think it will need some review and refactoring to port it to Hive.
> Any thoughts?

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.