[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704617#action_12704617 ]
Zheng Shao commented on HIVE-352:
---------------------------------

hive-352-2009-4-30-4.patch: Thanks Yongqiang. I tried it and it works now.

* exec/Utilities.java
createRCFileWriter: parameter fs is never used.

* RCFile.java
Can you add some comments in the header to explain how you determine a "row split"? Also, what does this mean?
<li>A sync-marker every few <code>100</code> bytes or so.</li>
KeyBuffer.write(DataInput in, int length): length is never used. Is this method ever used? If not, remove it.
ColumnBuffer.getLength(): Let's remove the synchronized keyword for the sake of performance.
ColumnBuffer.ColumnBuffer(Class valClass): Let's remove valClass since it's never used.
Writer: how do you pass the column number from Hive to the configuration, and then to the RCFile.Writer?
{code}
+ this.columnNumber = conf.getInt(COLUMN_NUMBER_CONF_STR, 0);
{code}
Writer: Per Java convention, COLUMNS_BUFFER_SIZE should be columnsBufferSize.
{code}
+ this.COLUMNS_BUFFER_SIZE = conf.getInt(COLUMNS_BUFFER_SIZE_CONF_STR,
+     4 * 1024 * 1024);
{code}
Reader: add javadoc for "public synchronized boolean nextColumnsBatch()".

* RCFileRecordReader.java
Remove "synchronized" from all methods for the sake of performance:
{code}
+ protected synchronized boolean next(LongWritable key) throws IOException {
{code}

* TestRCFile.java
You might want to use Text.encode() instead of "String".getBytes(). At least we should use getBytes("UTF-8").

* ql/src/test/queries/clientpositive/columnarserde_create_shortcut.q
For all tests, please drop the created tables again at the end of the .q file. This helps make "show tables" in other .q files return a deterministic result.

* BytesRefArrayWritable.java
What's the difference between get() and unCheckedGet()? The code is the same.
resetValid: Let's call it resize.
set(..): the condition should be valid <= index:
{code}
+ if (valid < index)
+   valid = index + 1;
{code}

* BytesRefWritable.java
Shall we rename getBytes() to getBytesCopy()?
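To illustrate the off-by-one in set(..): with "valid < index", setting an element at index == valid never grows the valid count. Below is a hedged, minimal sketch of that logic — a hypothetical RefArray class, not Hive's actual BytesRefArrayWritable — showing why the comparison must be "valid <= index":

```java
import java.util.Arrays;

// Hypothetical minimal model of a "valid count" array, for illustration only.
public class RefArray {
    private byte[][] refs = new byte[4][];
    private int valid = 0; // number of valid entries, i.e. logical size

    public void set(int index, byte[] value) {
        if (index >= refs.length) {
            // Grow the backing array when the index is out of range.
            refs = Arrays.copyOf(refs, Math.max(index + 1, refs.length * 2));
        }
        refs[index] = value;
        // The review's fix: "valid < index" would skip the common case of
        // appending at index == valid, leaving the logical size too small.
        if (valid <= index) {
            valid = index + 1;
        }
    }

    public int size() {
        return valid;
    }
}
```

With the buggy "valid < index" comparison, appending at index 0 to an empty array would leave size() == 0.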
* ColumnarStruct.java
init(...): Instead of clearing out the object, we can do set(null, 0, 0), which has the same effect but doesn't need to create a new LazyObject for the next row.

Current:
{code}
+ if (field.length > 0) {
+   if (fields[fieldIndex] == null)
+     fields[fieldIndex] = LazyFactory.createLazyObject(fieldTypeInfos
+         .get(fieldIndex));
+   fields[fieldIndex].init(cachedByteArrayRef[fieldIndex], field
+       .getStart(), field.getLength());
+ } else if (fields[fieldIndex] != null
+     && fields[fieldIndex].getObject() != null) {
+   fields[fieldIndex] = null;
+ }
{code}
Let's change it to (please add a test):
{code}
+ if (fields[fieldIndex] == null)
+   fields[fieldIndex] = LazyFactory.createLazyObject(fieldTypeInfos
+       .get(fieldIndex));
+ if (field.length > 0) {
+   fields[fieldIndex].init(cachedByteArrayRef[fieldIndex], field
+       .getStart(), field.getLength());
+ } else {
+   fields[fieldIndex].init(null, 0, 0);
+ }
{code}

* Can you add one more test for adding/deleting columns? The value of the newly added columns should be NULL. RCFile should also be able to ignore the extra columns in the data.
{code}
CREATE TABLE xxx (a string, b string) STORED AS RCFILE;
INSERT OVERWRITE TABLE xxx ...;
ALTER TABLE xxx ADD COLUMNS (c string);
SELECT * FROM xxx;
ALTER TABLE xxx REPLACE COLUMNS (a string);
SELECT * FROM xxx;
{code}

* General comments:
1. We don't need to put "this." in the code unless there is a local variable with the same name.
2. Please add more javadoc. The general rule for Hive is that every public class/method should have javadoc, except getters and setters.
** Very important: Eclipse automatically generates a lot of empty javadocs for variables etc. Please remove them if you don't have a comment for that variable; otherwise "ant -Dhadoop.version=0.17.0 javadoc" will produce warnings. Please make sure to remove all warnings.
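As a hedged illustration of javadoc that satisfies these rules (a hypothetical ByteRef class, not the actual BytesRefWritable): member references in @see and @link need a '#' prefix, @return tags need descriptive text, and @param names must match the actual parameter names.

```java
// Hypothetical example class showing warning-free javadoc style.
public class ByteRef {
    private byte[] data;
    private int start;
    private int len;

    /**
     * Points this reference at a range of {@code newData}.
     *
     * @param newData the backing byte array
     * @param offset  start of the referenced range
     * @param length  number of bytes in the range
     * @see #getLength()
     */
    public void set(byte[] newData, int offset, int length) {
        this.data = newData;
        this.start = offset;
        this.len = length;
    }

    /**
     * @return the number of bytes in the referenced range
     */
    public int getLength() {
        return len;
    }
}
```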
{code}
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/serde/src/java/org/apache/hadoop/hive/serde2/columnar/BytesRefWritable.java:95: warning - @return tag has no arguments.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/serde/src/java/org/apache/hadoop/hive/serde2/columnar/BytesRefWritable.java:122: warning - Tag @see: missing '#': "set(byte[] newData, int offset, int length)"
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/serde/src/java/org/apache/hadoop/hive/serde2/columnar/BytesRefWritable.java:105: warning - Tag @see: missing '#': "readFields(DataInput in)"
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java:38: warning - Tag @link: missing '#': "init(BytesRefArrayWritable cols)"
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:522: warning - @param argument "keyClass" is not a parameter name.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:522: warning - @param argument "valClass" is not a parameter name.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:149: warning - Tag @see: malformed: "CompressionCodec, SequenceFile"
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:149: warning - Tag @see: reference not found: CompressionCodec, SequenceFile
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:1214: warning - @return tag has no arguments.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:1214: warning - Tag @link: missing '#': "nextColumnBatch()"
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:1214: warning - Tag @link: can't find nextColumnBatch() in org.apache.hadoop.hive.ql.io.RCFile.Reader
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:1288: warning - @return tag has no arguments.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:1288: warning - @return tag cannot be used in method with void return type.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:543: warning - @param argument "keyClass" is not a parameter name.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:543: warning - @param argument "valClass" is not a parameter name.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:564: warning - @param argument "keyClass" is not a parameter name.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:564: warning - @param argument "valClass" is not a parameter name.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:591: warning - @param argument "keyClass" is not a parameter name.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java:591: warning - @param argument "valClass" is not a parameter name.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileOutputFormat.java:105: warning - @return tag has no arguments.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileOutputFormat.java:105: warning - @param argument "tableInfo" is not a parameter name.
[javadoc] /data/users/zshao/tools/352-trunk-apache-hive/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java:745: warning - @param argument "dest" is not a parameter name.
{code}

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch, hive-352-2009-4-23.patch, hive-352-2009-4-27.patch, hive-352-2009-4-30-2.patch, hive-352-2009-4-30-3.patch, hive-352-2009-4-30-4.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS. I think it will need some review and refactoring to port it to Hive.
> Any thoughts?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.