Re: Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-06-02 Thread Tomasz Nykiel

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/
---

(Updated 2011-06-02 20:36:48.205733)


Review request for hive.


Changes
---

- Fixed the issues pointed out in the review.
- Renamed the metric from uncompressedSize to rawDataSize.


Summary
---

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
collect statistics about the number of rows per partition/table. 
Other statistics (e.g., total table/partition size) are derived from the file 
system.

We introduce a new feature for collecting information about the sizes of 
uncompressed data, to be able to determine the efficiency of compression.
On top of adding the new statistic collected, this patch extends the stats 
collection mechanism, so any new statistics could be added easily.

1. serializer/deserializer classes are amended to accommodate collecting sizes 
of uncompressed data, when serializing/deserializing objects.
We support:

Columnar SerDe
LazySimpleSerDe
LazyBinarySerDe

For other SerDe classes the uncompressed size will be 0.

2. StatsPublisher / StatsAggregator interfaces are extended to support 
multi-stats collection for both JDBC and HBase.

3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and 
TableScanOperator respectively are extended to support multi-stats collection.

(2) and (3) enable easy extension for other types of statistics.

4. Collecting uncompressed size can be disabled by setting:

hive.stats.collect.uncompressedsize = false
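
To illustrate points (2) and (3), here is a minimal sketch of what publishing and aggregating several statistics per task could look like. It is not the interface from the patch: the class InMemoryStatsStore, its method names, and the statistic name "numRows" are assumptions made for illustration; only "rawDataSize" is the metric name used in this revision.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a multi-statistic publisher/aggregator pair. */
final class InMemoryStatsStore {
    // row id (one per task output) -> (statistic name -> value)
    private final Map<String, Map<String, Long>> rows = new HashMap<>();

    /** Publishes several named statistics for one row id in a single call. */
    void publish(String rowId, Map<String, Long> stats) {
        rows.computeIfAbsent(rowId, k -> new HashMap<>()).putAll(stats);
    }

    /** Sums one named statistic over all row ids that start with the given prefix. */
    long aggregate(String keyPrefix, String statName) {
        long total = 0;
        for (Map.Entry<String, Map<String, Long>> e : rows.entrySet()) {
            if (e.getKey().startsWith(keyPrefix)) {
                total += e.getValue().getOrDefault(statName, 0L);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        InMemoryStatsStore store = new InMemoryStatsStore();
        Map<String, Long> task0 = new HashMap<>();
        task0.put("numRows", 10L);        // hypothetical statistic name
        task0.put("rawDataSize", 400L);
        store.publish("tbl/part=1/000000_0", task0);

        Map<String, Long> task1 = new HashMap<>();
        task1.put("numRows", 5L);
        task1.put("rawDataSize", 200L);
        store.publish("tbl/part=1/000001_0", task1);

        System.out.println(store.aggregate("tbl/part=1/", "numRows"));
        System.out.println(store.aggregate("tbl/part=1/", "rawDataSize"));
    }
}
```

With this shape, adding another statistic only means adding another map entry on the publishing side and another aggregate() call on the reading side.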


This addresses bug HIVE-2185.
https://issues.apache.org/jira/browse/HIVE-2185


Diffs (updated)
-

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1130791 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1130791 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1130791 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1130791 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1130791 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1130791 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1130791 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1130791 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1130791 
  trunk/hbase-handler/src/test/queries/hbase_stats2.q PRE-CREATION 
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1130791 
  trunk/hbase-handler/src/test/results/hbase_stats2.q.out PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1130791 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1130791 
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1130791 
  trunk/ql/src/test/resul

Re: Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-31 Thread Ning Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/#review725
---



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java


For better debugging, print out the key and the valid stats keys. 
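
A minimal sketch of the kind of message meant here, assuming the supported statistics are kept in a set. The class name, the method names, and the two statistic names are hypothetical and not taken from the patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper that validates a statistic key and builds a debug message. */
final class StatKeyCheckSketch {
    // Insertion-ordered so the message lists the keys in a stable order.
    static final Set<String> SUPPORTED = Collections.unmodifiableSet(
            new LinkedHashSet<>(Arrays.asList("numRows", "rawDataSize")));

    static boolean isValidStatistic(String key) {
        return SUPPORTED.contains(key);
    }

    /** Includes both the offending key and the full list of valid keys. */
    static String invalidKeyMessage(String key) {
        return "Invalid statistic: " + key + ", supported stats: " + SUPPORTED;
    }

    public static void main(String[] args) {
        String key = "uncompressedSize";
        if (!isValidStatistic(key)) {
            System.out.println(invalidKeyMessage(key));
        }
    }
}
```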



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java


currentValue + amount will result in object creation. This is very 
expensive in this case, since this function is called for every input row. 
Instead of the immutable class Long, LongWritable may be a better choice. 
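
To make the cost concrete: with a Map<String, Long>, an update such as map.put(k, map.get(k) + amount) unboxes the old value and boxes the sum into a new Long (outside the small cached range). A mutable holder is updated in place instead. The sketch below uses a tiny MutableLong class as a stand-in for Hadoop's LongWritable so that it runs without Hadoop on the classpath; all names in it are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-row counter that avoids allocating a new object on every update. */
final class MutableCounterSketch {
    /** Minimal mutable holder, standing in for Hadoop's LongWritable. */
    static final class MutableLong {
        private long value;
        long get() { return value; }
        void add(long amount) { value += amount; }
    }

    private final Map<String, MutableLong> counters = new HashMap<>();

    /** Adds to a named counter in place; allocates only the first time a name is seen. */
    void increase(String name, long amount) {
        MutableLong c = counters.get(name);
        if (c == null) {
            c = new MutableLong();
            counters.put(name, c);
        }
        c.add(amount);
    }

    /** Returns the current value, or 0 for a counter that was never increased. */
    long get(String name) {
        MutableLong c = counters.get(name);
        return c == null ? 0L : c.get();
    }

    public static void main(String[] args) {
        MutableCounterSketch stat = new MutableCounterSketch();
        for (int row = 0; row < 1000; row++) {
            stat.increase("rawDataSize", 42L);  // called once per input row
        }
        System.out.println(stat.get("rawDataSize"));
    }
}
```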



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java


Also consider using LongWritable rather than Long. 



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java


LongWritable. 



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java


Can you print the stack trace to LOG rather than to console?



trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java


The declaration should use the interface List rather than the implementation ArrayList.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java


Better to use Map rather than HashMap in the declaration.
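
A small sketch of the convention, with hypothetical field and class names: the fields are typed by interface, and the concrete class appears only where the object is constructed, so the implementation can change without touching callers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder showing interface-typed declarations. */
final class InterfaceDeclarationSketch {
    // Declared as List / Map; ArrayList / HashMap are named only at construction.
    private final List<String> columnNames = new ArrayList<>();
    private final Map<String, Long> stats = new HashMap<>();

    void addColumn(String name) {
        columnNames.add(name);
    }

    void putStat(String name, long value) {
        stats.put(name, value);
    }

    /** Callers see only the interface types. */
    List<String> getColumnNames() {
        return columnNames;
    }

    Map<String, Long> getStats() {
        return stats;
    }
}
```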



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java


Can you change it to use Utilities.executeWithRetry() as well?
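
For context, a generic retry wrapper looks roughly like the sketch below. This is not the signature of Utilities.executeWithRetry(), which is not shown in this thread; RetrySketch, callWithRetry, and the linear back-off are assumptions made for illustration.

```java
import java.util.concurrent.Callable;

/** Hypothetical retry helper; not Hive's Utilities.executeWithRetry(). */
final class RetrySketch {
    /**
     * Calls the action until it succeeds or maxAttempts is reached, sleeping
     * baseWaitMillis * attempt between failed attempts. Rethrows the last failure.
     */
    static <T> T callWithRetry(Callable<T> action, int maxAttempts, long baseWaitMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(baseWaitMillis * attempt);  // linear back-off
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("transient failure");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Routing the aggregator's statements through one helper like this keeps the retry policy in a single place instead of repeating try/catch loops around each JDBC call.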



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java


Let's also put the comment here as in other statements. 



trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out


The uncompressed size is smaller than the totalSize. Can you double check 
whether this is because of the overhead (headers etc.) in the file format or 
because of a bug in the stats?



trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java


Please add some comments here on what it is used for.



trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java


Would it make sense to add the size of field delimiters as well? And if we 
know the record delimiter (for most record readers it is a newline), we can add 
that too. This will make the stats more accurately reflect the real 
uncompressed size stored in the file.
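
As a rough illustration of the arithmetic being suggested, assuming a delimited row with one field delimiter between adjacent fields and one record delimiter at the end. The class and method names are hypothetical, and this is not how ColumnarStruct computes its size.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical estimate of the raw size of one delimited row. */
final class RawSizeSketch {
    /**
     * Sums the field bytes, adds one field delimiter between adjacent fields,
     * and adds one record delimiter for the row.
     */
    static long rawRowSize(byte[][] fields, int fieldDelimBytes, int recordDelimBytes) {
        long size = 0;
        for (byte[] f : fields) {
            size += f.length;
        }
        if (fields.length > 1) {
            size += (long) (fields.length - 1) * fieldDelimBytes;
        }
        return size + recordDelimBytes;
    }

    public static void main(String[] args) {
        byte[][] row = {
            "238".getBytes(StandardCharsets.UTF_8),
            "val_238".getBytes(StandardCharsets.UTF_8)
        };
        // 3 + 7 field bytes, 1 field delimiter, 1 newline
        System.out.println(rawRowSize(row, 1, 1));  // prints 12
    }
}
```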



trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java


It may be simpler to just use == rather than ! and ^. Also, consider an 
assert rather than returning null.
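
The exact expression in the patch is not shown here, so the following only sketches the general point with hypothetical names: for two booleans, negated XOR and == are the same test.

```java
/** Hypothetical comparison of two ways to test that two booleans agree. */
final class BooleanEqualitySketch {
    /** True when a and b have the same value, written with ! and ^. */
    static boolean viaXor(boolean a, boolean b) {
        return !(a ^ b);
    }

    /** The same test written with ==, which reads more directly. */
    static boolean viaEquals(boolean a, boolean b) {
        return a == b;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean a : values) {
            for (boolean b : values) {
                System.out.println(a + ", " + b + " -> "
                        + viaXor(a, b) + " / " + viaEquals(a, b));
            }
        }
    }
}
```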



trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java


same as above.


- Ning


On 2011-05-26 21:27:34, Tomasz Nykiel wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/785/
> ---
> 
> (Updated 2011-05-26 21:27:34)
> 
> 
> Review request for hive.
> 
> 
> Summary
> ---
> 
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
> collect statistics about the number of rows per partition/table. 
> Other statistics (e.g., total table/partition size) are derived from the file 
> system.
> 
> We introduce a new feature for collecting information about the sizes of 
> uncompressed data, to be able to determine the efficiency of compression.
> On top of adding the new statistic collected, this patch extends the stats 
> collection mechanism, so any new statistics could be added easily.
> 
> 1. serializer/deserializer classes are amended to accommodate collecting 
> sizes of uncompressed data, when serializing/deserializing objects.
> We support:
> 
> Columnar SerDe
> LazySimpleSerDe
> LazyBinarySerDe
> 
> For other SerDe classes the uncompressed size will be 0.
> 
> 2. StatsPublisher / StatsAggregator interfaces are extended to support 
> multi-stats collection for both JDBC and HBase.
> 
> 3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and 
> TableScanOperator respectively are extended to support multi-stats collection.
> 
> (2) and (3) enable easy extension for other types of statistics.
> 
> 4. Collecting uncompressed size can be disabled by setting:
> 
> hive.stats.collect.uncompressedsize = false
> 
> 
> This addresses bug HIVE-

Re: Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-26 Thread Tomasz Nykiel

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/
---

(Updated 2011-05-26 21:27:34.475653)


Review request for hive.


Changes
---

-Fixed HBase stats publishing


Summary
---

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
collect statistics about the number of rows per partition/table. 
Other statistics (e.g., total table/partition size) are derived from the file 
system.

We introduce a new feature for collecting information about the sizes of 
uncompressed data, to be able to determine the efficiency of compression.
On top of adding the new statistic collected, this patch extends the stats 
collection mechanism, so any new statistics could be added easily.

1. serializer/deserializer classes are amended to accommodate collecting sizes 
of uncompressed data, when serializing/deserializing objects.
We support:

Columnar SerDe
LazySimpleSerDe
LazyBinarySerDe

For other SerDe classes the uncompressed size will be 0.

2. StatsPublisher / StatsAggregator interfaces are extended to support 
multi-stats collection for both JDBC and HBase.

3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and 
TableScanOperator respectively are extended to support multi-stats collection.

(2) and (3) enable easy extension for other types of statistics.

4. Collecting uncompressed size can be disabled by setting:

hive.stats.collect.uncompressedsize = false
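
As an example of the "efficiency of compression" use case, a consumer of these statistics could compare the on-disk size with the uncompressed size. This is a sketch under assumptions: the class and method names are hypothetical, and treating 0 as "unknown" follows from the note above that unsupported SerDe classes report 0.

```java
/** Hypothetical helper that derives a compression ratio from two size statistics. */
final class CompressionRatioSketch {
    /**
     * Returns on-disk bytes divided by uncompressed bytes. Returns NaN when the
     * uncompressed size is not positive, e.g. for a SerDe that reports 0.
     */
    static double compressionRatio(long totalSize, long uncompressedSize) {
        if (uncompressedSize <= 0) {
            return Double.NaN;
        }
        return (double) totalSize / uncompressedSize;
    }

    public static void main(String[] args) {
        System.out.println(compressionRatio(250L, 1000L));  // prints 0.25
        System.out.println(compressionRatio(250L, 0L));     // prints NaN
    }
}
```

A ratio above 1 would mean the stored files are larger than the data they hold, which is one way to notice format overhead or a miscounted statistic.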


This addresses bug HIVE-2185.
https://issues.apache.org/jira/browse/HIVE-2185


Diffs (updated)
-

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1128070 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1128070 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1128070 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1128070 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1128070 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1128070 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1128070 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1128070 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1128070 
  trunk/hbase-handler/src/test/queries/hbase_stats2.q PRE-CREATION 
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1128070 
  trunk/hbase-handler/src/test/results/hbase_stats2.q.out PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1128070 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1128070 
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/combine2.q.out 1128070 
  trunk/ql/src/test/results/clien

Re: Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-26 Thread Tomasz Nykiel


> On 2011-05-26 21:12:30, Ning Zhang wrote:
> > trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java,
> >  line 100
> > 
> >
> > should be >= here

Yes.


> On 2011-05-26 21:12:30, Ning Zhang wrote:
> > trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java,
> >  line 82
> > 
> >
> > Shouldn't isValidStatics() take "key" as a parameter rather than 
> > "rowID"? "key" should indicate which statistic this is, right?

Yes. It was a bug; I already fixed it once I ran the HBase JUnit tests :)


- Tomasz


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/#review719
---


On 2011-05-26 02:52:55, Tomasz Nykiel wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/785/
> ---
> 
> (Updated 2011-05-26 02:52:55)
> 
> 
> Review request for hive.
> 
> 
> Summary
> ---
> 
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
> collect statistics about the number of rows per partition/table. 
> Other statistics (e.g., total table/partition size) are derived from the file 
> system.
> 
> We introduce a new feature for collecting information about the sizes of 
> uncompressed data, to be able to determine the efficiency of compression.
> On top of adding the new statistic collected, this patch extends the stats 
> collection mechanism, so any new statistics could be added easily.
> 
> 1. serializer/deserializer classes are amended to accommodate collecting 
> sizes of uncompressed data, when serializing/deserializing objects.
> We support:
> 
> Columnar SerDe
> LazySimpleSerDe
> LazyBinarySerDe
> 
> For other SerDe classes the uncompressed size will be 0.
> 
> 2. StatsPublisher / StatsAggregator interfaces are extended to support 
> multi-stats collection for both JDBC and HBase.
> 
> 3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and 
> TableScanOperator respectively are extended to support multi-stats collection.
> 
> (2) and (3) enable easy extension for other types of statistics.
> 
> 4. Collecting uncompressed size can be disabled by setting:
> 
> hive.stats.collect.uncompressedsize = false
> 
> 
> This addresses bug HIVE-2185.
> https://issues.apache.org/jira/browse/HIVE-2185
> 
> 
> Diffs
> -
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756 
>   trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1127756 
>   trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1127756 
>   trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1127756 
>   trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1127756 
>   trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1127756 
>   trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1127756 
>   trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1127756 
>   trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
>   trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756 
>   trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1127756 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1127756 
>   
> trunk/ql/src/java/org/apache/hadoop/hive/ql/sta

Re: Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-26 Thread Ning Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/#review719
---



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java


Shouldn't isValidStatics() take "key" as a parameter rather than 
"rowID"? "key" should indicate which statistic this is, right?



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java


should be >= here


- Ning



Re: Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-26 Thread Tomasz Nykiel

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/#review718
---



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java


This should be:

// Default to 0 when the SerDe does not report statistics.
long current = 0;
SerDeStats st = this.deserializer.getSerDeStats();
if (st != null) {
  current = st.getUncompressedSize();
}

since we do not hard-code a check for which SerDe class is in use, and some of 
the unsupported classes return null.


- Tomasz



Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-25 Thread Tomasz Nykiel

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/
---

Review request for hive.


Summary
---

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
collect statistics about the number of rows per partition/table. 
Other statistics (e.g., total table/partition size) are derived from the file 
system.

We introduce a new feature for collecting information about the sizes of 
uncompressed data, to be able to determine the efficiency of compression.
On top of adding the new statistic collected, this patch extends the stats 
collection mechanism, so any new statistics could be added easily.

1. serializer/deserializer classes are amended to accommodate collecting sizes 
of uncompressed data, when serializing/deserializing objects.
We support:

Columnar SerDe
LazySimpleSerDe
LazyBinarySerDe

For other SerDe classes the uncompressed size will be 0.

2. StatsPublisher / StatsAggregator interfaces are extended to support 
multi-stats collection for both JDBC and HBase.

3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and 
TableScanOperator respectively are extended to support multi-stats collection.

(2) and (3) enable easy extension for other types of statistics.

4. Collecting uncompressed size can be disabled by setting:

hive.stats.collect.uncompressedsize = false
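
A minimal sketch of how an operator could gate the extra work on this setting. It uses java.util.Properties in place of HiveConf so that it runs on its own, and it assumes that collection is on unless the property is set to false; the class and method names are hypothetical.

```java
import java.util.Properties;

/** Hypothetical reader for the collection switch; not the HiveConf accessor. */
final class StatsConfigSketch {
    static final String KEY = "hive.stats.collect.uncompressedsize";

    /** Returns whether collection is enabled; a missing property counts as enabled (assumed default). */
    static boolean collectUncompressedSize(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(KEY, "true").trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(collectUncompressedSize(conf));  // prints true
        conf.setProperty(KEY, "false");
        System.out.println(collectUncompressedSize(conf));  // prints false
    }
}
```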


This addresses bug HIVE-2185.
https://issues.apache.org/jira/browse/HIVE-2185


Diffs
-

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756 
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1127756 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1127756 
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/combine2.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/merge3.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/merge4.q.out