[jira] [Commented] (HIVE-2192) Stats table schema incompatible after HIVE-2185

2011-06-05 Thread Tomasz Nykiel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044709#comment-13044709
 ] 

Tomasz Nykiel commented on HIVE-2192:
-

Thanks.

 Stats table schema incompatible after HIVE-2185
 ---

 Key: HIVE-2192
 URL: https://issues.apache.org/jira/browse/HIVE-2192
 Project: Hive
  Issue Type: Bug
Reporter: Ning Zhang
Assignee: Tomasz Nykiel
 Fix For: 0.8.0

 Attachments: HIVE-2192.patch


 HIVE-2185 introduced a new column in the intermediate stats table. This 
 introduces incompatibility between old and new branches (multiple branches 
 could be deployed in production): the old branch will not work with the new 
 schema, and the new branch will not work with the old schema. A solution 
 would be to rename the stats table (requires a code change) or use a 
 different database name (requires a hive-default.xml conf change).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-06-03 Thread Tomasz Nykiel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043237#comment-13043237
 ] 

Tomasz Nykiel commented on HIVE-2185:
-

I ran all tests. All quantities are the same as before; only the name of the 
metric has changed.
Thanks.

 extend table statistics to store the size of uncompressed data (+extend 
 interfaces for collecting other types of statistics)
 

 Key: HIVE-2185
 URL: https://issues.apache.org/jira/browse/HIVE-2185
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers, Statistics
Reporter: Tomasz Nykiel
Assignee: Tomasz Nykiel
 Attachments: HIVE-2185.1.patch, HIVE-2185.2.patch, HIVE-2185.patch


 Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
 collect statistics about the number of rows per partition/table. Other 
 statistics (e.g., total table/partition size) are derived from the file 
 system. 
 Here, we want to collect information about the sizes of uncompressed data, to 
 be able to determine the efficiency of compression.
 Currently, a large part of the statistics collection mechanism is hardcoded and 
 not easily extensible to other statistics.
 On top of adding the new statistic collected, it would be desirable to extend 
 the collection mechanism, so any new statistics could be added easily.



[jira] [Updated] (HIVE-2192) Stats table schema incompatible after HIVE-2185

2011-06-03 Thread Tomasz Nykiel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Nykiel updated HIVE-2192:


Attachment: HIVE-2192.patch

- Changes the table name used for collecting intermediate statistics by 
  JDBCStatsPublisher
- Removes an empty TestStatsPublisher.java

 Stats table schema incompatible after HIVE-2185
 ---

 Key: HIVE-2192
 URL: https://issues.apache.org/jira/browse/HIVE-2192
 Project: Hive
  Issue Type: Bug
Reporter: Ning Zhang
Assignee: Tomasz Nykiel
 Attachments: HIVE-2192.patch


 HIVE-2185 introduced a new column in the intermediate stats table. This 
 introduces incompatibility between old and new branches (multiple branches 
 could be deployed in production): the old branch will not work with the new 
 schema, and the new branch will not work with the old schema. A solution 
 would be to rename the stats table (requires a code change) or use a 
 different database name (requires a hive-default.xml conf change).



Review Request: Stats table schema incompatible after HIVE-2185

2011-06-03 Thread Tomasz Nykiel

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/851/
---

Review request for hive, Hairong Kuang and Ning Zhang.


Summary
---

HIVE-2185 introduces new statistics collected by StatsPublisher, which causes a 
change in the schema of the table used for collecting intermediate statistics 
by JDBCStatsPublisher.
This patch changes the table name, to avoid conflicts with the previous version.

This patch also removes an empty JUnit Java file.
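
The idea of renaming the table can be pictured with a minimal sketch: derive the intermediate table name from a schema version, so that branches with different schemas never share a table. The class name, the method, and the suffix format are hypothetical and are not taken from JDBCStatsSetupConstants.java; only the base name PARTITION_STAT_TABLE appears elsewhere in this archive.

```java
// Hypothetical sketch: names are illustrative, not Hive's actual constants.
final class StatsTableNaming {
    // Base table name as quoted in the HIVE-2144 review request.
    static final String BASE_NAME = "PARTITION_STAT_TABLE";

    private StatsTableNaming() {}

    // Append a schema version so old and new branches use separate tables.
    static String tableNameFor(int schemaVersion) {
        if (schemaVersion < 1) {
            throw new IllegalArgumentException("schemaVersion must be >= 1");
        }
        return schemaVersion == 1 ? BASE_NAME : BASE_NAME + "_V" + schemaVersion;
    }

    public static void main(String[] args) {
        System.out.println(tableNameFor(1));
        System.out.println(tableNameFor(2));
    }
}
```

With such a scheme, a branch that still expects the old schema keeps using the old table, while the new branch creates and uses its own.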


This addresses bug HIVE-2192.
https://issues.apache.org/jira/browse/HIVE-2192


Diffs
-

  
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1131108

Diff: https://reviews.apache.org/r/851/diff


Testing
---

JDBCStatsPublisherEnhanced JUnit test.


Thanks,

Tomasz



[jira] [Updated] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-06-02 Thread Tomasz Nykiel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Nykiel updated HIVE-2185:


Attachment: HIVE-2185.2.patch

Fixed some minor issues.
Renamed the metric to rawDataSize

 extend table statistics to store the size of uncompressed data (+extend 
 interfaces for collecting other types of statistics)
 

 Key: HIVE-2185
 URL: https://issues.apache.org/jira/browse/HIVE-2185
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers, Statistics
Reporter: Tomasz Nykiel
Assignee: Tomasz Nykiel
 Attachments: HIVE-2185.1.patch, HIVE-2185.2.patch, HIVE-2185.patch


 Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
 collect statistics about the number of rows per partition/table. Other 
 statistics (e.g., total table/partition size) are derived from the file 
 system. 
 Here, we want to collect information about the sizes of uncompressed data, to 
 be able to determine the efficiency of compression.
 Currently, a large part of the statistics collection mechanism is hardcoded and 
 not easily extensible to other statistics.
 On top of adding the new statistic collected, it would be desirable to extend 
 the collection mechanism, so any new statistics could be added easily.



Re: Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-26 Thread Tomasz Nykiel

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/#review718
---



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java
https://reviews.apache.org/r/785/#comment1457

should be:

long current = 0;
SerDeStats st = this.deserializer.getSerDeStats();
if (st != null) {
  current = st.getUncompressedSize();
}

since we do not check explicitly which SerDe class is in use, and some of the 
unsupported classes return null
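
As a self-contained illustration of the same null check, here is a minimal sketch. SerDeStatsStub and DeserializerStub are stand-in types invented for this example; they only mimic the two calls used in the snippet above and are not Hive's real classes.

```java
// Stand-in types for illustration only; the real SerDeStats and
// deserializer classes live in Hive and are not reproduced here.
final class SerDeStatsStub {
    private final long uncompressedSize;

    SerDeStatsStub(long uncompressedSize) {
        this.uncompressedSize = uncompressedSize;
    }

    long getUncompressedSize() {
        return uncompressedSize;
    }
}

interface DeserializerStub {
    // May return null when the SerDe does not support statistics.
    SerDeStatsStub getSerDeStats();
}

final class UncompressedSizeReader {
    private UncompressedSizeReader() {}

    // Returns 0 when the deserializer reports no statistics.
    static long currentSize(DeserializerStub deserializer) {
        long current = 0;
        SerDeStatsStub st = deserializer.getSerDeStats();
        if (st != null) {
            current = st.getUncompressedSize();
        }
        return current;
    }
}
```

A deserializer that returns null contributes 0 to the running total instead of raising a NullPointerException.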


- Tomasz


On 2011-05-26 02:52:55, Tomasz Nykiel wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/785/
 ---
 
 (Updated 2011-05-26 02:52:55)
 
 
 Review request for hive.
 
 
 Summary
 ---
 
 Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
 collect statistics about the number of rows per partition/table. 
 Other statistics (e.g., total table/partition size) are derived from the file 
 system.
 
 We introduce a new feature for collecting information about the sizes of 
 uncompressed data, to be able to determine the efficiency of compression.
 On top of adding the new statistic collected, this patch extends the stats 
 collection mechanism, so any new statistics could be added easily.
 
 1. serializer/deserializer classes are amended to accommodate collecting 
 sizes of uncompressed data, when serializing/deserializing objects.
 We support:
 
 Columnar SerDe
 LazySimpleSerDe
 LazyBinarySerDe
 
 For other SerDe classes the uncompressed size will be 0.
 
 2. StatsPublisher / StatsAggregator interfaces are extended to support 
 multi-stats collection for both JDBC and HBase.
 
 3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and 
 TableScanOperator respectively are extended to support multi-stats collection.
 
 (2) and (3) enable easy extension for other types of statistics.
 
 4. Collecting uncompressed size can be disabled by setting:
 
 hive.stats.collect.uncompressedsize = false
 
 
 This addresses bug HIVE-2185.
 https://issues.apache.org/jira/browse/HIVE-2185
 
 
 Diffs
 -
 
   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756 
   
 trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 
 1127756 
   
 trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java
  1127756 
   
 trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java
  1127756 
   trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 
 1127756 
   
 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
  1127756 
   
 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java
  1127756 
   
 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java
  1127756 
   
 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java
  PRE-CREATION 
   trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756 
   trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 
 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 
 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 
 1127756 
   
 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
  1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 
 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 
 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 
 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 
 1127756 
   
 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java
  1127756 
   
 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
  1127756 
   
 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java
  1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java 
 PRE-CREATION 
   trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 
 1127756 
   
 trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java
  PRE-CREATION 
   trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1127756

[jira] [Updated] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-26 Thread Tomasz Nykiel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Nykiel updated HIVE-2185:


Attachment: HIVE-2185.1.patch

 extend table statistics to store the size of uncompressed data (+extend 
 interfaces for collecting other types of statistics)
 

 Key: HIVE-2185
 URL: https://issues.apache.org/jira/browse/HIVE-2185
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers, Statistics
Reporter: Tomasz Nykiel
Assignee: Tomasz Nykiel
 Attachments: HIVE-2185.1.patch, HIVE-2185.patch


 Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
 collect statistics about the number of rows per partition/table. Other 
 statistics (e.g., total table/partition size) are derived from the file 
 system. 
 Here, we want to collect information about the sizes of uncompressed data, to 
 be able to determine the efficiency of compression.
 Currently, a large part of the statistics collection mechanism is hardcoded and 
 not easily extensible to other statistics.
 On top of adding the new statistic collected, it would be desirable to extend 
 the collection mechanism, so any new statistics could be added easily.



Re: Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-26 Thread Tomasz Nykiel


 On 2011-05-26 21:12:30, Ning Zhang wrote:
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java,
   line 100
  https://reviews.apache.org/r/785/diff/1/?file=19586#file19586line100
 
  should be = here

Yes.


 On 2011-05-26 21:12:30, Ning Zhang wrote:
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java,
   line 82
  https://reviews.apache.org/r/785/diff/1/?file=19585#file19585line82
 
  Shouldn't isValidStatics() take key as a parameter rather than 
  rowID? The key should indicate which statistic this is, right?

Yes. It was a bug; I already fixed it once I ran the HBase JUnit test :)


- Tomasz


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/#review719
---


On 2011-05-26 02:52:55, Tomasz Nykiel wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/785/
 ---
 
 (Updated 2011-05-26 02:52:55)
 
 
 Review request for hive.
 
 
 Summary
 ---
 
 Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
 collect statistics about the number of rows per partition/table. 
 Other statistics (e.g., total table/partition size) are derived from the file 
 system.
 
 We introduce a new feature for collecting information about the sizes of 
 uncompressed data, to be able to determine the efficiency of compression.
 On top of adding the new statistic collected, this patch extends the stats 
 collection mechanism, so any new statistics could be added easily.
 
 1. serializer/deserializer classes are amended to accommodate collecting 
 sizes of uncompressed data, when serializing/deserializing objects.
 We support:
 
 Columnar SerDe
 LazySimpleSerDe
 LazyBinarySerDe
 
 For other SerDe classes the uncompressed size will be 0.
 
 2. StatsPublisher / StatsAggregator interfaces are extended to support 
 multi-stats collection for both JDBC and HBase.
 
 3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and 
 TableScanOperator respectively are extended to support multi-stats collection.
 
 (2) and (3) enable easy extension for other types of statistics.
 
 4. Collecting uncompressed size can be disabled by setting:
 
 hive.stats.collect.uncompressedsize = false
 
 
 This addresses bug HIVE-2185.
 https://issues.apache.org/jira/browse/HIVE-2185
 
 
 Diffs
 -
 
   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756 
   
 trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 
 1127756 
   
 trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java
  1127756 
   
 trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java
  1127756 
   trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 
 1127756 
   
 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
  1127756 
   
 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java
  1127756 
   
 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java
  1127756 
   
 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java
  PRE-CREATION 
   trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756 
   trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 
 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 
 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 
 1127756 
   
 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
  1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 
 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 
 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 
 1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 
 1127756 
   
 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java
  1127756 
   
 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
  1127756 
   
 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java
  1127756 
   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java 
 PRE-CREATION

[jira] [Created] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-25 Thread Tomasz Nykiel (JIRA)
extend table statistics to store the size of uncompressed data (+extend 
interfaces for collecting other types of statistics)


 Key: HIVE-2185
 URL: https://issues.apache.org/jira/browse/HIVE-2185
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers, Statistics
Reporter: Tomasz Nykiel
Assignee: Tomasz Nykiel


Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
collect statistics about the number of rows per partition/table. Other 
statistics (e.g., total table/partition size) are derived from the file system. 

Here, we want to collect information about the sizes of uncompressed data, to 
be able to determine the efficiency of compression.
Currently, a large part of the statistics collection mechanism is hardcoded and 
not easily extensible to other statistics.
On top of adding the new statistic collected, it would be desirable to extend 
the collection mechanism, so any new statistics could be added easily.



[jira] [Updated] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-25 Thread Tomasz Nykiel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Nykiel updated HIVE-2185:


Attachment: HIVE-2185.patch

 extend table statistics to store the size of uncompressed data (+extend 
 interfaces for collecting other types of statistics)
 

 Key: HIVE-2185
 URL: https://issues.apache.org/jira/browse/HIVE-2185
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers, Statistics
Reporter: Tomasz Nykiel
Assignee: Tomasz Nykiel
 Attachments: HIVE-2185.patch


 Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
 collect statistics about the number of rows per partition/table. Other 
 statistics (e.g., total table/partition size) are derived from the file 
 system. 
 Here, we want to collect information about the sizes of uncompressed data, to 
 be able to determine the efficiency of compression.
 Currently, a large part of the statistics collection mechanism is hardcoded and 
 not easily extensible to other statistics.
 On top of adding the new statistic collected, it would be desirable to extend 
 the collection mechanism, so any new statistics could be added easily.



Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-25 Thread Tomasz Nykiel

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/
---

Review request for hive.


Summary
---

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
collect statistics about the number of rows per partition/table. 
Other statistics (e.g., total table/partition size) are derived from the file 
system.

We introduce a new feature for collecting information about the sizes of 
uncompressed data, to be able to determine the efficiency of compression.
On top of adding the new statistic collected, this patch extends the stats 
collection mechanism, so any new statistics could be added easily.
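
To make "efficiency of compression" concrete, one plausible measure is the ratio of uncompressed bytes to on-disk bytes. The following is a minimal sketch under that assumption; the class, method, and parameter names are hypothetical and do not come from the patch.

```java
// Hypothetical helper: one possible way to express compression efficiency.
final class CompressionEfficiency {
    private CompressionEfficiency() {}

    // Ratio of uncompressed bytes to on-disk bytes.
    // Values above 1.0 mean the stored data is smaller than the raw data.
    static double ratio(long uncompressedBytes, long onDiskBytes) {
        if (uncompressedBytes < 0 || onDiskBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (double) uncompressedBytes / onDiskBytes;
    }
}
```

For example, a partition holding 1000 bytes of uncompressed data in 250 bytes on disk has a ratio of 4.0.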

1. serializer/deserializer classes are amended to accommodate collecting sizes 
of uncompressed data, when serializing/deserializing objects.
We support:

Columnar SerDe
LazySimpleSerDe
LazyBinarySerDe

For other SerDe classes the uncompressed size will be 0.

2. StatsPublisher / StatsAggregator interfaces are extended to support 
multi-stats collection for both JDBC and HBase.

3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and 
TableScanOperator respectively are extended to support multi-stats collection.

(2) and (3) enable easy extension for other types of statistics.

4. Collecting uncompressed size can be disabled by setting:

hive.stats.collect.uncompressedsize = false
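
Points (2) and (3) can be pictured with a small in-memory sketch: a publisher that accepts several named statistics per key in one call, and an aggregator that sums one named statistic over a key prefix. InMemoryStatsStore, its methods, and the statistic names used with it are assumptions made for illustration; they are not the StatsPublisher / StatsAggregator signatures from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a multi-stat publisher/aggregator pair.
final class InMemoryStatsStore {
    // key -> (statistic name -> value)
    private final Map<String, Map<String, Long>> rows = new HashMap<>();

    // Publish several statistics for one key in a single call.
    void publish(String key, Map<String, Long> stats) {
        rows.computeIfAbsent(key, k -> new HashMap<>()).putAll(stats);
    }

    // Sum one statistic over every key that starts with the given prefix.
    // Keys that never published this statistic contribute 0.
    long aggregate(String keyPrefix, String statName) {
        long sum = 0;
        for (Map.Entry<String, Map<String, Long>> e : rows.entrySet()) {
            if (e.getKey().startsWith(keyPrefix)) {
                sum += e.getValue().getOrDefault(statName, 0L);
            }
        }
        return sum;
    }
}
```

Because statistics are addressed by name, adding a new one in this sketch needs no change to the store itself, which is the kind of extensibility the summary describes.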


This addresses bug HIVE-2185.
https://issues.apache.org/jira/browse/HIVE-2185


Diffs
-

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756 
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1127756 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1127756 
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/combine2.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/merge3.q.out 1127756 
  

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

2011-05-24 Thread Tomasz Nykiel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038757#comment-13038757
 ] 

Tomasz Nykiel commented on HIVE-2144:
-

Thanks :)

 reduce workload generated by JDBCStatsPublisher
 ---

 Key: HIVE-2144
 URL: https://issues.apache.org/jira/browse/HIVE-2144
 Project: Hive
  Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Tomasz Nykiel
 Fix For: 0.8.0

 Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch


 In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID 
 was inserted by another task (most likely a speculative or previously 
 failed task). Depending on whether the ID is there, an INSERT or UPDATE query is 
 issued. So there are basically two queries per row inserted into the 
 intermediate stats table. This workload could be reduced to 1/2 if we insert 
 it anyway (it is very rare that IDs are duplicated) and use a different SQL 
 query in the aggregation phase to dedup the ID (e.g., using group-by and 
 max()). The benefit is that even though the aggregation query is more 
 expensive, it is only run once per query. 
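
The dedup-at-aggregation idea can be sketched in memory: keep the maximum row count per ID (the "group-by and max()" step), then sum those maxima. The class and field names are hypothetical, and the SQL in the comment is only an approximation of what such an aggregation query might look like, not a query from the patch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory analogue of deduplicating IDs during aggregation.
final class DedupAggregation {
    // One published row: a task ID plus its row count.
    static final class Row {
        final String id;
        final long rowCount;

        Row(String id, long rowCount) {
            this.id = id;
            this.rowCount = rowCount;
        }
    }

    private DedupAggregation() {}

    // Roughly: SELECT SUM(m) FROM (SELECT MAX(ROW_COUNT) AS m FROM t GROUP BY ID)
    static long totalRows(List<Row> rows) {
        Map<String, Long> maxPerId = new HashMap<>();
        for (Row r : rows) {
            // Duplicate IDs collapse to their largest reported value.
            maxPerId.merge(r.id, r.rowCount, Long::max);
        }
        long total = 0;
        for (long v : maxPerId.values()) {
            total += v;
        }
        return total;
    }
}
```

Publishing stays a single blind INSERT per row; the extra grouping cost is paid once, when the statistics are aggregated.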



[jira] [Updated] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

2011-05-23 Thread Tomasz Nykiel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Nykiel updated HIVE-2144:


Attachment: HIVE-2144.2.patch

 reduce workload generated by JDBCStatsPublisher
 ---

 Key: HIVE-2144
 URL: https://issues.apache.org/jira/browse/HIVE-2144
 Project: Hive
  Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Tomasz Nykiel
 Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch


 In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID 
 was inserted by another task (most likely a speculative or previously 
 failed task). Depending on whether the ID is there, an INSERT or UPDATE query is 
 issued. So there are basically two queries per row inserted into the 
 intermediate stats table. This workload could be reduced to 1/2 if we insert 
 it anyway (it is very rare that IDs are duplicated) and use a different SQL 
 query in the aggregation phase to dedup the ID (e.g., using group-by and 
 max()). The benefit is that even though the aggregation query is more 
 expensive, it is only run once per query. 



Re: Review Request: HIVE-2144 reduce workload generated by JDBCStatsPublisher

2011-05-20 Thread Tomasz Nykiel

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/#review702
---



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
https://reviews.apache.org/r/765/#comment1402

Yes. That's correct.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1403

ok.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1404

I will amend the test cases to aggregate over prefixes. I will also add one 
simple test case to aggregate over exact match.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1405

The original value inserted in line 120 is 200. Neither 100, nor 150 should 
change the values. 



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1406

As discussed before, I will improve the test cases to aggregate over 
prefixes.


- Tomasz


On 2011-05-19 23:14:26, Tomasz Nykiel wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/765/
 ---
 
 (Updated 2011-05-19 23:14:26)
 
 
 Review request for hive.
 
 
 Summary
 ---
 
 Currently, the JDBCStatsPublisher executes two queries per inserted row of 
 statistics: a first query to check if the ID was inserted by another task, and 
 a second query to insert a new row or update the existing one.
 The update case occurs very rarely, since duplicates most likely originate from 
 speculative or failed tasks.
 
 Currently the schema of the stat table is the following:
 
 PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have 
 any integrity constraints declared.
 
 We amend it to:
 
 PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).
 
 HIVE-2144 improves on performance by greedily performing the insertion 
 statement.
 Then instead of executing two queries per row inserted, we can execute one 
 INSERT query.
 In the case of a primary key constraint violation, we perform a single UPDATE 
 query.
 The UPDATE query needs to check whether the currently inserted 
 stats are newer than the ones already in the table.
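
The insert-first flow can be sketched with an in-memory stand-in for a table whose ID column is a primary key. Everything here is hypothetical (class, exception, and method names), and "newer" is modeled as "larger row count", which is an assumption consistent with the review reply that 100 and 150 should not overwrite an existing 200. The SQL in the comments is illustrative, not copied from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a stats table with ID as PRIMARY KEY.
final class GreedyInsertTable {
    static final class DuplicateKeyException extends Exception {
        DuplicateKeyException(String id) {
            super("duplicate key: " + id);
        }
    }

    private final Map<String, Long> rows = new HashMap<>();
    int statementsExecuted = 0;

    // Like: INSERT INTO t (ID, ROW_COUNT) VALUES (?, ?)
    private void insert(String id, long rowCount) throws DuplicateKeyException {
        statementsExecuted++;
        if (rows.containsKey(id)) {
            throw new DuplicateKeyException(id);
        }
        rows.put(id, rowCount);
    }

    // Like: UPDATE t SET ROW_COUNT = ? WHERE ID = ? AND ROW_COUNT < ?
    private void updateIfLarger(String id, long rowCount) {
        statementsExecuted++;
        Long existing = rows.get(id);
        if (existing != null && existing < rowCount) {
            rows.put(id, rowCount);
        }
    }

    // Greedy publish: one statement in the common case, two on a duplicate ID.
    void publish(String id, long rowCount) {
        try {
            insert(id, rowCount);
        } catch (DuplicateKeyException e) {
            updateIfLarger(id, rowCount);
        }
    }

    long rowCount(String id) {
        return rows.get(id);
    }
}
```

In the common case of a fresh ID only one statement runs; the second statement is needed only when the key already exists.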
 
 
 This addresses bug HIVE-2144.
 https://issues.apache.org/jira/browse/HIVE-2144
 
 
 Diffs
 -
 
   
   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140 
   trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION 
 
 Diff: https://reviews.apache.org/r/765/diff
 
 
 Testing
 ---
 
 TestStatsPublisher JUnit test:
 - basic behaviour
 - multiple updates
 - cleanup of the statistics table after aggregation
 
 Standalone testing on the cluster.
 - insert/analyze queries over non-partitioned/partitioned tables
 
 NOTE. For the correct behaviour, the primary_key index needs to be created, 
 or the PARTITION_STAT_TABLE table dropped - which triggers creation of the 
 table with the constraint declared.
 
 
 Thanks,
 
 Tomasz
 




[jira] [Updated] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

2011-05-20 Thread Tomasz Nykiel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Nykiel updated HIVE-2144:


Attachment: HIVE-2144.1.patch

Fixed after revision 1.

 reduce workload generated by JDBCStatsPublisher
 ---

 Key: HIVE-2144
 URL: https://issues.apache.org/jira/browse/HIVE-2144
 Project: Hive
  Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Tomasz Nykiel
 Attachments: HIVE-2144.1.patch, HIVE-2144.patch


 In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID 
 was inserted by another task (most likely a speculative or previously 
 failed task). Depending on whether the ID is there, an INSERT or UPDATE query 
 is issued. So there are basically two queries per row inserted into the 
 intermediate stats table. This workload could be halved if we insert the row 
 anyway (it is very rare that IDs are duplicated) and use a different SQL 
 query in the aggregation phase to dedup the ID (e.g., using group-by and 
 max()). The benefit is that, even though the aggregation query is more 
 expensive, it is only run once per query.
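
The dedup-at-aggregation idea described above could look roughly like the sketch below. It is only an illustration, not code from Hive: it uses Python's built-in sqlite3 module as a stand-in for the JDBC connection, and the helper name aggregate_row_count, the sample IDs, and the assumption that aggregation sums ROW_COUNT over IDs sharing a prefix are all hypothetical.

```python
import sqlite3


def aggregate_row_count(conn, id_prefix):
    """Sum ROW_COUNT over all IDs with the given prefix, counting each ID once."""
    # Inner query: collapse duplicate IDs, keeping the largest ROW_COUNT per ID.
    # Outer query: sum the deduplicated values.
    cur = conn.execute(
        "SELECT SUM(D.ROW_COUNT) FROM ("
        "  SELECT ID, MAX(ROW_COUNT) AS ROW_COUNT"
        "  FROM PARTITION_STAT_TBL WHERE ID LIKE ? GROUP BY ID"
        ") D",
        (id_prefix + "%",),
    )
    return cur.fetchone()[0]


# No constraint on ID: publishers insert blindly, so duplicates can appear.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PARTITION_STAT_TBL (ID VARCHAR(255), ROW_COUNT BIGINT)")
conn.executemany(
    "INSERT INTO PARTITION_STAT_TBL (ID, ROW_COUNT) VALUES (?, ?)",
    [("job1/task0", 100), ("job1/task0", 100), ("job1/task1", 50)],
)
total = aggregate_row_count(conn, "job1/")
```

Here the duplicated row for job1/task0 is counted once, so the publisher never needs a SELECT before its INSERT; the extra GROUP BY cost is paid only in the single aggregation query.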



[jira] [Updated] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

2011-05-19 Thread Tomasz Nykiel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Nykiel updated HIVE-2144:


Attachment: HIVE-2144.patch



Review Request: HIVE-2144 reduce workload generated by JDBCStatsPublisher

2011-05-19 Thread Tomasz Nykiel

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/
---

Review request for hive.


Summary
---

Currently, the JDBCStatsPublisher executes two queries per inserted row of 
statistics: a first query to check whether the ID was already inserted by 
another task, and a second query to insert a new row or update the existing 
one.
The update case occurs very rarely, since duplicates most likely originate 
from speculative or failed tasks.

Currently the schema of the stat table is the following:

PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have 
any integrity constraints declared.

We amend it to:

PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).

HIVE-2144 improves performance by greedily performing the INSERT statement.
Instead of executing two queries per row inserted, we then execute a single 
INSERT query.
In the case of a primary key constraint violation, we perform a single UPDATE 
query.
The UPDATE query must check whether the stats currently being inserted are 
newer than the ones already in the table.


This addresses bug HIVE-2144.
https://issues.apache.org/jira/browse/HIVE-2144


Diffs
-

  
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION

Diff: https://reviews.apache.org/r/765/diff


Testing
---

TestStatsPublisher JUnit test:
- basic behaviour
- multiple updates
- cleanup of the statistics table after aggregation

Standalone testing on the cluster.
- insert/analyze queries over non-partitioned/partitioned tables

NOTE: For correct behaviour, either the primary key index needs to be created 
on PARTITION_STAT_TABLE, or the table needs to be dropped, which triggers 
re-creation of the table with the constraint declared.
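
As a rough illustration of the drop-and-recreate option in the note above, the sketch below checks whether an existing table lacks the primary key and, if so, drops it so that it is re-created with the constraint. This is not the patch's code: it uses Python's built-in sqlite3 module instead of JDBC, the helper name ensure_stats_table is hypothetical, and PRAGMA table_info is SQLite-specific (MySQL and Derby expose this metadata differently).

```python
import sqlite3

CREATE_SQL = (
    "CREATE TABLE IF NOT EXISTS PARTITION_STAT_TABLE "
    "(ID VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT)"
)


def ensure_stats_table(conn):
    """Drop a legacy table that has no primary key, then (re)create it with one."""
    # Each row is (cid, name, type, notnull, dflt_value, pk); pk > 0 marks key columns.
    columns = conn.execute("PRAGMA table_info(PARTITION_STAT_TABLE)").fetchall()
    has_pk = any(col[5] for col in columns)
    if columns and not has_pk:
        # Old schema without the constraint: drop it so it is created afresh.
        conn.execute("DROP TABLE PARTITION_STAT_TABLE")
    conn.execute(CREATE_SQL)


# Simulate a deployment that still has the old, unconstrained table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PARTITION_STAT_TABLE (ID VARCHAR(255), ROW_COUNT BIGINT)")
ensure_stats_table(conn)
```

Dropping the table discards whatever intermediate rows it holds, so a step like this would only make sense while no jobs are publishing statistics.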


Thanks,

Tomasz



[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

2011-05-18 Thread Tomasz Nykiel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035493#comment-13035493
 ] 

Tomasz Nykiel commented on HIVE-2144:
-

Currently the schema of the stat table is the following:

PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have 
any integrity constraints declared.

We can amend it to:

PARTITION_STAT_TABLE ( ID VARCHAR(255) UNIQUE , ROW_COUNT BIGINT ).

Then, instead of executing two queries per row inserted, we can execute one 
INSERT query, as we do currently.
If the insert violates the integrity constraint enforced by the unique index, 
which surfaces as an exception we can catch, we perform a single UPDATE query.
The UPDATE query must check whether the stats currently being inserted are 
newer than the ones already in the table:

UPDATE PARTITION_STAT_TBL SET ROW_COUNT = new_value
WHERE ID = rowID AND
(0) new_value >
(1) (SELECT TEMP.ROW_COUNT FROM
(2) (SELECT ROW_COUNT FROM PARTITION_STAT_TBL WHERE ID = rowID) TEMP )

--(0) is a condition that checks whether the newly inserted value is greater 
than the one we already have.
--(1) and (2) are a workaround for MySQL, which does not allow a subquery to 
refer to the table being updated. Here, we basically materialize the value 
that we need for comparison.
--(1) should theoretically have LIMIT 1 to choose exactly one tuple; however, 
Derby does not support it, and given the unique constraint and the fact that 
the insert failed, there exists exactly one tuple matching the ID predicate.

To summarize, for non-existing rows, only one INSERT query will be executed 
instead of two.
For existing rows, which seem to occur very infrequently, two queries will be 
executed instead of three.
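
The insert-first scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the code in the patch: Python's built-in sqlite3 module stands in for the JDBC connection to MySQL or Derby, the function name publish_stat is hypothetical, and the derived-table alias is PREV rather than TEMP purely to sidestep keyword handling in SQLite.

```python
import sqlite3


def publish_stat(conn, row_id, new_value):
    """Insert a stats row; on a key violation, update it only if new_value is larger.

    Returns the number of SQL statements executed (1 or 2).
    """
    try:
        # Common case: the ID is new, so a single INSERT is all we need.
        conn.execute(
            "INSERT INTO PARTITION_STAT_TBL (ID, ROW_COUNT) VALUES (?, ?)",
            (row_id, new_value),
        )
        return 1
    except sqlite3.IntegrityError:
        # Rare case: another task already published this ID. The nested
        # SELECT mirrors the materialization trick from the comment above.
        conn.execute(
            "UPDATE PARTITION_STAT_TBL SET ROW_COUNT = ? "
            "WHERE ID = ? AND ? > "
            "(SELECT PREV.ROW_COUNT FROM "
            "(SELECT ROW_COUNT FROM PARTITION_STAT_TBL WHERE ID = ?) PREV)",
            (new_value, row_id, new_value, row_id),
        )
        return 2


def current_count(conn, row_id):
    """Read back the stored ROW_COUNT for one ID (helper for the usage below)."""
    cur = conn.execute(
        "SELECT ROW_COUNT FROM PARTITION_STAT_TBL WHERE ID = ?", (row_id,)
    )
    return cur.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PARTITION_STAT_TBL (ID VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT)"
)
first = publish_stat(conn, "part-0", 10)   # new ID: one statement
second = publish_stat(conn, "part-0", 5)   # duplicate with a smaller value: no change
third = publish_stat(conn, "part-0", 20)   # duplicate with a larger value: updated
```

With a real JDBC driver the duplicate would surface as an SQLException rather than sqlite3.IntegrityError, and how to recognize a key violation reliably across databases is left open here.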




[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

2011-05-18 Thread Tomasz Nykiel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035634#comment-13035634
 ] 

Tomasz Nykiel commented on HIVE-2144:
-

Yes, I agree. There are some subtle differences between UNIQUE and PK in Derby 
and MySQL (e.g., in MySQL the unique index allows null values, and in Derby it 
does not). So in general, a PK constraint will be more suitable.

CREATE TABLE PARTITION_STAT_TBL ( ID VARCHAR(255) PRIMARY KEY, ROW_COUNT 
BIGINT ) works for both Derby and MySQL.
After a quick check, it seems that it is supported by Oracle/MSSQL as well.


