[
https://issues.apache.org/jira/browse/HIVE-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
liyunzhang_intel updated HIVE-17182:
------------------------------------
Description:
On the TPC-DS 200 GB scale store_sales table, use "describe formatted store_sales" to view the statistics:
{code}
hive> describe formatted store_sales;
OK
# col_name data_type comment
ss_sold_time_sk bigint
ss_item_sk bigint
ss_customer_sk bigint
ss_cdemo_sk bigint
ss_hdemo_sk bigint
ss_addr_sk bigint
ss_store_sk bigint
ss_promo_sk bigint
ss_ticket_number bigint
ss_quantity int
ss_wholesale_cost double
ss_list_price double
ss_sales_price double
ss_ext_discount_amt double
ss_ext_sales_price double
ss_ext_wholesale_cost double
ss_ext_list_price double
ss_ext_tax double
ss_coupon_amt double
ss_net_paid double
ss_net_paid_inc_tax double
ss_net_profit double
# Partition Information
# col_name data_type comment
ss_sold_date_sk bigint
# Detailed Table Information
Database: tpcds_bin_partitioned_parquet_200
Owner: root
CreateTime: Tue Jun 06 11:51:48 CST 2017
LastAccessTime: UNKNOWN
Retention: 0
Location:
hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
numFiles 2023
numPartitions 1824
numRows 575995635
rawDataSize 12671903970
totalSize 46465926745
transient_lastDdlTime 1496721108
{code}
The rawDataSize is nearly 12 GB while the totalSize is nearly 46 GB.
View the original data on HDFS:
{noformat}
#hadoop fs -du -h /tmp/tpcds-generate/200/
75.8 G /tmp/tpcds-generate/200/store_sales
{noformat}
View the Parquet files on HDFS:
{noformat}
# hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db
43.3 G /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
{noformat}
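As a side note (my own arithmetic, not stated in the issue), the totalSize table parameter and the {{hadoop fs -du -h}} listing agree once units are accounted for: {{-du -h}} reports binary units (1 G = 2^30 bytes), so the 46,465,926,745-byte totalSize is the same data shown as 43.3 G:

```python
total_size_bytes = 46_465_926_745  # totalSize from "describe formatted"

# hadoop fs -du -h uses binary units (1 G = 2**30 bytes), so Hive's
# totalSize and the HDFS listing describe the same on-disk data.
print(total_size_bytes / 2**30)  # ~43.27, printed by -du -h as "43.3 G"
```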
It seems that the raw data size should be nearly 75 GB, but the "describe formatted store_sales" output shows only 12 GB.
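The reported table parameters allow a quick consistency check (my own observation, not stated in the issue): rawDataSize divided by numRows comes out to exactly 22, the number of non-partition columns in store_sales, which suggests the statistic is being computed as one byte per column value rather than the real uncompressed row size:

```python
# Values copied from the "describe formatted store_sales" output above.
num_rows = 575_995_635
raw_data_size = 12_671_903_970
non_partition_columns = 22  # ss_sold_time_sk .. ss_net_profit

# rawDataSize is exactly numRows * (non-partition column count),
# i.e. one byte per column value -- far below the ~75 GB of raw text.
print(raw_data_size / num_rows)                            # 22.0
print(num_rows * non_partition_columns == raw_data_size)   # True
```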
I tried "analyze table store_sales compute statistics for columns" to update the statistics, but rawDataSize did not change.
I then tried "analyze table store_sales partition(ss_sold_date_sk) compute statistics noscan" to update the statistics, but it failed with this error:
{code}
2017-09-28T03:21:04,849 INFO [StatsNoJobTask-Thread-1] exec.Task: [Warning] could not update stats for tpcds_bin_partitioned_parquet_10.store_sales{ss_sold_date_sk=2451769}. Failed with exception Missing timezone id for parquet int96 conversion!
java.lang.IllegalArgumentException: Missing timezone id for parquet int96 conversion!
	at org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTimeUtils.validateTimeZone(NanoTimeUtils.java:169)
	at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setTimeZoneConversion(ParquetRecordReaderBase.java:182)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:89)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:59)
	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:86)
	at org.apache.hadoop.hive.ql.exec.StatsNoJobTask$StatsCollection.run(StatsNoJobTask.java:164)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
The timezone error is a bug tracked by HIVE-12767.
> Invalid statistics like "RAW DATA SIZE" info for parquet file
> -------------------------------------------------------------
>
> Key: HIVE-17182
> URL: https://issues.apache.org/jira/browse/HIVE-17182
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang_intel
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)