[ https://issues.apache.org/jira/browse/HIVE-20574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mikko Kivistö updated HIVE-20574: --------------------------------- Description: 1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) using some tool (aws cli, hdfs command or anything) - S3: s3://www.smartdatahub.io/data/test.parquet - HTTP: [http://www.smartdatahub.io/data/test.parquet] - or the attachmen eg. with aws cli, wget/curl/distcp can also be used {\{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet . }} {\{ hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}hdfs dfs -put test.parquet }} {{ hdfs:///tmp/testi_parquet/test.parquet}} 2) Create table default.testi_parquet2 on top of that using the schema provided {{CREATE TABLE `default.testi_parquet2`(}} {{ `rakennustu` int, }} {{ `kohdenimi` string, }} {{ `tekstisuun` int, }} {{ `tekstikoko` float, }} {{ `tekstifont` string, }} {{ `buix_bid` int, }} {{ `paivitetty` string, }} {{ `datanomist` string, }} {{ `geom_geojson` string, }} {{ `geom` binary, }} {{ `extractdate` string)}} {{ ROW FORMAT SERDE }} {{ 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }} {{ STORED AS INPUTFORMAT }} {{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }} {{ OUTPUTFORMAT }} {{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}} {{ LOCATION}} {{ 'hdfs:///tmp/testi_parquet/';}} {{ -- CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE FROM STEP 1}} 3) To collect the values showing you the actual reality of the data: Query the distinct count, min and max of column "tekstisuun" {{ SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM default.testi_parquet2; }} and note them (min 0, max 0, distinct 1) 4) Compute statistics for the table using {{ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;}} 5) See erroneous statistics entry for numDistincts: Query the statistics by using " {{DESCRIBE FORMATTED default.testi_parquet2 tekstisuun}} " and note the ERRANEOUS numDistincts value: 2 was: 1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) using some tool (aws cli, hdfs command or anything) - S3: s3://www.smartdatahub.io/data/test.parquet - HTTP: [http://www.smartdatahub.io/data/test.parquet] - or the attachmen eg. with aws cli, wget/curl/distcp can also be used {{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet . }} {{ hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}hdfs dfs -put test.parquet }} {{ hdfs:///tmp/testi_parquet/test.parquet}} 2) Create table default.testi_parquet2 on top of that using the schema provided {\{ CREATE TABLE `default.testi_parquet2`(}} \{{ `rakennustu` int, }} \{{ `kohdenimi` string, }} \{{ `tekstisuun` int, }} \{{ `tekstikoko` float, }} \{{ `tekstifont` string, }} \{{ `buix_bid` int, }} \{{ `paivitetty` string, }} \{{ `datanomist` string, }} \{{ `geom_geojson` string, }} \{{ `geom` binary, }} \{{ `extractdate` string)}} \{{ ROW FORMAT SERDE }} \{{ 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }} \{{ STORED AS INPUTFORMAT }} \{{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }} \{{ OUTPUTFORMAT }} \{{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}} \{{ LOCATION}} \{{ 'hdfs:///tmp/testi_parquet/';}} \{{ – CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE FROM STEP 1 IF IT DIFFERS FROM THE EXAMPLE}} 3) To collect the values showing you the actual reality of the data: Query the distinct count, min and max of column "tekstisuun" {\{SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM default.testi_parquet2; }} and note them (min 0, max 0, distinct 1) 4) Compute statistics for the table using {{ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;}} 5) See erroneous statistics entry for numDistincts: Query the statistics by using " {{DESCRIBE FORMATTED default.testi_parquet2 tekstisuun}} " and note the ERRANEOUS numDistincts value: 2 > Column statistics give erraneous numDistinct > -------------------------------------------- > > Key: HIVE-20574 > URL: https://issues.apache.org/jira/browse/HIVE-20574 > Project: Hive > Issue Type: Bug > Components: Metastore, Statistics > Affects Versions: 2.3.2 > Environment: Amazon EMR (BigTop based) from emr-5.9.0 to emr-5.16.0. > Reporter: Mikko Kivistö > Priority: Major > Labels: Statistics, statsCollection > > 1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) > using some tool (aws cli, hdfs command or anything) > - S3: s3://www.smartdatahub.io/data/test.parquet > - HTTP: [http://www.smartdatahub.io/data/test.parquet] > - or the attachmen > eg. with aws cli, wget/curl/distcp can also be used > {\{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet . }} > {\{ hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}hdfs dfs -put test.parquet }} > {{ hdfs:///tmp/testi_parquet/test.parquet}} > 2) Create table default.testi_parquet2 on top of that using the schema > provided > {{CREATE TABLE `default.testi_parquet2`(}} > {{ `rakennustu` int, }} > {{ `kohdenimi` string, }} > {{ `tekstisuun` int, }} > {{ `tekstikoko` float, }} > {{ `tekstifont` string, }} > {{ `buix_bid` int, }} > {{ `paivitetty` string, }} > {{ `datanomist` string, }} > {{ `geom_geojson` string, }} > {{ `geom` binary, }} > {{ `extractdate` string)}} > {{ ROW FORMAT SERDE }} > {{ 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }} > {{ STORED AS INPUTFORMAT }} > {{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }} > {{ OUTPUTFORMAT }} > {{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}} > {{ LOCATION}} > {{ 'hdfs:///tmp/testi_parquet/';}} > {{ -- CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE > FROM STEP 1}} > 3) To collect the values showing you the actual reality of the data: Query > the distinct count, min and max of column "tekstisuun" > {{ SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM > default.testi_parquet2; }} > and note them (min 0, max 0, distinct 1) > 4) Compute statistics for the table using > {{ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;}} > 5) See erroneous statistics entry for numDistincts: Query the statistics by > using " > {{DESCRIBE FORMATTED default.testi_parquet2 tekstisuun}} > " and note the ERRANEOUS numDistincts value: 2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)