[jira] [Updated] (HIVE-20574) Column statistics give erraneous numDistinct

JIRA Mon, 17 Sep 2018 01:36:02 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-20574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mikko Kivistö updated HIVE-20574:
---------------------------------
    Description: 
1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) using 
some tool (aws cli, hdfs command or anything)
   - S3: s3://www.smartdatahub.io/data/test.parquet
   - HTTP: [http://www.smartdatahub.io/data/test.parquet]
   - or the attachmen

eg. with aws cli, wget/curl/distcp can also be used

{\{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet . }}

{\{ hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}hdfs dfs -put test.parquet }}

{{ hdfs:///tmp/testi_parquet/test.parquet}}

2) Create table default.testi_parquet2 on top of that using the schema provided

{{CREATE TABLE `default.testi_parquet2`(}}
{{   `rakennustu` int, }}
{{   `kohdenimi` string, }}
{{   `tekstisuun` int, }}
{{   `tekstikoko` float, }}
{{   `tekstifont` string, }}
{{   `buix_bid` int, }}
{{   `paivitetty` string, }}
{{   `datanomist` string, }}
{{   `geom_geojson` string, }}
{{   `geom` binary, }}
{{   `extractdate` string)}}
{{ ROW FORMAT SERDE }}
{{   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }}
{{ STORED AS INPUTFORMAT }}
{{   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }}
{{ OUTPUTFORMAT }}
{{   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}}
{{ LOCATION}}
{{   'hdfs:///tmp/testi_parquet/';}}
{{ -- CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE FROM 
STEP 1}}



3) To collect the values showing you the actual reality of the data: Query the 
distinct count, min and max of column "tekstisuun"

{{ SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM 
default.testi_parquet2; }}

and note them  (min 0, max 0, distinct 1)
 4) Compute statistics for the table using

{{ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;}}

5) See erroneous statistics entry for numDistincts: Query the statistics by 
using "

{{DESCRIBE FORMATTED default.testi_parquet2 tekstisuun}}

" and note the ERRANEOUS numDistincts value: 2

  was:
1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) using 
some tool (aws cli, hdfs command or anything)
   - S3: s3://www.smartdatahub.io/data/test.parquet
   - HTTP: [http://www.smartdatahub.io/data/test.parquet]
   - or the attachmen

eg. with aws cli, wget/curl/distcp can also be used

{{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet . }}

{{ hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}hdfs dfs -put test.parquet }}

{{ hdfs:///tmp/testi_parquet/test.parquet}}

2) Create table default.testi_parquet2 on top of that using the schema provided

{\{ CREATE TABLE `default.testi_parquet2`(}}
 \{{   `rakennustu` int, }}
 \{{   `kohdenimi` string, }}
 \{{   `tekstisuun` int, }}
 \{{   `tekstikoko` float, }}
 \{{   `tekstifont` string, }}
 \{{   `buix_bid` int, }}
 \{{   `paivitetty` string, }}
 \{{   `datanomist` string, }}
 \{{   `geom_geojson` string, }}
 \{{   `geom` binary, }}
 \{{   `extractdate` string)}}
 \{{ ROW FORMAT SERDE }}
 \{{   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }}
 \{{ STORED AS INPUTFORMAT }}
 \{{   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }}
 \{{ OUTPUTFORMAT }}
 \{{   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}}
 \{{ LOCATION}}
 \{{   'hdfs:///tmp/testi_parquet/';}}
 \{{ – CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE FROM 
STEP 1 IF IT DIFFERS FROM THE EXAMPLE}}

3) To collect the values showing you the actual reality of the data: Query the 
distinct count, min and max of column "tekstisuun"

{\{SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM 
default.testi_parquet2; }}

and note them  (min 0, max 0, distinct 1)
 4) Compute statistics for the table using

{{ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;}}

5) See erroneous statistics entry for numDistincts: Query the statistics by 
using "

{{DESCRIBE FORMATTED default.testi_parquet2 tekstisuun}}

" and note the ERRANEOUS numDistincts value: 2


> Column statistics give erraneous numDistinct
> --------------------------------------------
>
>                 Key: HIVE-20574
>                 URL: https://issues.apache.org/jira/browse/HIVE-20574
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore, Statistics
>    Affects Versions: 2.3.2
>         Environment: Amazon EMR (BigTop based) from emr-5.9.0 to emr-5.16.0.
>            Reporter: Mikko Kivistö
>            Priority: Major
>              Labels: Statistics, statsCollection
>
> 1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) 
> using some tool (aws cli, hdfs command or anything)
>    - S3: s3://www.smartdatahub.io/data/test.parquet
>    - HTTP: [http://www.smartdatahub.io/data/test.parquet]
>    - or the attachmen
> eg. with aws cli, wget/curl/distcp can also be used
> {\{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet . }}
> {\{ hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}hdfs dfs -put test.parquet }}
> {{ hdfs:///tmp/testi_parquet/test.parquet}}
> 2) Create table default.testi_parquet2 on top of that using the schema 
> provided
> {{CREATE TABLE `default.testi_parquet2`(}}
> {{   `rakennustu` int, }}
> {{   `kohdenimi` string, }}
> {{   `tekstisuun` int, }}
> {{   `tekstikoko` float, }}
> {{   `tekstifont` string, }}
> {{   `buix_bid` int, }}
> {{   `paivitetty` string, }}
> {{   `datanomist` string, }}
> {{   `geom_geojson` string, }}
> {{   `geom` binary, }}
> {{   `extractdate` string)}}
> {{ ROW FORMAT SERDE }}
> {{   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }}
> {{ STORED AS INPUTFORMAT }}
> {{   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }}
> {{ OUTPUTFORMAT }}
> {{   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}}
> {{ LOCATION}}
> {{   'hdfs:///tmp/testi_parquet/';}}
> {{ -- CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE 
> FROM STEP 1}}
> 3) To collect the values showing you the actual reality of the data: Query 
> the distinct count, min and max of column "tekstisuun"
> {{ SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM 
> default.testi_parquet2; }}
> and note them  (min 0, max 0, distinct 1)
>  4) Compute statistics for the table using
> {{ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;}}
> 5) See erroneous statistics entry for numDistincts: Query the statistics by 
> using "
> {{DESCRIBE FORMATTED default.testi_parquet2 tekstisuun}}
> " and note the ERRANEOUS numDistincts value: 2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HIVE-20574) Column statistics give erraneous numDistinct

Reply via email to