[ https://issues.apache.org/jira/browse/HIVE-5936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839684#comment-13839684 ]

Prasanth J commented on HIVE-5936:
----------------------------------

Even ROW_COUNT and RAW_DATA_SIZE are not reliable. The following sequence of operations illustrates this:

{code}
hive> create table test (key string, value string);
OK
Time taken: 0.069 seconds
hive> load data local inpath '/work/hive/trunk/hive-git/data/files/kv1.txt' into table test;
Copying data from file:/work/hive/trunk/hive-git/data/files/kv1.txt
Copying file: file:/work/hive/trunk/hive-git/data/files/kv1.txt
Loading data to table default.test
Table default.test stats: [numFiles, numRows, totalSize, rawDataSize]
OK
Time taken: 0.231 seconds
hive> desc formatted test;
OK
# col_name              data_type               comment

key                     string                  None
value                   string                  None

# Detailed Table Information
Database:               default
Owner:                  pjayachandran
CreateTime:             Wed Dec 04 17:31:32 PST 2013
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               file:/tmp/warehouse/test
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   true
        numFiles                1
        numRows                 0
        rawDataSize             0
        totalSize               5812
        transient_lastDdlTime   1386207121

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.094 seconds, Fetched: 32 row(s)
hive> drop table test;
OK
Time taken: 0.423 seconds
hive> set hive.stats.autogather=false;
hive> create table test (key string, value string);
OK
Time taken: 0.03 seconds
hive> load data local inpath '/work/hive/trunk/hive-git/data/files/kv1.txt' into table test;
Copying data from file:/work/hive/trunk/hive-git/data/files/kv1.txt
Copying file: file:/work/hive/trunk/hive-git/data/files/kv1.txt
Loading data to table default.test
OK
Time taken: 0.097 seconds
hive> desc formatted test;
OK
# col_name              data_type               comment

key                     string                  None
value                   string                  None

# Detailed Table Information
Database:               default
Owner:                  pjayachandran
CreateTime:             Wed Dec 04 17:32:29 PST 2013
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               file:/tmp/warehouse/test
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        numFiles                1
        numRows                 -1
        rawDataSize             -1
        totalSize               5812
        transient_lastDdlTime   1386207152

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.061 seconds, Fetched: 32 row(s)
hive> set hive.stats.collect.rawdatasize=false;
hive> analyze table test compute statistics;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
Listening for transport dt_socket at address: 65378
2013-12-04 17:35:55.379 java[81428:1003] Unable to load realm info from SCDynamicStore
Execution log at: /var/folders/2w/4x52xg597k50_bt27x3_k9tw0000gn/T//pjayachandran/pjayachandran_20131204173535_82f7e5c3-0016-4a63-a89c-e07b6ed07ab4.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-12-04 17:35:57,347 null map = 0%,  reduce = 0%
2013-12-04 17:36:14,366 null map = 100%,  reduce = 0%
Ended Job = job_local124477567_0001
Execution completed successfully
MapredLocal task succeeded
Table default.test stats: [numFiles, numRows, totalSize, rawDataSize]
OK
Time taken: 36.769 seconds
hive> desc formatted test;
OK
# col_name              data_type               comment

key                     string                  None
value                   string                  None

# Detailed Table Information
Database:               default
Owner:                  pjayachandran
CreateTime:             Wed Dec 04 17:32:29 PST 2013
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               file:/tmp/warehouse/test
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   true
        numFiles                1
        numRows                 500
        rawDataSize             0
        totalSize               5812
        transient_lastDdlTime   1386207374

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.064 seconds, Fetched: 32 row(s)
hive>
{code}

As seen above, the statistics differ depending on whether stats autogathering is enabled. In the first run (hive.stats.autogather not disabled), LOAD DATA leaves numRows and rawDataSize at 0 with COLUMN_STATS_ACCURATE set to true. With hive.stats.autogather=false, both are -1 and COLUMN_STATS_ACCURATE is false. After ANALYZE with hive.stats.collect.rawdatasize=false, numRows is 500 but rawDataSize is still 0.

Also, not all SerDes support RAW_DATA_SIZE. AFAIK, LazySimpleSerDe and ORC support RAW_DATA_SIZE. LazySimpleSerDe supports it during both INSERT and ANALYZE, but ORC supports it only during INSERT. Since stats can be updated through multiple code paths, I do not think RAW_DATA_SIZE and ROW_COUNT are always reliable.

The following code segment is removed in HIVE-5921:

{code}
if (nr < 0) {
  nr = 0;
}
{code}

Instead, if ROW_COUNT is <= 0, the number of rows is estimated from the average row size computed from the schema:

{code}
if (nr <= 0) {
  nr = 0;
  int avgRowSize = estimateRowSizeFromSchema(conf, schema, neededColumns);
  if (avgRowSize > 0) {
    nr = ds / avgRowSize;
  }
}
{code}

Another subtask, HIVE-5949, will add a flag indicating whether the statistics are accurate (all statistics come from the metastore) or estimated.

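To make the fallback concrete, here is a minimal standalone sketch of the same arithmetic. It is not the Hive code: the class and method names are hypothetical, the average row size is passed in directly instead of calling estimateRowSizeFromSchema, and the value 12 bytes per row is an assumption chosen only for illustration. The data size 5812 is the totalSize from the session above; whether ds in the real code is totalSize or rawDataSize is not stated here.

```java
// Hypothetical illustration of the row-count fallback; not part of Hive.
class RowCountEstimate {

    // numRows: ROW_COUNT as read from the metastore (may be 0 or -1).
    // dataSize: size of the data in bytes (stands in for "ds").
    // avgRowSize: assumed average row size in bytes (stands in for the
    //             result of estimateRowSizeFromSchema).
    static long estimateRows(long numRows, long dataSize, int avgRowSize) {
        long nr = numRows;
        if (nr <= 0) {
            nr = 0;
            if (avgRowSize > 0) {
                // Integer division, as in the quoted snippet.
                nr = dataSize / avgRowSize;
            }
        }
        return nr;
    }

    public static void main(String[] args) {
        // Missing row count (-1): falls back to 5812 / 12.
        System.out.println(estimateRows(-1, 5812, 12));  // prints 484
        // Positive row count is used unchanged.
        System.out.println(estimateRows(500, 5812, 12)); // prints 500
        // No usable row size: the estimate stays at 0.
        System.out.println(estimateRows(0, 5812, 0));    // prints 0
    }
}
```

With the assumed 12 bytes per row, the estimate for the 500-row file is 484, which shows why an estimated count should be distinguishable from one taken from the metastore.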
> analyze command failing to collect stats with counter mechanism
> ---------------------------------------------------------------
>
>                 Key: HIVE-5936
>                 URL: https://issues.apache.org/jira/browse/HIVE-5936
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>    Affects Versions: 0.13.0
>            Reporter: Ashutosh Chauhan
>            Assignee: Navis
>         Attachments: HIVE-5936.1.patch.txt, HIVE-5936.2.patch.txt
>
>
> With counter mechanism, MR job is successful, but StatsTask on client fails with NPE.

--
This message was sent by Atlassian JIRA
(v6.1#6144)