[ 
https://issues.apache.org/jira/browse/HIVE-5936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839684#comment-13839684
 ] 

Prasanth J commented on HIVE-5936:
----------------------------------

Even ROW_COUNT and RAW_DATA_SIZE are not reliable. The following sequence of 
operations illustrates this:
{code}
hive> create table test (key string, value string);
OK
Time taken: 0.069 seconds
hive> load data local inpath '/work/hive/trunk/hive-git/data/files/kv1.txt' into table test;
Copying data from file:/work/hive/trunk/hive-git/data/files/kv1.txt
Copying file: file:/work/hive/trunk/hive-git/data/files/kv1.txt
Loading data to table default.test
Table default.test stats: [numFiles, numRows, totalSize, rawDataSize]
OK
Time taken: 0.231 seconds
hive> desc formatted test;
OK
# col_name              data_type               comment             
                 
key                     string                  None                
value                   string                  None                
                 
# Detailed Table Information             
Database:               default                  
Owner:                  pjayachandran            
CreateTime:             Wed Dec 04 17:31:32 PST 2013     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               file:/tmp/warehouse/test         
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   true                
        numFiles                1                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               5812                
        transient_lastDdlTime   1386207121          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   
Time taken: 0.094 seconds, Fetched: 32 row(s)
hive> drop table test;
OK
Time taken: 0.423 seconds
hive> set hive.stats.autogather=false;
hive> create table test (key string, value string);
OK
Time taken: 0.03 seconds
hive> load data local inpath '/work/hive/trunk/hive-git/data/files/kv1.txt' into table test;
Copying data from file:/work/hive/trunk/hive-git/data/files/kv1.txt
Copying file: file:/work/hive/trunk/hive-git/data/files/kv1.txt
Loading data to table default.test
OK
Time taken: 0.097 seconds
hive> desc formatted test;
OK
# col_name              data_type               comment             
                 
key                     string                  None                
value                   string                  None                
                 
# Detailed Table Information             
Database:               default                  
Owner:                  pjayachandran            
CreateTime:             Wed Dec 04 17:32:29 PST 2013     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               file:/tmp/warehouse/test         
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   false               
        numFiles                1                   
        numRows                 -1                  
        rawDataSize             -1                  
        totalSize               5812                
        transient_lastDdlTime   1386207152          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   
Time taken: 0.061 seconds, Fetched: 32 row(s)
hive> set hive.stats.collect.rawdatasize=false;
hive> analyze table test compute statistics;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
Listening for transport dt_socket at address: 65378
2013-12-04 17:35:55.379 java[81428:1003] Unable to load realm info from SCDynamicStore
Execution log at: /var/folders/2w/4x52xg597k50_bt27x3_k9tw0000gn/T//pjayachandran/pjayachandran_20131204173535_82f7e5c3-0016-4a63-a89c-e07b6ed07ab4.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-12-04 17:35:57,347 null map = 0%,  reduce = 0%
2013-12-04 17:36:14,366 null map = 100%,  reduce = 0%
Ended Job = job_local124477567_0001
Execution completed successfully
MapredLocal task succeeded
Table default.test stats: [numFiles, numRows, totalSize, rawDataSize]
OK
Time taken: 36.769 seconds
hive> desc formatted test;                     
OK
# col_name              data_type               comment             
                 
key                     string                  None                
value                   string                  None                
                 
# Detailed Table Information             
Database:               default                  
Owner:                  pjayachandran            
CreateTime:             Wed Dec 04 17:32:29 PST 2013     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               file:/tmp/warehouse/test         
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   true                
        numFiles                1                   
        numRows                 500                 
        rawDataSize             0                   
        totalSize               5812                
        transient_lastDdlTime   1386207374          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   
Time taken: 0.064 seconds, Fetched: 32 row(s)
hive> 
{code}

As seen above, the statistics differ depending on whether automatic stats 
gathering is enabled. With hive.stats.autogather enabled, LOAD DATA leaves 
numRows=0 and rawDataSize=0 while COLUMN_STATS_ACCURATE is true; with it 
disabled, both are -1. After ANALYZE with hive.stats.collect.rawdatasize=false, 
numRows is 500 but rawDataSize is still 0. Also, not all SerDes support 
RAW_DATA_SIZE. AFAIK, LazySimpleSerDe and ORC support it: LazySimpleSerDe 
during both INSERT and ANALYZE, but ORC only during INSERT. Since there are 
multiple code paths through which stats can be updated, I do not think 
RAW_DATA_SIZE and ROW_COUNT are always reliable. 
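
To make the three observed states concrete, here is a minimal sketch of a defensive check over the table parameters printed by desc formatted. The class and method names (TableStatsCheck, longParam, rowCountUsable) are hypothetical and are not part of Hive; the rule that a row count is trusted only when COLUMN_STATS_ACCURATE is true and numRows is positive is an assumption drawn from the session above, not Hive's actual logic.

```java
import java.util.Map;

// Hypothetical helper, not part of Hive: decides whether numRows from the
// table parameters looks trustworthy, based on the three states seen above.
final class TableStatsCheck {

    /** Parses a numeric table parameter; returns -1 when missing or malformed. */
    static long longParam(Map<String, String> params, String key) {
        String value = params.get(key);
        if (value == null) {
            return -1L;
        }
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    /**
     * Assumed rule: trust numRows only if the stats are flagged accurate AND
     * the count is positive. A genuinely empty table is also rejected, which
     * is the price of not trusting the numRows=0 left behind by LOAD DATA.
     */
    static boolean rowCountUsable(Map<String, String> params) {
        // Boolean.parseBoolean(null) is false, so a missing flag means "not accurate".
        boolean accurate = Boolean.parseBoolean(params.get("COLUMN_STATS_ACCURATE"));
        return accurate && longParam(params, "numRows") > 0;
    }
}
```

Under this assumed rule, only the parameters seen after ANALYZE (COLUMN_STATS_ACCURATE=true, numRows=500) pass; both post-LOAD states are rejected.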

The following code segment is removed in HIVE-5921:
{code}
if (nr < 0) {
  nr = 0;
}
{code}
Instead, if ROW_COUNT is <= 0, the number of rows will be estimated based on 
the average row size computed from the schema:
{code}
if (nr <= 0) {
  nr = 0;
  int avgRowSize = estimateRowSizeFromSchema(conf, schema, neededColumns);
  if (avgRowSize > 0) {
    nr = ds / avgRowSize;
  }
}
{code}
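
As a standalone illustration of that fallback, here is a minimal sketch that isolates the arithmetic. The class RowCountEstimate and its parameter names are hypothetical; the average row size is passed in directly rather than computed by estimateRowSizeFromSchema, and the value 12 used in the note below is made up for the example, not what Hive would compute for this schema.

```java
// Hypothetical, self-contained version of the fallback shown above.
final class RowCountEstimate {

    /**
     * Returns numRows when it is positive; otherwise estimates the row count
     * as dataSize / avgRowSize (integer division), or 0 when no positive
     * average row size is available.
     */
    static long estimateNumRows(long numRows, long dataSize, int avgRowSize) {
        if (numRows > 0) {
            return numRows;               // metastore value is kept as-is
        }
        if (avgRowSize > 0) {
            return dataSize / avgRowSize; // estimate from data size
        }
        return 0L;                        // nothing to estimate from
    }
}
```

For the table above (totalSize 5812, numRows -1), an assumed average row size of 12 bytes would give 5812 / 12 = 484 rows, compared with the 500 that ANALYZE reports, so the result is an estimate rather than an exact count.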

Another subtask, HIVE-5949, will add a flag indicating whether the statistics 
are accurate (all statistics come from the metastore) or estimated. 

> analyze command failing to collect stats with counter mechanism
> ---------------------------------------------------------------
>
>                 Key: HIVE-5936
>                 URL: https://issues.apache.org/jira/browse/HIVE-5936
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>    Affects Versions: 0.13.0
>            Reporter: Ashutosh Chauhan
>            Assignee: Navis
>         Attachments: HIVE-5936.1.patch.txt, HIVE-5936.2.patch.txt
>
>
> With counter mechanism, MR job is successful, but StatsTask on client fails 
> with NPE.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
