Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14712
  
    Very recently, Hive 2.0.0 fixed a serious bug. See the JIRA: 
https://issues.apache.org/jira/browse/HIVE-12661 
    
    We can get wrong results when the statistics are out of date (i.e., wrong). I
    can easily reproduce it.
    
    Hive made a few changes to `COLUMN_STATS_ACCURATE`. Before Hive 2.0.0, it
    looks like
    ```
    COLUMN_STATS_ACCURATE true
    ```
    After the fix, it becomes:
    ```
    COLUMN_STATS_ACCURATE {"COLUMN_STATS":{"key":"true","value":"true"},"BASIC_STATS":"true"}
    ```
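
    For reference, here is a minimal sketch of how one could inspect that raw
    parameter through the Hive metastore client. The metastore setup and the table
    name (`default.t`) are assumptions for illustration, not code from this PR.
    ```scala
    // Minimal sketch: read the raw COLUMN_STATS_ACCURATE parameter from the metastore.
    // Assumes hive-site.xml is on the classpath and a table `default.t` exists
    // (both are illustrative assumptions).
    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

    val client = new HiveMetaStoreClient(new HiveConf())
    val table = client.getTable("default", "t")
    // Before Hive 2.0.0 this is the literal string "true"; after HIVE-12661 it is
    // a JSON blob with per-column flags plus "BASIC_STATS".
    println(table.getParameters.get("COLUMN_STATS_ACCURATE"))
    client.close()
    ```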
    
    I am starting to worry that the statistics values we populate might affect
    query results through the Hive interface, especially when we set
    `STATS_GENERATED_VIA_STATS_TASK`.
    
    We are unable to implement concurrency control between Hive and Spark. If we
    use the same property names, the statistics become a shared memory space: both
    Hive and Spark can modify them without the other noticing. I have not found a
    concrete bug yet, but it sounds risky.
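
    To make the shared-memory concern concrete, here is a hypothetical sketch of
    an external writer pushing stats straight into the parameters Hive owns. The
    client setup, table name, and values are made up for illustration; this is not
    code from this PR.
    ```scala
    // Hypothetical sketch: an external writer updating Hive-owned stats parameters.
    // Hive may later trust these values for optimization, or a Hive write may
    // silently overwrite or clear them; nothing coordinates the two writers.
    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

    val client = new HiveMetaStoreClient(new HiveConf())
    val table = client.getTable("default", "t")
    table.getParameters.put("numRows", "1000")     // value computed outside Hive
    table.getParameters.put("totalSize", "123456") // value computed outside Hive
    client.alter_table("default", "t", table)
    client.close()
    ```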
    
    Conceptually, setting `STATS_GENERATED_VIA_STATS_TASK` is wrong. As @wzhfy
    pointed out, `numRows` is always `-1` if we do not set it. That also indicates
    that external users are not supposed to set `numRows`, right?
    
    I am not confident that we can implement a stable solution for sharing
    Spark-generated statistics with Hive, since Hive is out of our control.
    However, Spark should be able to leverage both Hive-generated statistics and
    the statistics we generate in this CBO work.
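
    As a sketch of that direction (the property key names below are assumptions,
    not from this PR), Spark could keep its own statistics under Spark-namespaced
    keys and only read Hive's `numRows` when Hive actually generated it:
    ```scala
    // Sketch with assumed key names: prefer a Spark-owned row count if present,
    // otherwise fall back to Hive's "numRows"; treat Hive's -1 as "not set".
    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

    val client = new HiveMetaStoreClient(new HiveConf())
    val params = client.getTable("default", "t").getParameters
    val rowCount: Option[Long] =
      Option(params.get("spark.sql.statistics.numRows"))  // hypothetical Spark-owned key
        .orElse(Option(params.get("numRows")))             // Hive-generated statistic
        .map(_.toLong)
        .filter(_ >= 0)
    println(rowCount)
    client.close()
    ```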


