[ 
https://issues.apache.org/jira/browse/SPARK-47444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szurap updated SPARK-47444:
----------------------------------
    Attachment: reproduction_steps_SPARK-47444.txt

> Empty numRows table stats should not break Hive tables
> ------------------------------------------------------
>
>                 Key: SPARK-47444
>                 URL: https://issues.apache.org/jira/browse/SPARK-47444
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.8
>            Reporter: Miklos Szurap
>            Priority: Major
>              Labels: Hive, HiveMetaStoreClient, SQL
>         Attachments: reproduction_steps_SPARK-47444.txt
>
>
> A Hive table cannot be accessed, queried, or updated from Spark (it is 
> completely "broken") if the "numRows" table property (a table stat) is 
> populated with a non-numeric value (such as an empty string). Accessing the 
> table from Spark results in a "NumberFormatException":
> {code}
> scala> spark.sql("select * from t1p").show()
> java.lang.NumberFormatException: Zero length BigInteger
>   at java.math.BigInteger.<init>(BigInteger.java:420)
> ...
>   at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1243)
> ...
>   at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:91)
> ...
> {code}
> or, similarly, for other non-numeric values:
> {code}
> java.lang.NumberFormatException: For input string: "Foo"
> {code}
> Currently, the table stats can be broken from Spark itself with:
> {code}
> scala> spark.sql("alter table t1p set tblproperties('numRows'='', 'STATS_GENERATED_VIA_STATS_TASK'='true')").show()
> {code}
>  
> Spark should:
> 1. Validate Spark SQL "alter table" statements and reject non-numeric 
> values in the "totalSize", "numRows", and "rawDataSize" table properties, as 
> those are parsed in 
> [HiveClientImpl#readHiveStats()|https://github.com/apache/spark/blob/1aafe60b3e7633f755499f5394ca62289e42588d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1260C15-L1260C28]
> 2. Make HiveClientImpl#readHiveStats tolerate invalid "totalSize", 
> "numRows", and "rawDataSize" table properties rather than fail with a 
> cryptic NumberFormatException, probably by treating them as zero. At the very 
> least, the error message should indicate which table property is incorrect.
> Note: beeline/Hive validates "alter table" statements. Impala, however, can 
> break the table in the same way, so item #1 above needs to be fixed there too.
> I have checked only the Spark 2.4.x behavior; the same problem probably 
> exists in Spark 3.x too.
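
The two suggestions in the description could look roughly like the sketch below. This is only an illustration under stated assumptions, not the actual Spark fix: `StatsProps`, `parseStat`, `readStat`, and `invalidStats` are hypothetical names that do not exist in HiveClientImpl, and the code works on a plain `Map[String, String]` of table properties so that it runs without Spark or Hive on the classpath. It also returns `None` for a bad value instead of zero, which is a deliberate departure from the "treat those as zero" wording.

```scala
import scala.util.Try

// Hypothetical helper, not part of Spark's HiveClientImpl.
object StatsProps {
  // The three table properties named in the report.
  val NumericStatKeys: Set[String] = Set("totalSize", "numRows", "rawDataSize")

  // Item 2: lenient parse. Null, empty, blank, and non-numeric values
  // yield None instead of throwing NumberFormatException.
  def parseStat(raw: String): Option[BigInt] =
    Option(raw).map(_.trim).filter(_.nonEmpty).flatMap(s => Try(BigInt(s)).toOption)

  // Read one stat from the table properties; None if absent or invalid.
  def readStat(props: Map[String, String], key: String): Option[BigInt] =
    props.get(key).flatMap(parseStat)

  // Item 1: the stat properties whose values are not valid integers.
  // An "alter table" check could reject the statement when this is non-empty,
  // and the returned keys tell the user which property is wrong.
  def invalidStats(props: Map[String, String]): Map[String, String] =
    props.filter { case (k, v) => NumericStatKeys.contains(k) && parseStat(v).isEmpty }
}
```

With the properties from the reproduction, `StatsProps.invalidStats(Map("numRows" -> "", "STATS_GENERATED_VIA_STATS_TASK" -> "true"))` returns `Map("numRows" -> "")`, and `StatsProps.readStat(props, "numRows").getOrElse(BigInt(0))` gives the zero fallback mentioned in item 2. Returning `None` keeps "unknown" distinguishable from a genuine row count of zero; whether that or zero is the better default is a judgment call for the actual fix.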



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
