Miklos Szurap created SPARK-47444:
-------------------------------------
Summary: Empty numRows table stats should not break Hive tables
Key: SPARK-47444
URL: https://issues.apache.org/jira/browse/SPARK-47444
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.8
Reporter: Miklos Szurap
A Hive table cannot be accessed, queried, or updated from Spark (it is
completely "broken") if the "numRows" table property (table stat) is populated
with a non-numeric value (such as an empty string). Accessing the table from
Spark results in a "NumberFormatException":
{code}
scala> spark.sql("select * from t1p").show()
java.lang.NumberFormatException: Zero length BigInteger
at java.math.BigInteger.<init>(BigInteger.java:420)
...
at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1243)
...
at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:91)
...
{code}
or, similarly, with just:
{code}
java.lang.NumberFormatException: For input string: "Foo"
{code}
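For reference, both failure modes can be reproduced directly against java.math.BigInteger, which Spark's BigInt(_) conversion in readHiveStats ultimately delegates to. This is a standalone sketch, not Spark code:

```java
import java.math.BigInteger;

public class NumRowsParseDemo {
    public static void main(String[] args) {
        // An empty stat value fails exactly like the stack trace above.
        try {
            new BigInteger("");
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage()); // Zero length BigInteger
        }
        // A non-numeric stat value fails with the second variant.
        try {
            new BigInteger("Foo");
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage()); // e.g. For input string: "Foo"
        }
    }
}
```

Note that neither message mentions which table property (or even which table) held the bad value, which is what makes the error so hard to diagnose.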
Currently, the table stats can be broken through Spark with:
{code}
scala> spark.sql("alter table t1p set tblproperties('numRows'='', 'STATS_GENERATED_VIA_STATS_TASK'='true')").show()
{code}
Spark should:
1. Validate Spark SQL "alter table" statements and disallow non-numeric values
in the "totalSize", "numRows", and "rawDataSize" table properties, as those are
checked in
[HiveClientImpl#readHiveStats()|https://github.com/apache/spark/blob/1aafe60b3e7633f755499f5394ca62289e42588d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1260C15-L1260C28]
2. HiveClientImpl#readHiveStats should probably tolerate such invalid
"totalSize", "numRows", and "rawDataSize" table properties instead of failing
with a cryptic NumberFormatException, and treat them as zero. At the very
least, the error message should state which table property is invalid.
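A minimal sketch of the tolerant reading proposed in item #2 (the helper name and signature are hypothetical, not Spark's actual API): parse each stat defensively, treat an unparsable value as absent, and name the offending property in the warning:

```java
import java.math.BigInteger;
import java.util.Optional;

public class HiveStatReader {
    // Hypothetical helper mirroring what readHiveStats could do: return the
    // stat if it parses, otherwise log which property held the bad value and
    // treat the stat as absent instead of throwing.
    static Optional<BigInteger> readStat(String name, String rawValue) {
        if (rawValue == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(new BigInteger(rawValue.trim()));
        } catch (NumberFormatException e) {
            System.err.println("Ignoring invalid table property '" + name
                + "' = '" + rawValue + "': not a number");
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(readStat("numRows", "42"));      // Optional[42]
        System.out.println(readStat("numRows", ""));        // Optional.empty
        System.out.println(readStat("rawDataSize", "Foo")); // Optional.empty
    }
}
```

With this approach a table carrying 'numRows'='' would simply be read as having no row-count stat, rather than becoming unreadable.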
Note: beeline/Hive validates alter table statements; however, Impala can
similarly break the table, so item #1 above needs to be fixed there too.
I have only checked the behavior on Spark 2.4.x; the same issue probably
exists in Spark 3.x as well.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]