[
https://issues.apache.org/jira/browse/SPARK-47444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Miklos Szurap updated SPARK-47444:
----------------------------------
Attachment: reproduction_steps_SPARK-47444.txt
> Empty numRows table stats should not break Hive tables
> ------------------------------------------------------
>
> Key: SPARK-47444
> URL: https://issues.apache.org/jira/browse/SPARK-47444
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.8
> Reporter: Miklos Szurap
> Priority: Major
> Labels: Hive, HiveMetaStoreClient, SQL
> Attachments: reproduction_steps_SPARK-47444.txt
>
>
> A Hive table cannot be accessed, queried, or updated from Spark (it is
> completely "broken") if the "numRows" table property (a table stat) is
> set to a non-numeric value (such as an empty string). Accessing the table
> from Spark results in a "NumberFormatException":
> {code}
> scala> spark.sql("select * from t1p").show()
> java.lang.NumberFormatException: Zero length BigInteger
> at java.math.BigInteger.<init>(BigInteger.java:420)
> ...
> at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1243)
> ...
> at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:91)
> ...
> {code}
> or, for other non-numeric values, similarly with
> {code}
> java.lang.NumberFormatException: For input string: "Foo"
> {code}
> Currently, the table stats can be broken from Spark itself with
> {code}
> scala> spark.sql("alter table t1p set tblproperties('numRows'='', 'STATS_GENERATED_VIA_STATS_TASK'='true')").show()
> {code}
>
> Spark should:
> 1. Validate Spark SQL "alter table" statements and reject non-numeric
> values for the "totalSize", "numRows", and "rawDataSize" table properties, as
> those are checked in
> [HiveClientImpl#readHiveStats()|https://github.com/apache/spark/blob/1aafe60b3e7633f755499f5394ca62289e42588d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1260C15-L1260C28]
> 2. HiveClientImpl#readHiveStats should probably tolerate invalid
> "totalSize", "numRows", and "rawDataSize" table properties: instead of failing
> with a cryptic NumberFormatException, it should treat them as zero. Or at least
> it should say in the error message which table property is incorrect.
> Note: beeline/Hive validates alter table statements. Impala, however, can
> break the table in the same way, so item #1 above needs to be fixed there too.
> I have checked only the Spark 2.4.x behavior; the same problem probably exists
> in Spark 3.x too.
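The two proposals in the description could be sketched roughly as below. This is only an illustration under stated assumptions, not Spark's actual code: the object `StatProps` and its methods `parseStat`, `validate`, and `readLenient` are hypothetical names, and only the three property names come from the issue text. `validate` corresponds to item 1 (reject bad values and name the offending property), and `readLenient` corresponds to item 2 (treat unparsable values as zero instead of throwing).

```scala
import scala.util.Try

// Hypothetical helper, not part of Spark; it only illustrates the two proposals.
object StatProps {
  // Property names taken from the issue description.
  val StatKeys: Seq[String] = Seq("totalSize", "numRows", "rawDataSize")

  // Parse a stat value; None for empty, non-numeric, or negative input.
  // BigInt("") and BigInt("Foo") throw NumberFormatException, which Try absorbs.
  def parseStat(raw: String): Option[BigInt] =
    Try(BigInt(raw.trim)).toOption.filter(_ >= 0)

  // Item 1 (sketch): collect one error message per invalid stat property,
  // naming the property so the user knows what to fix.
  def validate(props: Map[String, String]): Seq[String] =
    StatKeys.flatMap { key =>
      props.get(key)
        .filter(value => parseStat(value).isEmpty)
        .map(value => s"Table property '$key' must be a non-negative integer, got: '$value'")
    }

  // Item 2 (sketch): read the stats leniently, treating invalid values as zero.
  def readLenient(props: Map[String, String]): Map[String, BigInt] =
    StatKeys.flatMap { key =>
      props.get(key).map(value => key -> parseStat(value).getOrElse(BigInt(0)))
    }.toMap
}
```

With the properties from the example above, `StatProps.validate(Map("numRows" -> ""))` would return a single message naming "numRows", while `StatProps.readLenient` would report that property as zero rather than failing.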