[ https://issues.apache.org/jira/browse/SPARK-28413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li updated SPARK-28413:
----------------------------
    Fix Version/s: 3.0.0

> sizeInByte is Not updated for parquet datasource on Next Insert.
> ----------------------------------------------------------------
>
>                 Key: SPARK-28413
>                 URL: https://issues.apache.org/jira/browse/SPARK-28413
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.1
>            Reporter: Babulal
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> SPARK-21237 fixed the statistics update for data appended using
> write.mode("append"). But when the same type of parquet table is created using SQL
> and data is inserted, the stats shown are incorrect (not updated).
> *+Correct Stats Example (SPARK-21237)+*
> scala> spark.range(100).write.saveAsTable("tab1")
> scala> spark.sql("explain cost select * from tab1").show(false)
> +------------------------------------------------------------------------
> |plan
> +------------------------------------------------------------------------|
> |== Optimized Logical Plan ==
> Relation[id#10L] parquet, Statistics(*sizeInBytes=784.0 B*, hints=none)|
> == Physical Plan ==
> FileScan parquet default.tab1[id#10L] Batched: false, Format: Parquet,
> scala> spark.range(100).write.mode("append").saveAsTable("tab1")
> scala> spark.sql("explain cost select * from tab1").show(false)
> +----------------------------------------------------------------------
> |plan
> +----------------------------------------------------------------------|
> |== Optimized Logical Plan ==
> Relation[id#23L] parquet, Statistics(*sizeInBytes=1568.0 B*, hints=none)|
> == Physical Plan ==
> FileScan parquet default.tab1[id#23L] Batched: false, Format: Parquet,
>
>
> *+Incorrect Stats Example+*
> scala> spark.sql("create table tab2(id bigint) using parquet")
> res6: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("explain cost select * from tab2").show(false)
> +----------------------------------------------------------------------
> |plan
> +----------------------------------------------------------------------|
> |== Optimized Logical Plan ==
> Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)|
> == Physical Plan ==
> FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
>
> scala> spark.sql("insert into tab2 select 1")
> res9: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("explain cost select * from tab2").show(false)
> +----------------------------------------------------------------------
> |plan
> +----------------------------------------------------------------------|
> |== Optimized Logical Plan ==
> Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)|
> == Physical Plan ==
> FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
>
>
> Both tables are the same type of table:
> scala> spark.sql("desc formatted tab1").show(2000,false)
>
> +-----------------------------+-------------------------------------------------------------+
> |col_name|data_type|
> +-----------------------------+-------------------------------------------------------------+
> |id|bigint|
> | | |
> | # Detailed Table Information| |
> |Database|default|
> |Table|tab1|
> |Owner|Administrator|
> |Created Time|Tue Jul 16 21:08:35 IST 2019|
> |Last Access|Thu Jan 01 05:30:00 IST 1970|
> |Created By|Spark 2.3.2|
> |Type|MANAGED|
> |Provider|parquet|
> |Table Properties|[transient_lastDdlTime=1563291579]|
> |Statistics|1568 bytes|
> |Location|file:/x/2|
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe|
> |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat|
> |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|
>
> scala> spark.sql("desc formatted tab2").show(2000,false)
>
> +-----------------------------+-------------------------------------------------------------
> |col_name|data_type
> +-----------------------------+-------------------------------------------------------------|
> |id|bigint|
> | |
> | # Detailed Table Information|
> |Database|default|
> |Table|tab2|
> |Owner|Administrator|
> |Created Time|Tue Jul 16 21:10:24 IST 2019|
> |Last Access|Thu Jan 01 05:30:00 IST 1970|
> |Created By|Spark 2.3.2|
> |Type|MANAGED|
> |Provider|parquet|
> |Table Properties|[transient_lastDdlTime=1563291624]|
> |Location|file:/x/1|
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe|
> |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat|
> |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|

--
This message was sent by Atlassian Jira (v8.3.4#803005)