Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21556#discussion_r200419939
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith:
           (n: String, v: Any) => FilterApi.eq(
             intColumn(n),
             Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)
    +
    +    case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal =>
    --- End diff --
    
    That doesn't validate the value against the decimal scale from the file, which is what I'm suggesting. The decimal scale must match exactly, and this is a good place to check because this code has the file's schema information. If the scale doesn't match, then the schema used to read this file is incorrect, which would cause data corruption.
    
    In my opinion, it is better to add a check if it is cheap than to debate whether some other part of the code covers the case. If this were happening per record I would opt for a different strategy, but because this runs once per file it is a good idea to add it here.
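    
    A minimal sketch of the kind of check I mean (the helper name `checkDecimalScale`, the assumption that the pushed-down value arrives as a `java.math.BigDecimal`, and the unscaled-int conversion are illustrative, not necessarily what the final PR should use):
    
        import java.math.BigDecimal
        import org.apache.parquet.filter2.predicate.FilterApi
        import org.apache.parquet.filter2.predicate.FilterApi.intColumn
    
        // Hypothetical helper: fail fast if the value's scale does not match the
        // scale recorded in the file's Parquet schema. A mismatch means the read
        // schema is wrong for this file.
        private def checkDecimalScale(columnName: String, value: BigDecimal, fileScale: Int): Unit = {
          require(value.scale == fileScale,
            s"Decimal scale ${value.scale} for column $columnName does not match " +
              s"the file's declared scale $fileScale")
        }
    
        // Sketch of how the INT32-backed decimal case could use it; `decimal` is
        // the DecimalMetadata matched from the file schema.
        case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal =>
          (n: String, v: Any) => {
            Option(v).foreach(d => checkDecimalScale(n, d.asInstanceOf[BigDecimal], decimal.getScale))
            FilterApi.eq(
              intColumn(n),
              Option(v)
                .map(_.asInstanceOf[BigDecimal].unscaledValue().intValueExact().asInstanceOf[Integer])
                .orNull)
          }
    
    The point is only that the check is a cheap per-file comparison against metadata we already have in hand, not extra per-record work.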


---
