[GitHub] [spark] tinhto-000 opened a new pull request #29419: [SPARK-31703][SQL] Parquet RLE float/double are read incorrectly on big endian platforms

GitBox Wed, 12 Aug 2020 09:30:08 -0700


tinhto-000 opened a new pull request #29419:
URL: https://github.com/apache/spark/pull/29419



   ### What changes were proposed in this pull request?
   (back-porting from 
https://github.com/apache/spark/commit/9a3811dbf5f1234c1587337a3d74823d1d163b53)
   
   This PR fixes a regression introduced by SPARK-26985.
   
   SPARK-26985 changed the `putDoubles()` and `putFloats()` methods to respect 
the platform's endianness.  As a side effect, the RLE paths in 
`VectorizedRleValuesReader.java` now read the RLE entries in Parquet as big 
endian on big endian platforms (i.e., as is), even though Parquet data is 
always stored in little endian format.
   
   The comments in `WritableColumnVector.java` say those methods are used for 
"ieee formatted doubles in platform native endian" (or floats). Since the 
data in Parquet is always little endian, those methods are not appropriate 
for reading it.
   
   To demonstrate the problem with spark-shell:
   
   ```scala
   import org.apache.spark._
   import org.apache.spark.sql._
   import org.apache.spark.sql.types._
       
   val data = Seq(
     (1.0, 0.1),
     (2.0, 0.2),
     (0.3, 3.0),
     (4.0, 4.0),
     (5.0, 5.0))

   // Write the doubles out as Parquet, then read them back.
   spark.createDataFrame(data).write.mode(SaveMode.Overwrite).parquet("/tmp/data.parquet2")
   val df2 = spark.read.parquet("/tmp/data.parquet2")
   df2.show()
   ```
   
   result:
   
   ```
   +--------------------+--------------------+
   |                  _1|                  _2|
   +--------------------+--------------------+
   |           3.16E-322|-1.54234871366845...|
   |         2.0553E-320|         2.0553E-320|
   |          2.561E-320|          2.561E-320|
   |4.66726145843124E-62|         1.0435E-320|
   |        3.03865E-319|-1.54234871366757...|
   +--------------------+--------------------+
   ```
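   
   The values above look like IEEE 754 doubles whose eight bytes were 
   interpreted in the wrong order. As an illustrative sketch only (plain Java, 
   no Spark; the class and method names are hypothetical and not part of this 
   PR), the following encodes a double as little endian, the byte order 
   Parquet uses, and then decodes the same bytes as big endian. For 2.0 the 
   raw bits come out as `0x40`, a tiny subnormal on the order of the 
   `3.16E-322` shown in the table.
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;

   // Hypothetical demo class; it only illustrates the byte-order mismatch.
   final class EndianMismatchDemo {

       // Encode `value` as 8 little endian bytes, then decode them as big endian.
       static double misread(double value) {
           byte[] littleEndianBytes = ByteBuffer.allocate(8)
               .order(ByteOrder.LITTLE_ENDIAN)
               .putDouble(value)
               .array();
           return ByteBuffer.wrap(littleEndianBytes)
               .order(ByteOrder.BIG_ENDIAN)
               .getDouble();
       }

       public static void main(String[] args) {
           for (double v : new double[] {1.0, 2.0, 4.0, 5.0}) {
               double wrong = misread(v);
               // Print the input, the misread value, and its raw bit pattern.
               System.out.println(v + " -> " + wrong
                   + " (bits 0x" + Long.toHexString(Double.doubleToRawLongBits(wrong)) + ")");
           }
       }
   }
   ```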
   
   Tests in `ParquetIOSuite` that involve float/double data also fail, 
e.g., 
   
   - basic data types (without binary)
   - read raw Parquet file
   
   `/examples/src/main/python/mllib/isotonic_regression_example.py` fails as 
well.
   
   The proposed code change adds `putDoublesLittleEndian()` and 
`putFloatsLittleEndian()` methods for the Parquet readers to invoke, just like 
the existing `putIntsLittleEndian()` and `putLongsLittleEndian()`.  On little 
endian platforms they call `putDoubles()` and `putFloats()`; on big endian 
platforms they read the entries as little endian, as was done before 
SPARK-26985.
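   
   The idea behind the new methods can be pictured with a rough sketch. This 
   is not the code in this PR: it uses a plain `double[]` as a stand-in for 
   the column vector's storage, and the class name and parameter names are 
   assumptions made for illustration. The source bytes are always decoded as 
   little endian, whatever the platform's native order is.
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;

   // Hypothetical stand-in for a column vector; not Spark's actual class.
   final class LittleEndianDoubles {

       // Copy `count` doubles from `src` (little endian bytes, starting at
       // `srcIndex`) into `dst` starting at `rowId`.
       static void putDoublesLittleEndian(
               double[] dst, int rowId, int count, byte[] src, int srcIndex) {
           ByteBuffer.wrap(src, srcIndex, count * Double.BYTES)
               .order(ByteOrder.LITTLE_ENDIAN)  // fixed order, not nativeOrder()
               .asDoubleBuffer()
               .get(dst, rowId, count);
       }

       public static void main(String[] args) {
           // Build a little endian byte stream the way a Parquet page stores doubles.
           byte[] src = ByteBuffer.allocate(2 * Double.BYTES)
               .order(ByteOrder.LITTLE_ENDIAN)
               .putDouble(1.0)
               .putDouble(0.1)
               .array();
           double[] dst = new double[2];
           putDoublesLittleEndian(dst, 0, 2, src, 0);
           System.out.println(dst[0] + ", " + dst[1]);
       }
   }
   ```
   
   Because the byte order is fixed rather than taken from 
   `ByteOrder.nativeOrder()`, the sketch decodes the same values on little and 
   big endian machines. The change described above additionally delegates to 
   the existing native-order methods on little endian platforms.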
   
   No new unit test is introduced because the existing ones are sufficient.
   
   ### Why are the changes needed?
   RLE float/double data in parquet files will not be read back correctly on 
big endian platforms.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   All unit tests (`mvn test`) were run and passed.

