[GitHub] [spark] MaxGekk opened a new pull request #28481: [SPARK-31665][SQL][TESTS] Check parquet dictionary encoding of random dates/timestamps

GitBox Fri, 08 May 2020 11:58:08 -0700


MaxGekk opened a new pull request #28481:
URL: https://github.com/apache/spark/pull/28481



   ### What changes were proposed in this pull request?
   Modified `RandomDataGenerator.forType` so that, for `DateType` and `TimestampType`, it 
generates special date/timestamp values with probability 0.5. The repeated values 
trigger dictionary encoding in the Parquet datasource test `HadoopFsRelationTest` 
"test all data types". Currently, dictionary encoding is tested only for numeric 
types like `ShortType`.
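
   The idea can be sketched roughly as follows. This is a standalone illustration, not the code in the PR: the object `SpecialDateGenerator`, the method `randomDate`, and the particular list of special dates are all hypothetical, and it uses `java.time.LocalDate` rather than Spark's internal types.

```scala
import java.time.LocalDate
import scala.util.Random

object SpecialDateGenerator {
  // Hypothetical "special" values picked for illustration only;
  // the list actually used in the PR is not shown in this message.
  val specialDates: Seq[LocalDate] = Seq(
    LocalDate.of(1, 1, 1),
    LocalDate.of(1582, 10, 15),
    LocalDate.of(1970, 1, 1),
    LocalDate.of(9999, 12, 31))

  // Assumed range for the uniformly distributed branch.
  private val minDay: Long = LocalDate.of(1, 1, 1).toEpochDay
  private val maxDay: Long = LocalDate.of(9999, 12, 31).toEpochDay

  // With probability 0.5 return one of the special dates (so values repeat),
  // otherwise return a date drawn uniformly from [minDay, maxDay].
  def randomDate(rand: Random): LocalDate = {
    if (rand.nextBoolean()) {
      specialDates(rand.nextInt(specialDates.length))
    } else {
      val span = maxDay - minDay + 1
      LocalDate.ofEpochDay(minDay + (rand.nextDouble() * span).toLong)
    }
  }
}
```

   Because about half of the generated values come from a small fixed set, a column filled by such a generator is expected to contain many duplicates.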
   
   ### Why are the changes needed?
   To extend test coverage. Currently, the probability of exercising dictionary 
encoding in the `HadoopFsRelationTest` test "test all data types" for `DateType` 
and `TimestampType` is close to zero, because dates/timestamps are uniformly 
distributed over a wide range and the chance of generating the same value twice 
is very low. As a result, the Parquet datasource cannot apply dictionary encoding 
to columns of these types.
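
   A small experiment illustrates the difference in repetition. It is only a sketch under stated assumptions: `DistinctRatio`, `uniformSample`, `mixedSample`, and the `special` values are hypothetical names and data, raw `Long` values stand in for timestamps, and nothing here touches Parquet itself.

```scala
import scala.util.Random

object DistinctRatio {
  // Hypothetical stand-ins for "special" timestamp values.
  val special: Seq[Long] = Seq(0L, 1L, -1L, Long.MaxValue)

  // Fraction of distinct values in a sample; close to 1.0 means almost no repeats.
  def distinctRatio(values: Seq[Long]): Double =
    values.distinct.size.toDouble / values.size

  // Values drawn uniformly from the whole Long range.
  def uniformSample(rand: Random, n: Int): Seq[Long] =
    Seq.fill(n)(rand.nextLong())

  // Half of the values (on average) come from the small `special` set.
  def mixedSample(rand: Random, n: Int): Seq[Long] =
    Seq.fill(n) {
      if (rand.nextBoolean()) special(rand.nextInt(special.length))
      else rand.nextLong()
    }

  def main(args: Array[String]): Unit = {
    val rand = new Random(0)
    println(s"uniform: ${distinctRatio(uniformSample(rand, 10000))}")
    println(s"mixed:   ${distinctRatio(mixedSample(rand, 10000))}")
  }
}
```

   The uniform sample should consist almost entirely of distinct values, while roughly half of the mixed sample collapses onto a handful of repeated values, which is the kind of data a dictionary can represent compactly.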
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   By running `ParquetHadoopFsRelationSuite` and `JsonHadoopFsRelationSuite`.




