[GitHub] [spark] kimtkyeom commented on a change in pull request #27888: [SPARK-31116][SQL] Consider case sensitivity in ParquetRowConverter

GitBox Thu, 12 Mar 2020 03:18:14 -0700

kimtkyeom commented on a change in pull request #27888: [SPARK-31116][SQL] Consider case sensitivity in ParquetRowConverter
URL: https://github.com/apache/spark/pull/27888#discussion_r391521516


 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
 ##########
 @@ -804,6 +804,162 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
     }
   }
 
+  test("SPARK-31116: Select simple parquet columns correctly in case insensitive manner") {
 
 Review comment:
   I also ran these tests against the ORC and JSON file formats, and some of them fail.
   
   ## JSON test failure
   JSON passed the case-sensitive cases, but it failed the case-insensitive ones.
   ```
   [info] - SPARK-31116: Select simple columns correctly in case insensitive manner *** FAILED *** (4 seconds, 277 milliseconds)
   [info]   Results do not match for query:
   [info]   Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
   [info]   Timezone Env:
   [info]
   [info]   == Parsed Logical Plan ==
   [info]   Relation[camelcase#56] json
   [info]
   [info]   == Analyzed Logical Plan ==
   [info]   camelcase: string
   [info]   Relation[camelcase#56] json
   [info]
   [info]   == Optimized Logical Plan ==
   [info]   Relation[camelcase#56] json
   [info]
   [info]   == Physical Plan ==
   [info]   FileScan json [camelcase#56] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex[file:/Users/kimtkyeom/Dev/spark_devel/target/tmp/spark-95f1357a-85c9-444f-bdcc-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<camelcase:string>
   [info]
   [info]   == Results ==
   [info]
   [info]   == Results ==
   [info]   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
   [info]   !struct<>                   struct<camelcase:string>
   [info]   ![A]                        [null] (QueryTest.scala:248)
   ```
   
   ```
   [info] - SPARK-31116: Select nested columns correctly in case insensitive manner *** FAILED *** (2 seconds, 117 milliseconds)
   [info]   Results do not match for query:
   [info]   Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
   [info]   Timezone Env:
   [info]
   [info]   == Parsed Logical Plan ==
   [info]   Relation[StructColumn#147] json
   [info]
   [info]   == Analyzed Logical Plan ==
   [info]   StructColumn: struct<LowerCase:bigint,camelcase:bigint>
   [info]   Relation[StructColumn#147] json
   [info]
   [info]   == Optimized Logical Plan ==
   [info]   Relation[StructColumn#147] json
   [info]
   [info]   == Physical Plan ==
   [info]   FileScan json [StructColumn#147] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex[file:/Users/kimtkyeom/Dev/spark_devel/target/tmp/spark-f9ecd1a4-e5aa-4dd7-bdfd-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<StructColumn:struct<LowerCase:bigint,camelcase:bigint>>
   [info]
   [info]   == Results ==
   [info]
   [info]   == Results ==
   [info]   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
   [info]   !struct<>                   struct<StructColumn:struct<LowerCase:bigint,camelcase:bigint>>
   [info]   ![[0,1]]                    [[null,null]] (QueryTest.scala:248)
   ```
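
   For reference, the `[null]` and `[[null,null]]` answers above are what I would expect if the JSON reader looks up the requested field name with an exact match only. Below is a minimal plain-Scala sketch of the two lookup modes. `CaseSensitivityDemo`, `resolveField`, and the on-disk spelling `camelCase` are hypothetical names chosen for illustration; this is not Spark's actual implementation.

   ```scala
   object CaseSensitivityDemo {
     // Hypothetical helper for illustration only; not Spark's implementation.
     // Finds the file field that corresponds to a field name from the read schema.
     def resolveField(
         fileFields: Seq[String],
         requested: String,
         caseSensitive: Boolean): Option[String] = {
       if (caseSensitive) fileFields.find(_ == requested)
       else fileFields.find(_.equalsIgnoreCase(requested))
     }

     // Assumed spelling of the field as it was written to the file.
     val fileFields: Seq[String] = Seq("camelCase")

     // Exact match misses, so a reader would have nothing to fill the column with.
     val exact: Option[String] =
       resolveField(fileFields, "camelcase", caseSensitive = true)

     // Case-insensitive match finds the field, so its value could be read.
     val relaxed: Option[String] =
       resolveField(fileFields, "camelcase", caseSensitive = false)
   }
   ```

   The sketch returns the first match and ignores the ambiguous case where the file contains several fields that differ only in case.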
   ## ORC test failure
   ORC passed the case-insensitive test cases, but it failed the nested-column test in case-sensitive mode.
   ```
   [info] - SPARK-31116: Select nested columns correctly in case sensitive manner *** FAILED *** (871 milliseconds)
   [info]   Results do not match for query:
   [info]   Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
   [info]   Timezone Env:
   [info]
   [info]   == Parsed Logical Plan ==
   [info]   Relation[StructColumn#329] json
   [info]
   [info]   == Analyzed Logical Plan ==
   [info]   StructColumn: struct<LowerCase:bigint,camelcase:bigint>
   [info]   Relation[StructColumn#329] json
   [info]
   [info]   == Optimized Logical Plan ==
   [info]   Relation[StructColumn#329] json
   [info]
   [info]   == Physical Plan ==
   [info]   FileScan json [StructColumn#329] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex[file:/Users/kimtkyeom/Dev/spark_devel/target/tmp/spark-612baf76-a9d0-41e5-89f4-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<StructColumn:struct<LowerCase:bigint,camelcase:bigint>>
   [info]
   [info]   == Results ==
   [info]
   [info]   == Results ==
   [info]   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
   [info]   !struct<>                   struct<StructColumn:struct<LowerCase:bigint,camelcase:bigint>>
   [info]   ![null]                     [[null,null]] (QueryTest.scala:248)
   ```
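   But I think the ORC failure comes from a difference in how the Row is materialized: the expected answer is a null struct (`[null]`), while the reader returns a struct whose fields are all null (`[[null,null]]`). Is there a clean way to test this properly?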
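
   To make the mismatch concrete, here is a small plain-Scala model of the two shapes, using `Option` instead of Spark's `Row`. `NullStructDemo`, `Struct`, `Record`, and `normalize` are hypothetical names for this sketch only. Collapsing an all-null struct into a null struct before comparing is one possible approach, and I'm not sure it is the right one for this suite.

   ```scala
   object NullStructDemo {
     // Hypothetical stand-ins for a nested row; these are not Spark types.
     final case class Struct(lowerCase: Option[Long], camelCase: Option[Long])
     final case class Record(structColumn: Option[Struct])

     // Expected answer in the log, [null]: the struct column itself is null.
     val nullStruct: Record = Record(None)

     // Actual answer in the log, [[null,null]]: a struct with only null fields.
     val structOfNulls: Record = Record(Some(Struct(None, None)))

     // Collapse a struct whose fields are all null into a null struct,
     // so that both shapes compare equal afterwards.
     def normalize(r: Record): Record = r.structColumn match {
       case Some(Struct(None, None)) => Record(None)
       case _ => r
     }
   }
   ```

   With this model the two records differ as written, but compare equal after `normalize`. The trade-off is that a test normalizing this way can no longer tell the two shapes apart.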
