[GitHub] [spark] MaxGekk commented on a change in pull request #32558: [SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading in CSV and JSON

GitBox Wed, 02 Jun 2021 00:21:22 -0700


MaxGekk commented on a change in pull request #32558:
URL: https://github.com/apache/spark/pull/32558#discussion_r643710679




##########
File path: docs/sql-data-sources-csv.md
##########
@@ -248,5 +248,11 @@ Data source options of CSV can be set via:
     <td>Compression codec to use when saving to file. This can be one of the 
known case-insensitive shorten names (<code>none</code>, <code>bzip2</code>, 
<code>gzip</code>, <code>lz4</code>, <code>snappy</code> and 
<code>deflate</code>).</td>
     <td>write</td>
   </tr>
+  <tr>
+    <td><code>inferDateType</code></td>
+    <td>false</td>
+    <td>Infers all DateType format for the CSV. If this is not set, it uses 
the default value, <code>false</code>.</td>

Review comment:
       What do you mean by `all DateType format`? I think we should say about 
the `dateFormat` option here.

##########
File path: docs/sql-data-sources-json.md
##########
@@ -155,6 +155,12 @@ Data source options of JSON can be set via:
     <td>Allows leading zeros in numbers (e.g. 00012). If None is set, it uses 
the default value, <code>false</code>.</td>
     <td>read</td>
   </tr>
+  <tr>
+    <td><code>inferDateType</code></td>
+    <td>false</td>
+    <td>Infers all DateType format for the JSON. If this is not set, it uses 
the default value, <code>false</code>.</td>

Review comment:
       **the** JSON ? Is it needed?

##########
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala
##########
@@ -112,4 +112,14 @@ class JsonInferSchemaSuite extends SparkFunSuite with 
SQLHelper {
     checkType(Map("inferTimestamp" -> "true"), json, TimestampType)
     checkType(Map("inferTimestamp" -> "false"), json, StringType)
   }
+
+  test("SPARK-34953 - Allow DateType format while inferring") {

Review comment:
       ```suggestion
     test("SPARK-34953: Allow DateType format while inferring") {
   ```

##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
##########
@@ -109,6 +116,7 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
         case LongType => tryParseLong(field)
         case _: DecimalType => tryParseDecimal(field)
         case DoubleType => tryParseDouble(field)
+        case DateType => tryParseDateFormat(field)

Review comment:
       Just curious why you try to infer dates before timestamps?

##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
##########
@@ -206,6 +206,12 @@ class CSVOptions(
     sep
   }
 
+  /**
+   * option to infer date Type in the schema

Review comment:
       For me as an user, it is not clear the relation between the 
`inferSchema` option and this one. Does this option enable inferring 
independently from inferSchema`?

##########
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala
##########
@@ -192,4 +192,27 @@ class CSVInferSchemaSuite extends SparkFunSuite with 
SQLHelper {
     Seq("en-US").foreach(checkDecimalInfer(_, StringType))
     Seq("ko-KR", "ru-RU", "de-DE").foreach(checkDecimalInfer(_, DecimalType(7, 
0)))
   }
+
+  test("SPARK-34953 - DateType should be inferred when user defined format are 
provided") {

Review comment:
       The common convention is: `SPARK-XXXXX: ...`
   ```suggestion
     test("SPARK-34953: DateType should be inferred when user defined format 
are provided") {
   ```
   

##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
##########
@@ -160,6 +168,16 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
   private def tryParseDouble(field: String): DataType = {
     if ((allCatch opt field.toDouble).isDefined || isInfOrNan(field)) {
       DoubleType
+    } else {
+      tryParseDateFormat(field)
+    }
+  }
+
+  private def tryParseDateFormat(field: String): DataType = {
+    if (options.inferDateType
+      && !dateFormatter.isInstanceOf[LegacySimpleDateFormatter]

Review comment:
       I didn't get the check. Could you explain, please, why do you avoid the 
formatter?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] MaxGekk commented on a change in pull request #32558: [SPARK-34953][CORE][SQL] Add the code change for adding the DateType in the infer schema while reading in CSV and JSON

Reply via email to