Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20963#discussion_r180535344
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
     assert(df.schema === expectedSchema)
   }
 }
+
+  test("SPARK-23849: schema inferring touches less data if samplingRatio < 1.0") {
+    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+      57, 62, 68, 72)
+    withTempPath { path =>
+      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+      for (i <- 0 until 100) {
+        if (predefinedSample.contains(i)) {
+          writer.write(s"""{"f1":${i.toString}}""" + "\n")
+        } else {
+          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+        }
+      }
+      writer.close()
+
+      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff ---
It seems specifying only `spark.sql.files.maxPartitionBytes` is not enough.
Please look at the
[formula](https://github.com/apache/spark/blob/400a1d9e25c1196f0be87323bd89fb3af0660166/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L406)
and the [slicing of input files](https://github.com/apache/spark/blob/400a1d9e25c1196f0be87323bd89fb3af0660166/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L415):
```
val maxSplitBytes = Math.min(defaultMaxSplitBytes,
  Math.max(openCostInBytes, bytesPerCore))
```
Is it ok if I just check that the file size is less than `maxSplitBytes`?
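A minimal sketch of what I mean, assuming the check is added at the end of the test, that `path` is the `java.io.File` from `withTempPath`, and that the limits are read through `SQLConf`'s `filesMaxPartitionBytes` / `filesOpenCostInBytes` accessors (the formula is just copied from the linked code, so it may drift if that code changes):
```
val defaultMaxSplitBytes = spark.sessionState.conf.filesMaxPartitionBytes
val openCostInBytes = spark.sessionState.conf.filesOpenCostInBytes
val defaultParallelism = spark.sparkContext.defaultParallelism

// `path` has already been written and closed above, so length() is final.
val fileSize = path.length()
// Same computation as at the linked lines, specialized to a single input file.
val bytesPerCore = (fileSize + openCostInBytes) / defaultParallelism
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))

// If the file fits into one split, the sampled rows do not depend on partitioning.
assert(fileSize < maxSplitBytes)
```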
---