Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/20959

@rxin I ran an experiment on JSON files; the numbers for CSV are almost the same. For example, inferring the schema of a 50GB JSON file with

```
scala> spark.read.option("samplingRatio", 0.000000001).json("test.json")
```

took 1.7 minutes, while

```
scala> spark.read.option("samplingRatio", 1.0).json("test.json")
```

took 21.9 minutes.

I looked in a profiler at where Spark spends time during schema inference for the 50GB JSON file: at least on my laptop, about 75% goes to JSON parsing and 18% to disk IO. Of course, the numbers would be different on a cluster where the files are read from S3 over the network. In any case, the samplingRatio option gives us a way to balance CPU load against network/disk IO.

@HyukjinKwon The question is not about a workaround, it is about usability:

1. For interactive queries, a user doesn't have to write boilerplate code if the option exists.
2. If the code is used inside a library, developers don't have to handle special cases like "if the input is JSON, use the samplingRatio option, otherwise do sampling manually". (A sketch of that manual workaround follows below.)

Additionally, the behavior behind the option could be improved in the future, for example by requiring fewer file reads during sampling. That would be easier to do with the option in place.
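For context, the manual sampling workaround mentioned above would look roughly like the sketch below. It is a minimal illustration for spark-shell, assuming a hypothetical `test.json` path and an arbitrary 1% sample fraction, not the exact code anyone in this thread proposed:

```scala
// Sketch of sampling manually when no samplingRatio option is available:
// read the raw lines, sample a fraction, infer a schema from the sample,
// then read the full file with the inferred schema.
val sampledLines = spark.read.textFile("test.json")
  .sample(withReplacement = false, fraction = 0.01)   // fraction is illustrative

val inferredSchema = spark.read.json(sampledLines).schema

// Full read, skipping a second inference pass over the whole file.
val df = spark.read.schema(inferredSchema).json("test.json")
```

With the `samplingRatio` option, the same effect is a single `spark.read.option("samplingRatio", 0.01).json("test.json")` call, which is the usability point made above.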