Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    @rxin I ran an experiment on JSON files; the numbers for CSV are almost the same. For example, inferring the schema for 50GB of JSON:
    ```
    scala> spark.read.option("samplingRatio", 0.000000001).json("test.json")
    ```
    took 1.7 minutes, while
    ```
    scala> spark.read.option("samplingRatio", 1.0).json("test.json")
    ```
    took 21.9 minutes.
    
    I looked in a profiler at where Spark spends its time during schema inference for 50GB of JSON. At least on my laptop, about 75% goes to JSON parsing and 18% to disk IO. Of course, the numbers will differ in a cluster where the files are read from S3 over the network. In any case, the samplingRatio option gives us the opportunity to balance CPU load against network/disk IO.
    
    @HyukjinKwon The question is not about a workaround, it is about usability:
    
    1. For interactive queries, a user doesn't have to write boilerplate code if the option exists.
    
    2. If the code is used inside a library, developers don't have to handle special cases such as "if the format is JSON, use the samplingRatio option, otherwise do the sampling manually" (see the sketch below).
    
    Additionally, the behavior behind the option could be improved in the future, for example to require fewer file reads during sampling. That would be easier to do with the option in place.

