Nicholas Chammas created SPARK-15193:
-------------------------------------
             Summary: samplingRatio should default to 1.0 across the board
                 Key: SPARK-15193
                 URL: https://issues.apache.org/jira/browse/SPARK-15193
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
            Reporter: Nicholas Chammas
            Priority: Minor

The default sampling ratio for {{jsonRDD}} is [1.0|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonRDD], whereas for {{createDataFrame}} it's [{{None}}|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame].

I think the default sampling ratio should be 1.0 across the board. Users should have to explicitly supply a lower sampling ratio if they know their dataset has a consistent structure. Otherwise, the "safer" default is to check all the data.

Targeting this for 2.0 in case we consider it a breaking change that would be more difficult to get in later.
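To illustrate why a 1.0 default is "safer", here is a minimal pure-Python sketch of ratio-based schema inference (not PySpark's actual implementation): field names are inferred only from the sampled rows, so a low ratio can miss a field that appears in few records, while a ratio of 1.0 always finds it. The function name `infer_fields` and the data are hypothetical.

```python
import random

def infer_fields(rows, sampling_ratio=1.0):
    """Infer the union of field names from dict-shaped rows.

    sampling_ratio=1.0 inspects every row (the default this issue
    argues for); lower values inspect only a random fraction, which
    can silently drop rarely-occurring fields from the schema.
    """
    if sampling_ratio >= 1.0:
        sample = rows
    else:
        sample = [r for r in rows if random.random() < sampling_ratio]
    fields = set()
    for row in sample:
        fields.update(row.keys())
    return fields

# 999 uniform rows plus one row carrying a rare extra field "b".
rows = [{"a": 1}] * 999 + [{"a": 1, "b": 2}]

full = infer_fields(rows, sampling_ratio=1.0)    # always sees "b"
random.seed(0)
partial = infer_fields(rows, sampling_ratio=0.01)  # may miss "b"
```

With `sampling_ratio=1.0` the inferred schema is guaranteed to include {{b}}; with a small ratio it depends on which rows happen to be sampled, which is exactly the inconsistency a user should have to opt into explicitly.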