Nicholas Chammas created SPARK-15193:
----------------------------------------
Summary: samplingRatio should default to 1.0 across the board
Key: SPARK-15193
URL: https://issues.apache.org/jira/browse/SPARK-15193
Project: Spark
Issue Type: Improvement
Components: PySpark, SQL
Reporter: Nicholas Chammas
Priority: Minor
The default sampling ratio for {{jsonRDD}} is
[1.0|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonRDD],
whereas for {{createDataFrame}} it's
[{{None}}|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame].
I think the default sampling ratio should be 1.0 across the board. Users who
know their dataset has a consistent structure should have to explicitly opt in
to a lower sampling ratio. Otherwise, the safer default is to check all the
data when inferring the schema.
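A minimal plain-Python sketch of why a full scan is the safer default. This is
a hypothetical `infer_schema` helper for illustration only, not Spark's actual
inference code; it treats the "schema" as just the union of field names seen:

```python
import random

def infer_schema(records, sampling_ratio=1.0, seed=0):
    """Infer the set of field names from a fraction of the records.

    Hypothetical sketch: a real inferrer would also merge types, but the
    sampling hazard is the same -- rare fields can be missed entirely.
    """
    rng = random.Random(seed)
    schema = set()
    for rec in records:
        if sampling_ratio >= 1.0 or rng.random() < sampling_ratio:
            schema.update(rec.keys())
    return schema

# 1000 records share the field "a"; exactly one record carries "rare".
records = [{"a": 1} for _ in range(1000)]
records.append({"a": 1, "rare": 2})

full = infer_schema(records, sampling_ratio=1.0)     # scans every record
sampled = infer_schema(records, sampling_ratio=0.01) # may never see "rare"

assert full == {"a", "rare"}  # a full scan always finds the rare field
assert sampled <= full        # sampling can only miss fields, never add them
```

With a low ratio, whether `"rare"` appears in the result depends on which rows
the sampler happens to pick, which is exactly the inconsistency a 1.0 default
avoids.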
Targeting this for 2.0 in case we consider it a breaking change that would be
more difficult to get in later.