This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-2.4 by this push:
new ce545d1 [SPARK-32888][DOCS] Add user document about header flag and RDD as path for reading CSV
ce545d1 is described below
commit ce545d18b9c3d1ff908807c11c2a8273cc21e607
Author: Liang-Chi Hsieh <[email protected]>
AuthorDate: Wed Sep 16 20:16:15 2020 +0900
[SPARK-32888][DOCS] Add user document about header flag and RDD as path for reading CSV
### What changes were proposed in this pull request?
This proposes to enhance the user document of the API for loading a Dataset of strings storing CSV rows. If the header option is set to true, the API removes every line that is identical to the header, not only the first one.
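As a plain-Python sketch of the documented behavior (an illustration only, not the actual Spark code path; the function name is hypothetical), header handling over a collection of CSV lines could look like this:

```python
def strip_header_lines(csv_lines, header=True):
    """Illustrative sketch: with the header option enabled, every line
    identical to the header is removed, not just the first occurrence."""
    if not header or not csv_lines:
        return list(csv_lines)
    header_line = csv_lines[0]
    # Drop all occurrences of the header line, including repeats that
    # can appear mid-data (e.g. when several CSV files were unioned).
    return [line for line in csv_lines if line != header_line]

lines = ["name,age", "alice,30", "name,age", "bob,25"]
print(strip_header_lines(lines))  # ['alice,30', 'bob,25']
```

This is why the behavior can surprise users: a data row that happens to match the header string is also dropped, which is what the doc change calls out.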
### Why are the changes needed?
This behavior can confuse users. We should explicitly document it.
### Does this PR introduce _any_ user-facing change?
No. Only doc change.
### How was this patch tested?
Only doc change.
Closes #29765 from viirya/SPARK-32888.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 550c1c9cfb5e6439cdd835388fe90a9ca1ebc695)
Signed-off-by: HyukjinKwon <[email protected]>
---
python/pyspark/sql/readwriter.py | 3 +++
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 3 +++
2 files changed, 6 insertions(+)
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index a36ddf9..c95492b 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -374,6 +374,9 @@ class DataFrameReader(OptionUtils):
                        character. By default (None), it is disabled.
         :param header: uses the first line as names of columns. If None is
                        set, it uses the default value, ``false``.
+
+                       .. note:: if the given path is a RDD of Strings, this header
+                           option will remove all lines same with the header if exists.
         :param inferSchema: infers the input schema automatically from data.
                        It requires one extra pass over the data. If None is
                        set, it uses the default value, ``false``.
         :param enforceSchema: If it is set to ``true``, the specified or
                        inferred schema will be
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index ce0a4e8..ac4654f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -500,6 +500,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * If the enforceSchema is set to `false`, only the CSV header in the first line is checked
    * to conform specified or inferred schema.
    *
+   * @note if `header` option is set to `true` when calling this API, all lines same with
+   * the header will be removed if exists.
+   *
    * @param csvDataset input Dataset with one CSV row per record
    * @since 2.2.0
    */
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]