This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-2.4 by this push:
new ce545d1 [SPARK-32888][DOCS] Add user document about header flag and RDD as path for reading CSV
ce545d1 is described below
commit ce545d18b9c3d1ff908807c11c2a8273cc21e607
Author: Liang-Chi Hsieh <[email protected]>
AuthorDate: Wed Sep 16 20:16:15 2020 +0900
[SPARK-32888][DOCS] Add user document about header flag and RDD as path for reading CSV
### What changes were proposed in this pull request?
This proposes to enhance the user document of the API for loading a Dataset of strings storing CSV rows. If the header option is set to true, the API removes every line that is identical to the header, not only the first one.
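As a plain-Python sketch of the documented behavior (an illustration only, not the actual Spark code path; the function name is hypothetical), header handling over a collection of CSV lines could look like this:

```python
def strip_header_lines(csv_lines, header=True):
    """Illustrative sketch: with the header option enabled, every line
    identical to the header is removed, not just the first occurrence."""
    if not header or not csv_lines:
        return list(csv_lines)
    header_line = csv_lines[0]
    # Drop all occurrences of the header line, including repeats that
    # can appear mid-data (e.g. when several CSV files were unioned).
    return [line for line in csv_lines if line != header_line]

lines = ["name,age", "alice,30", "name,age", "bob,25"]
print(strip_header_lines(lines))  # ['alice,30', 'bob,25']
```

This is why the behavior can surprise users: a data row that happens to match the header string is also dropped, which is what the doc change calls out.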
### Why are the changes needed?
This behavior can confuse users. We should explicitly document it.
### Does this PR introduce _any_ user-facing change?
No. Only doc change.
### How was this patch tested?
Only doc change.
Closes #29765 from viirya/SPARK-32888.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 550c1c9cfb5e6439cdd835388fe90a9ca1ebc695)
Signed-off-by: HyukjinKwon <[email protected]>
---
python/pyspark/sql/readwriter.py | 3 +++
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 3 +++
2 files changed, 6 insertions(+)
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index a36ddf9..c95492b 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -374,6 +374,9 @@ class DataFrameReader(OptionUtils):
                        character. By default (None), it is disabled.
         :param header: uses the first line as names of columns. If None is
                        set, it uses the default value, ``false``.
+
+                       .. note:: if the given path is a RDD of Strings, this header
+                           option will remove all lines same with the header if exists.
         :param inferSchema: infers the input schema automatically from data.
                        It requires one extra pass over the data. If None is
                        set, it uses the default value, ``false``.
         :param enforceSchema: If it is set to ``true``, the specified or
                        inferred schema will be
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index ce0a4e8..ac4654f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -500,6 +500,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * If the enforceSchema is set to `false`, only the CSV header in the first line is checked
    * to conform specified or inferred schema.
    *
+   * @note if `header` option is set to `true` when calling this API, all lines same with
+   * the header will be removed if exists.
+   *
    * @param csvDataset input Dataset with one CSV row per record
    * @since 2.2.0
    */
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]