spark git commit: [SPARK-13425][SQL] Documentation for CSV datasource options
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 a6428292f -> 705172202


[SPARK-13425][SQL] Documentation for CSV datasource options

## What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading and writing.

## How was this patch tested?

Style tests with `./dev/run_tests` for documentation style.

Author: hyukjinkwon
Author: Hyukjin Kwon

Closes #12817 from HyukjinKwon/SPARK-13425.

(cherry picked from commit a832cef11233c6357c7ba7ede387b432e6b0ed71)
Signed-off-by: Reynold Xin

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/70517220
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/70517220
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/70517220

Branch: refs/heads/branch-2.0
Commit: 7051722023b98f1720142c7b3b41948d275ea455
Parents: a642829
Author: hyukjinkwon
Authored: Sun May 1 19:05:20 2016 -0700
Committer: Reynold Xin
Committed: Sun May 1 19:05:32 2016 -0700

----------------------------------------------------------------------
 python/pyspark/sql/readwriter.py               | 52 +++++++++++++++
 .../org/apache/spark/sql/DataFrameReader.scala | 47 ++++++++++---
 .../org/apache/spark/sql/DataFrameWriter.scala |  8 +++
 3 files changed, 103 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/70517220/python/pyspark/sql/readwriter.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index ed9e716..cc5e93d 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -282,6 +282,45 @@ class DataFrameReader(object):

         :param paths: string, or list of strings, for input path(s).

+        You can set the following CSV-specific options to deal with CSV files:
+            * ``sep`` (default ``,``): sets the single character as a separator \
+                for each field and value.
+            * ``charset`` (default ``UTF-8``): decodes the CSV files by the given \
+                encoding type.
+            * ``quote`` (default ``"``): sets the single character used for escaping \
+                quoted values where the separator can be part of the value.
+            * ``escape`` (default ``\``): sets the single character used for escaping quotes \
+                inside an already quoted value.
+            * ``comment`` (default empty string): sets the single character used for skipping \
+                lines beginning with this character. By default, it is disabled.
+            * ``header`` (default ``false``): uses the first line as names of columns.
+            * ``ignoreLeadingWhiteSpace`` (default ``false``): defines whether or not leading \
+                whitespaces from values being read should be skipped.
+            * ``ignoreTrailingWhiteSpace`` (default ``false``): defines whether or not trailing \
+                whitespaces from values being read should be skipped.
+            * ``nullValue`` (default empty string): sets the string representation of a null value.
+            * ``nanValue`` (default ``NaN``): sets the string representation of a non-number \
+                value.
+            * ``positiveInf`` (default ``Inf``): sets the string representation of a positive \
+                infinity value.
+            * ``negativeInf`` (default ``-Inf``): sets the string representation of a negative \
+                infinity value.
+            * ``dateFormat`` (default ``None``): sets the string that indicates a date format. \
+                Custom date formats follow the formats at ``java.text.SimpleDateFormat``. This \
+                applies to both date type and timestamp type. By default, it is None which means \
+                trying to parse times and date by ``java.sql.Timestamp.valueOf()`` and \
+                ``java.sql.Date.valueOf()``.
+            * ``maxColumns`` (default ``20480``): defines a hard limit of how many columns \
+                a record can have.
+            * ``maxCharsPerColumn`` (default ``1000000``): defines the maximum number of \
+                characters allowed for any given value being read.
+            * ``mode`` (default ``PERMISSIVE``): allows a mode for dealing with corrupt records \
+                during parsing.
+                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted record. \
+                    When a schema is set by user, it sets ``null`` for extra fields.
+                * ``DROPMALFORMED`` : ignores the whole corrupted records.
+                * ``FAILFAST`` : throws an exception when it meets corrupted records.
+
         >>> df = sqlContext.read.csv('python/test_support/sql/ages.csv')
         >>> df.dtypes
         [('C0', 'string'), ('C1', 'string')]
@@ -663,6 +702,19 @@ class DataFrameWriter(object):
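The options documented above are supplied through `DataFrameReader.option()` before calling `csv()`. A minimal sketch of a few of them in use, assuming a `SQLContext` named `sqlContext` (as in the doctest above) and a hypothetical semicolon-separated input file `people.csv`:

```python
# Minimal sketch: reading CSV with the options documented in this patch.
# `sqlContext` is assumed to exist; the path `people.csv` is hypothetical.
df = (sqlContext.read
      .option("header", "true")         # use the first line as column names
      .option("sep", ";")               # fields separated by ';' instead of ','
      .option("nullValue", "NA")        # read the literal string "NA" as null
      .option("mode", "DROPMALFORMED")  # drop corrupted records entirely
      .csv("people.csv"))
df.dtypes
```

With `mode` left at its default `PERMISSIVE`, the corrupted records would instead be kept, with the unparsable fields set to null.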
spark git commit: [SPARK-13425][SQL] Documentation for CSV datasource options
Repository: spark
Updated Branches:
  refs/heads/master a6428292f -> a832cef11


[SPARK-13425][SQL] Documentation for CSV datasource options

## What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading and writing.

## How was this patch tested?

Style tests with `./dev/run_tests` for documentation style.

Author: hyukjinkwon
Author: Hyukjin Kwon

Closes #12817 from HyukjinKwon/SPARK-13425.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a832cef1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a832cef1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a832cef1

Branch: refs/heads/master
Commit: a832cef11233c6357c7ba7ede387b432e6b0ed71
Parents: a642829
Author: hyukjinkwon
Authored: Sun May 1 19:05:20 2016 -0700
Committer: Reynold Xin
Committed: Sun May 1 19:05:20 2016 -0700

----------------------------------------------------------------------
 python/pyspark/sql/readwriter.py               | 52 +++++++++++++++
 .../org/apache/spark/sql/DataFrameReader.scala | 47 ++++++++++---
 .../org/apache/spark/sql/DataFrameWriter.scala |  8 +++
 3 files changed, 103 insertions(+), 4 deletions(-)
----------------------------------------------------------------------
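Both messages carry the same patch, so the master commit's diff matches the branch-2.0 one shown above. The diffstat also lists an 8-line addition to `DataFrameWriter.scala`, but that hunk is not reproduced here, so the exact write-side options it documents are an assumption. As a rough sketch only, CSV write options are passed the same way as read options, via `option()` before `csv()`:

```python
# Rough sketch of the writer-side counterpart; the DataFrameWriter hunk is
# not shown above, so the specific options used here are an assumption.
# The output directory `out/ages_csv` is hypothetical.
(df.write
   .option("header", "true")   # write column names as the first line
   .option("nullValue", "NA")  # render nulls as the string "NA"
   .mode("overwrite")          # replace the output directory if it exists
   .csv("out/ages_csv"))
```

Note that `mode("overwrite")` here is the save mode of `DataFrameWriter`, distinct from the CSV parse-mode option (`PERMISSIVE`/`DROPMALFORMED`/`FAILFAST`) documented on the reader.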