spark git commit: [SPARK-13425][SQL] Documentation for CSV datasource options
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 a6428292f -> 705172202


[SPARK-13425][SQL] Documentation for CSV datasource options

## What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading and writing.

## How was this patch tested?

Style tests with `./dev/run_tests` for documentation style.

Author: hyukjinkwon
Author: Hyukjin Kwon

Closes #12817 from HyukjinKwon/SPARK-13425.

(cherry picked from commit a832cef11233c6357c7ba7ede387b432e6b0ed71)
Signed-off-by: Reynold Xin

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/70517220
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/70517220
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/70517220

Branch: refs/heads/branch-2.0
Commit: 7051722023b98f1720142c7b3b41948d275ea455
Parents: a642829
Author: hyukjinkwon
Authored: Sun May 1 19:05:20 2016 -0700
Committer: Reynold Xin
Committed: Sun May 1 19:05:32 2016 -0700

----------------------------------------------------------------------
 python/pyspark/sql/readwriter.py               | 52 +++++++++++++++
 .../org/apache/spark/sql/DataFrameReader.scala | 47 ++++++++++---
 .../org/apache/spark/sql/DataFrameWriter.scala |  8 +++
 3 files changed, 103 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/70517220/python/pyspark/sql/readwriter.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index ed9e716..cc5e93d 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -282,6 +282,45 @@ class DataFrameReader(object):

         :param paths: string, or list of strings, for input path(s).

+        You can set the following CSV-specific options to deal with CSV files:
+            * ``sep`` (default ``,``): sets the single character as a separator \
+                for each field and value.
+            * ``charset`` (default ``UTF-8``): decodes the CSV files by the given \
+                encoding type.
+            * ``quote`` (default ``"``): sets the single character used for escaping \
+                quoted values where the separator can be part of the value.
+            * ``escape`` (default ``\``): sets the single character used for escaping quotes \
+                inside an already quoted value.
+            * ``comment`` (default empty string): sets the single character used for skipping \
+                lines beginning with this character. By default, it is disabled.
+            * ``header`` (default ``false``): uses the first line as names of columns.
+            * ``ignoreLeadingWhiteSpace`` (default ``false``): defines whether or not leading \
+                whitespaces from values being read should be skipped.
+            * ``ignoreTrailingWhiteSpace`` (default ``false``): defines whether or not trailing \
+                whitespaces from values being read should be skipped.
+            * ``nullValue`` (default empty string): sets the string representation of a null value.
+            * ``nanValue`` (default ``NaN``): sets the string representation of a non-number \
+                value.
+            * ``positiveInf`` (default ``Inf``): sets the string representation of a positive \
+                infinity value.
+            * ``negativeInf`` (default ``-Inf``): sets the string representation of a negative \
+                infinity value.
+            * ``dateFormat`` (default ``None``): sets the string that indicates a date format. \
+                Custom date formats follow the formats at ``java.text.SimpleDateFormat``. This \
+                applies to both date type and timestamp type. By default, it is None which means \
+                trying to parse times and date by ``java.sql.Timestamp.valueOf()`` and \
+                ``java.sql.Date.valueOf()``.
+            * ``maxColumns`` (default ``20480``): defines a hard limit of how many columns \
+                a record can have.
+            * ``maxCharsPerColumn`` (default ``1000000``): defines the maximum number of \
+                characters allowed for any given value being read.
+            * ``mode`` (default ``PERMISSIVE``): allows a mode for dealing with corrupt records \
+                during parsing.
+                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted record. \
+                    When a schema is set by user, it sets ``null`` for extra fields.
+                * ``DROPMALFORMED`` : ignores the whole corrupted records.
+                * ``FAILFAST`` : throws an exception when it meets corrupted records.
+
         >>> df = sqlContext.read.csv('python/test_support/sql/ages.csv')
         >>> df.dtypes
         [('C0', 'string'), ('C1', 'string')]
@@ -663,6 +702,19 @@ class DataFrameWriter(object):
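The options documented above are supplied through `DataFrameReader.option()` before calling `csv()`. A minimal sketch of a few of them in use, assuming a `SQLContext` named `sqlContext` (as in the doctest above) and a hypothetical semicolon-separated input file `people.csv`:

```python
# Minimal sketch: reading CSV with the options documented in this patch.
# `sqlContext` is assumed to exist; the path `people.csv` is hypothetical.
df = (sqlContext.read
      .option("header", "true")         # use the first line as column names
      .option("sep", ";")               # fields separated by ';' instead of ','
      .option("nullValue", "NA")        # read the literal string "NA" as null
      .option("mode", "DROPMALFORMED")  # drop corrupted records entirely
      .csv("people.csv"))
df.dtypes
```

With `mode` left at its default `PERMISSIVE`, the corrupted records would instead be kept, with the unparsable fields set to null.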
spark git commit: [SPARK-13425][SQL] Documentation for CSV datasource options
Repository: spark
Updated Branches:
  refs/heads/master a6428292f -> a832cef11


[SPARK-13425][SQL] Documentation for CSV datasource options

## What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading and writing.

## How was this patch tested?

Style tests with `./dev/run_tests` for documentation style.

Author: hyukjinkwon
Author: Hyukjin Kwon

Closes #12817 from HyukjinKwon/SPARK-13425.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a832cef1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a832cef1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a832cef1

Branch: refs/heads/master
Commit: a832cef11233c6357c7ba7ede387b432e6b0ed71
Parents: a642829
Author: hyukjinkwon
Authored: Sun May 1 19:05:20 2016 -0700
Committer: Reynold Xin
Committed: Sun May 1 19:05:20 2016 -0700

----------------------------------------------------------------------
 python/pyspark/sql/readwriter.py               | 52 +++++++++++++++
 .../org/apache/spark/sql/DataFrameReader.scala | 47 ++++++++++---
 .../org/apache/spark/sql/DataFrameWriter.scala |  8 +++
 3 files changed, 103 insertions(+), 4 deletions(-)
----------------------------------------------------------------------
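Both messages carry the same patch, so the master commit's diff matches the branch-2.0 one shown above. The diffstat also lists an 8-line addition to `DataFrameWriter.scala`, but that hunk is not reproduced here, so the exact write-side options it documents are an assumption. As a rough sketch only, CSV write options are passed the same way as read options, via `option()` before `csv()`:

```python
# Rough sketch of the writer-side counterpart; the DataFrameWriter hunk is
# not shown above, so the specific options used here are an assumption.
# The output directory `out/ages_csv` is hypothetical.
(df.write
   .option("header", "true")   # write column names as the first line
   .option("nullValue", "NA")  # render nulls as the string "NA"
   .mode("overwrite")          # replace the output directory if it exists
   .csv("out/ages_csv"))
```

Note that `mode("overwrite")` here is the save mode of `DataFrameWriter`, distinct from the CSV parse-mode option (`PERMISSIVE`/`DROPMALFORMED`/`FAILFAST`) documented on the reader.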