[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

Jorge Machado (JIRA) Thu, 05 Oct 2017 21:40:41 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194129#comment-16194129
 ]


Jorge Machado commented on SPARK-20055:
---------------------------------------

[~aash] Should I copy and paste those options? There is already some documentation in the API docs:


{code:java}
def csv(paths: String*): DataFrame
Loads CSV files and returns the result as a DataFrame.

This function will go through the input once to determine the input schema if 
inferSchema is enabled. To avoid this extra pass over the data, disable the 
inferSchema option or specify the schema explicitly using schema.

You can set the following CSV-specific options to deal with CSV files:

sep (default ,): sets the single character as a separator for each field and 
value.
encoding (default UTF-8): decodes the CSV files by the given encoding type.
quote (default "): sets the single character used for escaping quoted values 
where the separator can be part of the value. If you would like to turn off 
quotations, you need to set not null but an empty string. This behaviour is 
different from com.databricks.spark.csv.
escape (default \): sets the single character used for escaping quotes inside 
an already quoted value.
comment (default empty string): sets the single character used for skipping 
lines beginning with this character. By default, it is disabled.
header (default false): uses the first line as names of columns.
inferSchema (default false): infers the input schema automatically from data. 
It requires one extra pass over the data.
ignoreLeadingWhiteSpace (default false): a flag indicating whether or not 
leading whitespaces from values being read should be skipped.
ignoreTrailingWhiteSpace (default false): a flag indicating whether or not 
trailing whitespaces from values being read should be skipped.
nullValue (default empty string): sets the string representation of a null 
value. Since 2.0.1, this applies to all supported types including the string 
type.
nanValue (default NaN): sets the string representation of a non-number value.
positiveInf (default Inf): sets the string representation of a positive 
infinity value.
negativeInf (default -Inf): sets the string representation of a negative 
infinity value.
dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. 
Custom date formats follow the formats at java.text.SimpleDateFormat. This 
applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that 
indicates a timestamp format. Custom date formats follow the formats at 
java.text.SimpleDateFormat. This applies to timestamp type.
maxColumns (default 20480): defines a hard limit of how many columns a record 
can have.
maxCharsPerColumn (default -1): defines the maximum number of characters 
allowed for any given value being read. By default, it is -1, meaning unlimited 
length.
mode (default PERMISSIVE): allows a mode for dealing with corrupt records 
during parsing. It supports the following case-insensitive modes.
PERMISSIVE : sets other fields to null when it meets a corrupted record, and 
puts the malformed string into a field configured by columnNameOfCorruptRecord. 
To keep corrupt records, a user can set a string type field named 
columnNameOfCorruptRecord in a user-defined schema. If a schema does not have 
the field, it drops corrupt records during parsing. When the number of parsed 
CSV tokens is smaller than the number of fields expected by the schema, it sets 
the extra fields to null.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
columnNameOfCorruptRecord (default is the value specified in 
spark.sql.columnNameOfCorruptRecord): allows renaming the new field having 
malformed string created by PERMISSIVE mode. This overrides 
spark.sql.columnNameOfCorruptRecord.
multiLine (default false): parse one record, which may span multiple lines.
{code}
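
For the programming guide, a short example next to the options list would probably help. Below is a minimal sketch in Scala, not a proposed final text: the sample data, the temporary file, the `;` separator, the `NA` null marker, and the object name `CsvOptionsSketch` are all assumptions made for illustration. It passes an explicit schema so that no inference pass is needed, and sets a few of the options listed above.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

object CsvOptionsSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-options-sketch")
      .getOrCreate()

    // Hypothetical sample data written to a temporary file.
    val csvFile = Files.createTempFile("people", ".csv")
    val sample = "id;name;score\n1;alice;3.5\n2;bob;NA\n"
    Files.write(csvFile, sample.getBytes(StandardCharsets.UTF_8))

    // An explicit schema avoids the extra pass that inferSchema would require.
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType),
      StructField("score", DoubleType)))

    val df = spark.read
      .schema(schema)
      .option("header", "true")        // first line holds the column names
      .option("sep", ";")              // single-character field separator
      .option("nullValue", "NA")       // string that represents null
      .option("mode", "DROPMALFORMED") // skip records that fail to parse
      .csv(csvFile.toUri.toString)

    df.printSchema()
    df.show()

    spark.stop()
  }
}
```

Replacing `.schema(schema)` with `.option("inferSchema", "true")` would let Spark derive the column types itself, at the cost of the extra pass over the data mentioned in the API docs.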

> Documentation for CSV datasets in SQL programming guide
> -------------------------------------------------------
>
>                 Key: SPARK-20055
>                 URL: https://issues.apache.org/jira/browse/SPARK-20055
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 2.2.0
>            Reporter: Hyukjin Kwon
>
> I guess things commonly used and important are documented there rather than 
> documenting everything and every option in the programming guide - 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> are documented whereas CSV datasets are not. 
> Nowadays, they are pretty similar in APIs and options. Some options are 
> notable for both, in particular ones such as {{wholeFile}}. Moreover, 
> several options such as {{inferSchema}} and {{header}} are important in CSV 
> because they affect the type/column name of data.
> In that sense, I think we had better document CSV datasets with some 
> examples too, because I believe reading CSV is a pretty common use case.
> Also, I think we could leave some pointers to the options in the API 
> documentation for both (rather than duplicating the documentation).
> So, my suggestion is,
> - Add CSV Datasets section.
> - Add links for options for both JSON and CSV that point to each API 
> documentation
> - Make trivial minor fixes together in both sections.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
