[
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054262#comment-14054262
]
Hossein Falaki edited comment on SPARK-2360 at 7/8/14 12:45 AM:
----------------------------------------------------------------
As a point of comparison, the interfaces in some other popular packages are:
_R_:
{code}
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill =
TRUE, comment.char = "", ...)
{code}
Where:
* header: a logical value indicating whether the file contains the names of the
variables as its first line.
* sep: the field separator character.
* quote: the set of quoting characters. To disable quoting altogether, use
‘quote = ""’
* dec: the character used in the file for decimal points.
* fill: if ‘TRUE’, blank fields are implicitly added when rows have unequal
length.
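To make these options concrete, here is a minimal sketch in Python, using only the standard csv module, of how header, sep, quote and fill could behave. The helper name read_table is hypothetical and is not an R or Spark API. It pads rows to the widest row in the file, which simplifies how R itself decides the column count, and it does not cover dec or comment.char.

```python
import csv
import io


def read_table(text, header=True, sep=",", quote='"', fill=True):
    """Hypothetical helper mimicking a few read.csv options (illustration only)."""
    if quote:
        reader = csv.reader(io.StringIO(text), delimiter=sep, quotechar=quote)
    else:
        # An empty quote string disables quoting, mirroring quote = "" in R.
        reader = csv.reader(io.StringIO(text), delimiter=sep,
                            quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return [], []
    width = max(len(row) for row in rows)
    if fill:
        # Pad short rows with empty strings so every row has the same length.
        rows = [row + [""] * (width - len(row)) for row in rows]
    if header:
        names, data = rows[0], rows[1:]
    else:
        # R names unnamed columns V1, V2, ... when header = FALSE.
        names = ["V%d" % (i + 1) for i in range(width)]
        data = rows
    return names, data


names, data = read_table('a,b,c\n1,"x,y"\n2,3,4\n')
print(names)
print(data)
```

The second row has only two fields, so with fill enabled it is padded with one empty field; the quoted field keeps its embedded comma.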
_pandas_:
{code}
pandas.io.parsers.read_csv(filepath_or_buffer, sep=',', dialect=None,
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0,
skipinitialspace=False, lineterminator=None, header='infer', index_col=None,
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0,
na_values=None, na_fvalues=None, true_values=None, false_values=None,
delimiter=None, converters=None, dtype=None, usecols=None, engine=None,
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False,
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True,
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None,
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False,
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None,
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True,
tupleize_cols=False, infer_datetime_format=False)
{code}
Descriptions of the parameters can be found here:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
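Of the many pandas parameters, header, names and dtype bear most directly on the questions raised in this issue (column names and column types). The following is a rough sketch of that behaviour in plain Python, not pandas code. The helper load_columns is hypothetical. It assumes that when no names are given the first row is the header, and that columns without a declared type stay strings.

```python
import csv
import io


def load_columns(text, names=None, dtype=None, sep=","):
    """Hypothetical helper: header inference plus optional per-column types."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=sep) if row]
    if names is None:
        if not rows:
            return []
        # No names supplied: take them from the first row.
        names, rows = rows[0], rows[1:]
    dtype = dtype or {}
    records = []
    for row in rows:
        record = {}
        for name, value in zip(names, row):
            convert = dtype.get(name, str)  # untyped columns stay strings
            record[name] = convert(value)
        records.append(record)
    return records


typed = load_columns("id,price\n1,2.5\n2,4\n",
                     dtype={"id": int, "price": float})
print(typed)
```

Passing names explicitly treats every row as data, which is one possible answer to "provide column names or infer them from the first row".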
> CSV import to SchemaRDDs
> ------------------------
>
> Key: SPARK-2360
> URL: https://issues.apache.org/jira/browse/SPARK-2360
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Michael Armbrust
> Priority: Minor
>
> I think the first step is to design the interface that we want to present to
> users. Mostly this is defining options when importing. Off the top of my
> head:
> - What is the separator?
> - Provide column names or infer them from the first row.
> - How to handle multiple files with possibly different schemas?
> - Do we have a method to let users specify the datatypes of the columns or
> are they just strings?
> - What types of quoting / escaping do we want to support?