jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r828967518
##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
encoding : str, optional (default 'utf8')
The character encoding of the CSV data. Columns that cannot
decode using this encoding can still be read as Binary.
+
+ Example
+ -------
+
+ Defining an example file from bytes object:
+
+ >>> import io
+ >>> s = b'''1,2,3
+ ... Flamingo,2,2022-03-01
+ ... Horse,4,2022-03-02
+ ... Brittle stars,5,2022-03-03
+ ... Centipede,100,2022-03-04'''
Review comment:
The `...` continuation prompts are correct doctest syntax, but the
annoying thing is that the example won't work if you copy it to run it
yourself. I don't know a good solution, though.
##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
produce a column of nulls (whose type is selected using
`column_types`, or null by default).
This option is ignored if `include_columns` is empty.
+
+ Example
+ -------
+
+ Defining an example file from bytes object:
+
+ >>> import io
+ >>> s = b'''1,2,3
+ ... Flamingo,2,01/03/2022
+ ... Horse,4,02/03/2022
+ ... Brittle stars,5,03/03/2022
+ ... Centipede,100,04/03/2022'''
+
+ Define date parsing format to get a timestamp type column
+ (in case dates are not in ISO format and not converted by default):
+
+ >>> convert_options = csv.ConvertOptions(
+ ... timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
Review comment:
It seems like bad practice to mix month-first and day-first values in a
single column, so maybe this is not the best example to show (or maybe use
one with a different delimiter instead, like `["%m/%d/%Y", "%m-%d-%Y"]`).
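
To illustrate the concern, here is a minimal sketch that uses only the
standard library's `datetime.strptime`, which accepts the same format
strings. The sample values and the helper `parse_first_match` are made up
for illustration; the helper only mimics "try each format in order" and is
not pyarrow's implementation.

```python
from datetime import datetime


def parse_first_match(value, formats):
    """Return the result of the first format that parses `value`.

    Hypothetical helper, used only to mimic trying parsers in order.
    """
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"no format matched {value!r}")


mixed = ["%m/%d/%Y", "%d/%m/%Y"]

# Matches the first (month-first) format: January 3rd.
first = parse_first_match("01/03/2022", mixed)
# Month 13 is invalid, so this falls through to day-first: March 13th.
second = parse_first_match("13/03/2022", mixed)
print(first.date(), second.date())

# When the formats differ only in the delimiter, each value can match
# at most one of them, so the interpretation never changes silently.
by_delimiter = ["%m/%d/%Y", "%m-%d-%Y"]
third = parse_first_match("01-03-2022", by_delimiter)
print(third.date())
```

With the mixed list, two values in the same column can end up read under
different conventions without any error being raised.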
##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
produce a column of nulls (whose type is selected using
`column_types`, or null by default).
This option is ignored if `include_columns` is empty.
+
+ Example
+ -------
+
+ Defining an example file from bytes object:
+
+ >>> import io
+ >>> s = b'''1,2,3
+ ... Flamingo,2,01/03/2022
+ ... Horse,4,02/03/2022
+ ... Brittle stars,5,03/03/2022
+ ... Centipede,100,04/03/2022'''
+
+ Define date parsing format to get a timestamp type column
+ (in case dates are not in ISO format and not converted by default):
+
+ >>> convert_options = csv.ConvertOptions(
+ ... timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+ >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+ pyarrow.Table
+ animals: string
+ n_legs: int64
+ entry: timestamp[s]
+ ----
+ animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+ n_legs: [[2,4,5,100]]
+ entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+ 2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+ Specify which columns to read and add an additional column:
+
+ >>> convert_options = csv.ConvertOptions(
+ ... include_columns=["animals", "location"],
+ ... include_missing_columns=True)
+ >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+ pyarrow.Table
+ animals: string
+ location: null
+ ----
+ animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+ location: [4 nulls]
+
+ Define a column as a dictionary:
+
+ >>> convert_options = csv.ConvertOptions(
+ ... include_columns=["animals"],
+ ... auto_dict_encode=True)
+ >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+ pyarrow.Table
+ animals: dictionary<values=string, indices=int32, ordered=0>
+ ----
+ animals: [ -- dictionary:
+ ["Flamingo","Horse","Brittle stars","Centipede"] -- indices:
+ [0,1,2,3]]
+
+ Set upper limit for the number of categories. If the categories
+ is more than the limit, the conversion to dictionary will not
+ happen:
+
+ >>> convert_options = csv.ConvertOptions(
+ ... include_columns=["animals"],
+ ... auto_dict_encode=True,
+ ... auto_dict_max_cardinality=2)
+ >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+ pyarrow.Table
+ animals: string
+ ----
+ animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+
+ Define strings that should be set to missing:
+
+ >>> convert_options = csv.ConvertOptions(include_columns=["animals"],
+ ... strings_can_be_null = True,
+ ... null_values=["Horse"])
+ >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+ pyarrow.Table
+ animals: string
+ ----
+ animals: [["Flamingo",null,"Brittle stars","Centipede"]]
+
+ Define values to be True and False when converting a column
+ into a bool type:
+
+ >>> convert_options = csv.ConvertOptions(
+ ... include_columns=["animals"],
+ ... false_values=["Flamingo","Horse"],
+ ... true_values=["Brittle stars","Centipede"])
Review comment:
We could also decide to use a few different variants of the example csv
data, which could make the examples more realistic. For example, you could
have a column with "F" and "N" values, or "Yes" and "No", or something like
that.
##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
encoding : str, optional (default 'utf8')
The character encoding of the CSV data. Columns that cannot
decode using this encoding can still be read as Binary.
+
+ Example
+ -------
+
+ Defining an example file from bytes object:
+
+ >>> import io
+ >>> s = b'''1,2,3
+ ... Flamingo,2,2022-03-01
+ ... Horse,4,2022-03-02
+ ... Brittle stars,5,2022-03-03
+ ... Centipede,100,2022-03-04'''
Review comment:
I checked the pandas guide, where we seem to solve this by defining the
data as a single-line string with manual `\n` line breaks, and then
printing it first so the reader can see what it looks like. For example, see
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#specifying-column-data-types
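
Applied to the data in this example, that approach might look like the
sketch below. The header row `animals,n_legs,entry` is an assumption taken
from the column names in the example output, and the names `data` and
`source` are placeholders.

```python
import io

# Single-line literal with explicit \n: no `...` continuation prompts
# to strip when copying the example out of the docstring.
data = b"animals,n_legs,entry\nFlamingo,2,2022-03-01\nHorse,4,2022-03-02\nBrittle stars,5,2022-03-03\nCentipede,100,2022-03-04"

# Printing the data first shows what the "file" looks like.
print(data.decode())

# Wrap the bytes in a file-like object, ready to hand to a CSV reader.
source = io.BytesIO(data)
```

The trade-off is a long source line in exchange for an example that can
be pasted into an interpreter as-is.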