[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

GitBox Thu, 24 Mar 2022 04:49:36 -0700


jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r834214328




##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,143 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Examples
+    --------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''animals,n_legs,entry,fast
+    ... Flamingo,2,01/03/2022,Yes
+    ... Horse,4,02/03/2022,Yes
+    ... Brittle stars,5,03/03/2022,No
+    ... Centipede,100,04/03/2022,No
+    ... ,6,05/03/2022,'''
+
+    Change the type of a column:
+
+    >>> import pyarrow as pa
+    >>> convert_options = csv.ConvertOptions(column_types={"n_legs": pa.float64()})
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: double
+    entry: string
+    fast: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [["01/03/2022","02/03/2022","03/03/2022","04/03/2022","05/03/2022"]]
+    fast: [["Yes","Yes","No","No",""]]
+
+    Define a date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    fast: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,2022-03-03 00:00:00,
+    2022-04-03 00:00:00,2022-05-03 00:00:00]]
+    fast: [["Yes","Yes","No","No",""]]
+
+    Specify a subset of columns to be read:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "n_legs"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+
+    List additional column to be included as a null typed column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "n_legs", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    location: [5 nulls]
+
+    Define a column as a dictionary:

Review comment:
       ```suggestion
       Define a column as a dictionary type:
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,143 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Examples
+    --------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''animals,n_legs,entry,fast
+    ... Flamingo,2,01/03/2022,Yes
+    ... Horse,4,02/03/2022,Yes
+    ... Brittle stars,5,03/03/2022,No
+    ... Centipede,100,04/03/2022,No
+    ... ,6,05/03/2022,'''
+
+    Change the type of a column:
+
+    >>> import pyarrow as pa
+    >>> convert_options = csv.ConvertOptions(column_types={"n_legs": pa.float64()})
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: double
+    entry: string
+    fast: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [["01/03/2022","02/03/2022","03/03/2022","04/03/2022","05/03/2022"]]
+    fast: [["Yes","Yes","No","No",""]]
+
+    Define a date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    fast: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,2022-03-03 00:00:00,
+    2022-04-03 00:00:00,2022-05-03 00:00:00]]
+    fast: [["Yes","Yes","No","No",""]]
+
+    Specify a subset of columns to be read:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "n_legs"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+
+    List additional column to be included as a null typed column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "n_legs", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    location: [5 nulls]
+
+    Define a column as a dictionary:

Review comment:
       Maybe also mention that by default only the string/binary columns are dictionary encoded?

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -299,6 +352,48 @@ cdef class ParseOptions(_Weakrefable):
         parsing (because of a mismatching number of columns).
         It should accept a single InvalidRow argument and return either
         "skip" or "error" depending on the desired outcome.
+
+    Examples
+    --------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''animals;n_legs;entry
+    ... Flamingo;2;2022-03-01
+    ... # Comment here:
+    ... Horse;4;2022-03-02
+    ... Brittle stars;5;2022-03-03
+    ... Centipede;100;2022-03-04'''
+
+    Read the data from a file skipping rows with comments
+    and defining the delimiter:
+
+    >>> from pyarrow import csv
+
+    >>> class InvalidRowHandler:
+    ...     def __init__(self, result):
+    ...         self.result = result
+    ...     def __call__(self, row):
+    ...         if row.text.startswith("# "):
+    ...             return self.result
+    ...         else:
+    ...             return 'error'
+    ...
+    >>> skip_handler = InvalidRowHandler('skip')

Review comment:
       We used such a class in the tests (to make it more flexible, and to test what was passed), but for this example here, it could maybe also be a simpler function? Something like:
   
   ```
   def skip_comment(row):
       if row.text.startswith("# "):
           return "skip"
       return "error"
   ```
   
   (didn't test if it works)
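
A quick way to sanity-check the function above without reading a file is to call it on stand-in objects. This is only a sketch: `SimpleNamespace` is a hypothetical substitute for the `InvalidRow` argument the reader passes to the handler, and only its `text` attribute is modelled.

```python
from types import SimpleNamespace


def skip_comment(row):
    """Return "skip" for comment rows and "error" for any other invalid row."""
    if row.text.startswith("# "):
        return "skip"
    return "error"


# Stand-ins for the InvalidRow objects the CSV reader would pass in.
# Assumption for illustration: only the `text` attribute is needed here.
comment_row = SimpleNamespace(text="# Comment here:")
broken_row = SimpleNamespace(text="Horse;4")

print(skip_comment(comment_row))
print(skip_comment(broken_row))
```

In the docstring, the function would then be passed as `invalid_row_handler=skip_comment` to `csv.ParseOptions`, together with `delimiter=";"`, in place of the `InvalidRowHandler('skip')` instance.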

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,143 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Examples
+    --------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''animals,n_legs,entry,fast
+    ... Flamingo,2,01/03/2022,Yes
+    ... Horse,4,02/03/2022,Yes
+    ... Brittle stars,5,03/03/2022,No
+    ... Centipede,100,04/03/2022,No
+    ... ,6,05/03/2022,'''
+
+    Change the type of a column:
+
+    >>> import pyarrow as pa
+    >>> convert_options = csv.ConvertOptions(column_types={"n_legs": pa.float64()})
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: double
+    entry: string
+    fast: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [["01/03/2022","02/03/2022","03/03/2022","04/03/2022","05/03/2022"]]
+    fast: [["Yes","Yes","No","No",""]]
+
+    Define a date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    fast: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,2022-03-03 00:00:00,
+    2022-04-03 00:00:00,2022-05-03 00:00:00]]
+    fast: [["Yes","Yes","No","No",""]]
+
+    Specify a subset of columns to be read:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "n_legs"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+
+    List additional column to be included as a null typed column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "n_legs", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+    n_legs: [[2,4,5,100,6]]
+    location: [5 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"],
+    ...                   auto_dict_encode=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: dictionary<values=string, indices=int32, ordered=0>
+    n_legs: int64
+    entry: timestamp[s]
+    fast: dictionary<values=string, indices=int32, ordered=0>
+    ----
+    animals: [  -- dictionary:
+    ["Flamingo","Horse","Brittle stars","Centipede",""]  -- indices:
+    [0,1,2,3,4]]
+    n_legs: [[2,4,5,100,6]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,2022-03-03 00:00:00,
+    2022-04-03 00:00:00,2022-05-03 00:00:00]]
+    fast: [  -- dictionary:
+    ["Yes","No",""]  -- indices:
+    [0,0,1,1,2]]
+
+    Set upper limit for the number of categories. If the categories
+    is more than the limit, the conversion to dictionary will not
+    happen:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True,
+    ...                   auto_dict_max_cardinality=2)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
+
+    Set empty strings to missing values:
+
+    >>> convert_options = csv.ConvertOptions(include_columns=["animals", "n_legs"],
+    ...                   strings_can_be_null = True)

Review comment:
       ```suggestion
       ...                   strings_can_be_null=True)
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -121,6 +121,59 @@ cdef class ReadOptions(_Weakrefable):
     encoding : str, optional (default 'utf8')
         The character encoding of the CSV data.  Columns that cannot
         decode using this encoding can still be read as Binary.
+
+    Example
+    -------
+
+    Defining an example file from bytes object:
+
+    >>> import io
+    >>> s = b'''1,2,3
+    ... Flamingo,2,2022-03-01
+    ... Horse,4,2022-03-02
+    ... Brittle stars,5,2022-03-03
+    ... Centipede,100,2022-03-04'''

Review comment:
       Idea from our call: use `\\n` if that also renders fine in the html docs?
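
To illustrate the escaping involved, here is a small sketch, assuming the docstring is a regular (non-raw) string. The function name `read_options_example` is hypothetical and the data is shortened. Whether the result also renders well in the HTML docs is not checked here.

```python
def read_options_example():
    """
    >>> s = b"1,2,3\\nFlamingo,2,2022-03-01\\nHorse,4,2022-03-02"
    """


# In a non-raw docstring the doubled backslash collapses to a single one,
# so the documented example line contains the two characters "\" and "n".
doc_line = read_options_example.__doc__.strip()
print(doc_line)

# The one-line literal with "\n" is the same bytes object as the
# multi-line triple-quoted literal currently used in the docstring.
multi_line = b'''1,2,3
Flamingo,2,2022-03-01
Horse,4,2022-03-02'''
one_line = b"1,2,3\nFlamingo,2,2022-03-01\nHorse,4,2022-03-02"
print(multi_line == one_line)
```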



