jorisvandenbossche commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r829120331



##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Examples
+    --------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> from pyarrow import csv
+    >>> s = b'''animals,n_legs,entry
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column

Review comment:
   ```suggestion
   Define a date parsing format to get a timestamp type column
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Examples
+    --------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> from pyarrow import csv
+    >>> s = b'''animals,n_legs,entry
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    location: [4 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: dictionary<values=string, indices=int32, ordered=0>
+    ----
+    animals: [  -- dictionary:
+    ["Flamingo","Horse","Brittle stars","Centipede"]  -- indices:
+    [0,1,2,3]]
+
+    Set an upper limit for the number of categories. If the number
+    of categories exceeds the limit, the conversion to dictionary
+    will not happen:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True,
+    ...                   auto_dict_max_cardinality=2)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+
+    Define strings that should be set to missing:
+
+    >>> convert_options = csv.ConvertOptions(include_columns=["animals"],
+    ...                                      strings_can_be_null=True,
+    ...                                      null_values=["Horse"])

Review comment:
   A typical example for `strings_can_be_null` is data that uses an empty
slot for missing values (e.g. pandas does this by default when writing
data):
   
   ```
   In [5]: data = b"a,b\nA,1\n,2\nC,3"
   
   In [6]: print(data.decode())
   a,b
   A,1
   ,2
   C,3
   
   In [9]: csv.read_csv(io.BytesIO(data))
   Out[9]: 
   pyarrow.Table
   a: string
   b: int64
   ----
   a: [["A","","C"]]
   b: [[1,2,3]]
   
   In [10]: csv.read_csv(io.BytesIO(data),
      ...:     convert_options=csv.ConvertOptions(strings_can_be_null=True))
   Out[10]: 
   pyarrow.Table
   a: string
   b: int64
   ----
   a: [["A",null,"C"]]
   b: [[1,2,3]]
   ```

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Examples
+    --------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> from pyarrow import csv
+    >>> s = b'''animals,n_legs,entry
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:

Review comment:
       It makes it longer, but it might be clearer to have two examples here: 
first only read a subset of columns, and then show that you can also list 
extra columns and get them included as null-typed columns.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
