[GitHub] [arrow] AlenkaF commented on a change in pull request #12543: ARROW-15432: [Python] Address CSV docstrings

GitBox Tue, 22 Mar 2022 06:20:10 -0700


AlenkaF commented on a change in pull request #12543:
URL: https://github.com/apache/arrow/pull/12543#discussion_r832167636




##########
File path: python/pyarrow/_csv.pyx
##########
@@ -546,7 +641,114 @@ cdef class ConvertOptions(_Weakrefable):
         produce a column of nulls (whose type is selected using
         `column_types`, or null by default).
         This option is ignored if `include_columns` is empty.
+
+    Examples
+    --------
+
+    Defining an example file from a bytes object:
+
+    >>> import io
+    >>> from pyarrow import csv
+    >>> s = b'''animals,n_legs,entry
+    ... Flamingo,2,01/03/2022
+    ... Horse,4,02/03/2022
+    ... Brittle stars,5,03/03/2022
+    ... Centipede,100,04/03/2022'''
+
+    Define a date parsing format to get a timestamp type column
+    (in case dates are not in ISO format and not converted by default):
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"])
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    n_legs: int64
+    entry: timestamp[s]
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    n_legs: [[2,4,5,100]]
+    entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,
+    2022-03-03 00:00:00,2022-04-03 00:00:00]]
+
+    Specify which columns to read and add an additional column:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals", "location"],
+    ...                   include_missing_columns=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    location: null
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+    location: [4 nulls]
+
+    Define a column as a dictionary:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: dictionary<values=string, indices=int32, ordered=0>
+    ----
+    animals: [  -- dictionary:
+    ["Flamingo","Horse","Brittle stars","Centipede"]  -- indices:
+    [0,1,2,3]]
+
+    Set an upper limit for the number of categories. If the number of
+    categories exceeds the limit, the conversion to dictionary will
+    not happen:
+
+    >>> convert_options = csv.ConvertOptions(
+    ...                   include_columns=["animals"],
+    ...                   auto_dict_encode=True,
+    ...                   auto_dict_max_cardinality=2)
+    >>> csv.read_csv(io.BytesIO(s), convert_options=convert_options)
+    pyarrow.Table
+    animals: string
+    ----
+    animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
+
+    Define strings that should be set to missing:
+
+    >>> convert_options = csv.ConvertOptions(include_columns=["animals"],
+    ...                                      strings_can_be_null=True,
+    ...                                      null_values=["Horse"])

Review comment:
       Oh great, thanks for clarifying! Will correct the example to show this option.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
