jorisvandenbossche commented on issue #44030:
URL: https://github.com/apache/arrow/issues/44030#issuecomment-2340565121
> My question is whether I'm missing something about how to do this during the `pa.csv.read_csv`
And your code looks good. An alternative way to do this is to cast the full
table to a schema where the one column you want to change has the timestamp
type (I find the `set_column` API a bit awkward to use; casting the full
table _can_ be easier if you have that target schema at hand).
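As a minimal sketch of the two approaches (the column name `"a"` and the assumption that it holds Unix seconds are only for illustration):

```
import io

import pyarrow as pa
from pyarrow import csv

data = b"a\n20240101"
table = csv.read_csv(io.BytesIO(data))  # column "a" is inferred as int64

# Option 1: replace the single column with set_column (needs the column index).
fixed = table.set_column(0, "a", table["a"].cast(pa.timestamp("s")))

# Option 2: with the target schema at hand, cast the whole table in one go.
target_schema = pa.schema([("a", pa.timestamp("s"))])
fixed = table.cast(target_schema)
```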
> we would need to have some way to distinguish the cases of parsing something as a string vs parsing it as a unix integer, because you can have values that could match both.
To illustrate this:
```
In [15]: import pyarrow as pa

In [16]: from pyarrow import csv

In [17]: data = """a
    ...: 20240101"""

In [18]: import io

In [19]: csv.read_csv(
    ...:     io.BytesIO(data.encode()),
    ...:     convert_options=csv.ConvertOptions(
    ...:         column_types={"a": pa.timestamp("ms")},
    ...:         timestamp_parsers=["%Y%m%d"],
    ...:     ),
    ...: )
Out[19]:
pyarrow.Table
a: timestamp[ms]
----
a: [[2024-01-01 00:00:00.000]]

In [20]: csv.read_csv(io.BytesIO(data.encode())).cast(
    ...:     pa.schema([("a", pa.timestamp("s"))])
    ...: )
Out[20]:
pyarrow.Table
a: timestamp[s]
----
a: [[1970-08-23 06:15:01]]
```
So we would need some "UnixTimestamp" placeholder to indicate in the parsers
argument that a value is to be treated as an integer instead of being parsed as
a string ..
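Purely as a sketch of that idea (nothing below exists in pyarrow today; the name `csv.UnixTimestamp` is hypothetical), such a placeholder could be spelled along these lines:

```
# Hypothetical API sketch -- csv.UnixTimestamp does not exist, this will not run:
csv.read_csv(
    io.BytesIO(data.encode()),
    convert_options=csv.ConvertOptions(
        column_types={"a": pa.timestamp("s")},
        timestamp_parsers=[csv.UnixTimestamp],  # "treat the raw value as a Unix integer"
    ),
)
```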