jorisvandenbossche commented on issue #44030:
URL: https://github.com/apache/arrow/issues/44030#issuecomment-2340565121

   > My question is whether I'm missing something about how to do this during the `pa.csv.read_csv`
   
   And your code looks good. An alternative way to do this is to cast the full table to a schema where the one column you want to change has the timestamp type (I find the `set_column` API a bit awkward to use; casting the full table _can_ be easier if you have that schema at hand).
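   
   To make the comparison concrete, here is a minimal sketch of the two approaches on a small made-up table (the column names and values are only for illustration):
   
   ```
   import pyarrow as pa
   
   # toy table: "a" holds ISO-formatted date strings, "b" is some other column
   table = pa.table({"a": ["2024-01-01"], "b": [1]})
   
   # set_column route: locate the column, convert it, and swap it back in by index
   idx = table.schema.get_field_index("a")
   converted = table.column("a").cast(pa.timestamp("ms"))
   via_set_column = table.set_column(idx, pa.field("a", pa.timestamp("ms")), converted)
   
   # cast route: describe the target schema once and cast the whole table
   target = pa.schema([("a", pa.timestamp("ms")), ("b", pa.int64())])
   via_cast = table.cast(target)
   
   assert via_set_column.equals(via_cast)
   ```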
   
   
   
   > we would need to have some way to distinguish the cases of parsing something as a string vs parsing it as a unix integer, because you can have values that could match both.
   
   To illustrate this:
   
   ```
   In [15]: import pyarrow as pa
   
   In [16]: from pyarrow import csv
   
   In [17]: data = """a
       ...: 20240101"""
   
   In [18]: import io
   
   In [19]: csv.read_csv(io.BytesIO(data.encode()), convert_options=csv.ConvertOptions(
       ...:     column_types={"a": pa.timestamp("ms")}, timestamp_parsers=["%Y%m%d"])
       ...: )
   Out[19]: 
   pyarrow.Table
   a: timestamp[ms]
   ----
   a: [[2024-01-01 00:00:00.000]]
   
   In [20]: csv.read_csv(io.BytesIO(data.encode())).cast(pa.schema([("a", pa.timestamp("s"))]))
   Out[20]: 
   pyarrow.Table
   a: timestamp[s]
   ----
   a: [[1970-08-23 06:15:01]]
   ```
   
   So we would need some kind of "UnixTimestamp" placeholder in the `timestamp_parsers` argument to indicate that a column should be treated as an integer instead of being parsed as a string.
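   
   In the meantime, here is a minimal sketch of the explicit workaround that is available today (essentially the `In [20]` example above, but forcing the column to int64 so no timestamp parsing is attempted before the cast):
   
   ```
   import io
   import pyarrow as pa
   from pyarrow import csv
   
   data = b"a\n20240101"
   
   # read "a" explicitly as int64, so the CSV reader does not try to parse timestamps
   table = csv.read_csv(
       io.BytesIO(data),
       convert_options=csv.ConvertOptions(column_types={"a": pa.int64()}),
   )
   
   # then reinterpret the integers as unix seconds by casting the full table
   print(table.cast(pa.schema([("a", pa.timestamp("s"))])))
   ```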

