mdavis-xyz opened a new issue, #39839:
URL: https://github.com/apache/arrow/issues/39839

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I have a large set of CSV files I want to read with pyarrow. It's too large to fit into memory. So I'm using [`pyarrow.dataset.dataset`](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html) to stream it into a parquet file.
   
   I can successfully parse a timestamp like `2016/04/20 10:12:10`. But I cannot parse one like `2016/04/20 10:12:10.123` or `2016/04/20 10:12:10.123456`, even when I add `.%f`.
   
   `data.csv`
   ```
   a,t
   1,2016/04/20 10:12:10.123456
   ```
   
   ```
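   import pyarrow as pa
   import pyarrow.dataset as ds
   from pyarrow import csv
   
   # Column names match the header row of data.csv
   schema = {
       'a': pa.int64(),
       't': pa.timestamp('us'),
   }
   dataset = ds.dataset(
       source='data.csv', 
       format=ds.CsvFileFormat(
           convert_options=csv.ConvertOptions(
               # Formats are tried in order; the first includes fractional seconds
               timestamp_parsers=[
                   "%Y/%m/%d %H:%M:%S.%f",
                   "%Y/%m/%d %H:%M:%S",
               ]
           )
       ),
       schema=pa.schema(schema)
   )
   
   # Raises ArrowInvalid when the timestamps contain fractional seconds
   dataset.to_table().to_pandas()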
   ```
   
   This results in an error:
   
   > ArrowInvalid: Could not open CSV input source '/home/matthew/data/debug/testcsv/data.csv': Invalid: In CSV column #1: Row #2: CSV conversion error to timestamp[us]: invalid value '2016/04/20 10:12:10.123456'
   
   Note that [the documentation for `timestamp_parsers`](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions) says:
   
   > A sequence of strptime()-compatible format strings, tried in order when 
attempting to infer or convert timestamp values (the special value ISO8601() 
can also be given). By default, a fast built-in ISO-8601 parser is used.
   
   And using plain `datetime.datetime.strptime` does work for this formatting 
string.
   
   ```
   from datetime import datetime
   datetime.strptime("2016/04/20 10:12:10.123456", "%Y/%m/%d %H:%M:%S.%f")
   ```
   
   If I delete the microsecond component in the CSV, it runs without error. If 
I also delete the second format string, leaving only the one with `.%f`, I get 
an error, as expected. If I try with a CSV without the microsecond component, 
and with the two format strings swapped, it works. This shows that pyarrow is 
indeed using the format strings I'm giving it.
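
For comparison, here is a minimal plain-Python sketch of the "tried in order" behaviour I expected from `timestamp_parsers`. It only illustrates the semantics described in the documentation; `parse_first_match` is a hypothetical helper, not a pyarrow function, and it says nothing about how pyarrow implements parsing internally.

```python
from datetime import datetime


def parse_first_match(value, formats):
    """Hypothetical helper: return the first successful strptime parse."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"no format matched {value!r}")


# Same two format strings, in the same order, as in the pyarrow example
FORMATS = ["%Y/%m/%d %H:%M:%S.%f", "%Y/%m/%d %H:%M:%S"]

with_fraction = parse_first_match("2016/04/20 10:12:10.123456", FORMATS)
without_fraction = parse_first_match("2016/04/20 10:12:10", FORMATS)
```

With Python's own `strptime`, both values parse: the first via the `.%f` format, the second via the fallback format.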
   
   Note that for my real use case my data has only 3 decimal digits, not 6. 
(Initially I wondered whether `%f` only works for 6. But plain 
`datetime.strptime` works with 3 too.) For my use case I actually don't care if 
the fractional part is discarded.
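
A short check of the 3-digit versus 6-digit claim, using only the standard library. This covers Python's `datetime.strptime` only and is not evidence about what pyarrow's parser accepts. The last line shows the truncation that would be acceptable for my use case.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M:%S.%f"

# Python's %f accepts one to six digits when parsing
three_digits = datetime.strptime("2016/04/20 10:12:10.123", FMT)
six_digits = datetime.strptime("2016/04/20 10:12:10.123456", FMT)

# Short fractions are right-padded, so ".123" means 123 milliseconds
print(three_digits.microsecond)  # 123000
print(six_digits.microsecond)    # 123456

# Discarding the fractional part entirely
truncated = three_digits.replace(microsecond=0)
```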
   
   ### Component(s)
   
   Python

