[ 
https://issues.apache.org/jira/browse/ARROW-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488436#comment-17488436
 ] 

Nicola Crane edited comment on ARROW-15599 at 2/8/22, 2:01 PM:
---------------------------------------------------------------

Thanks for reporting this!

Here's a reprex with more verbose output.

{code:r}
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
tf <- tempfile()
write.csv(data.frame(x = '2018-10-07 19:04:05.005'), tf, row.names = FALSE)

# successfully read in file
read_csv_arrow(tf, as_data_frame = TRUE)
#> # A tibble: 1 × 1
#>   x                  
#>   <dttm>             
#> 1 2018-10-07 20:04:05

# the unit here is seconds - doesn't work
read_csv_arrow(
  tf,
  col_names = "x",
  col_types = "T",
  skip = 1
)
#> Error in `handle_csv_read_error()`:
#> ! Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005'
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:550  decoder_.Decode(data, size, quoted, &value)
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:123  status
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:554  parser.VisitColumn(col_index, visit)

# "t" maps to time32 with unit ms (a time of day, not a timestamp) - doesn't work
read_csv_arrow(
  tf,
  col_names = "x",
  col_types = "t",
  skip = 1
)
#> Error in `handle_csv_read_error()`:
#> ! Invalid: In CSV column #0: CSV conversion error to time32[ms]: invalid value '2018-10-07 19:04:05.005'
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:550  decoder_.Decode(data, size, quoted, &value)
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:123  status
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:554  parser.VisitColumn(col_index, visit)

# the unit here is inferred as ns - does work!
read_csv_arrow(
  tf,
  col_names = "x",
  col_types = "?",
  skip = 1,
  as_data_frame = FALSE
)
#> Table
#> 1 rows x 1 columns
#> $x <timestamp[ns]>
{code}

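It looks like what's happening here is that the {{col_types}} compact 
representations are mapped to a timestamp with units in seconds ("T") or a 
time32 with units in milliseconds ("t"), but the values in the file have 
sub-second precision (they are inferred as {{timestamp[ns]}} when no type is 
given), so they can't be parsed as {{timestamp[s]}}. The "t" case would 
presumably fail regardless of precision, since time32 is a time-of-day type and 
the value is a full date-time.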

You could get round this for now by specifying a schema for the names and 
column types instead of the {{readr}} shortcodes:

{code:r}
read_csv_arrow(
  tf,
  schema = schema(x = timestamp(unit = "us")),
  skip = 1
)
{code}
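
To check what the workaround actually produces, here is a minimal sketch that recreates the file from the reprex and keeps the result as an Arrow Table rather than a data frame, so the column type can be inspected. The object name {{tab}} is only illustrative, and the unit is the same "us" as above; I haven't tried other units here.

{code:r}
library(arrow)

# Recreate the one-row CSV from the reprex above
tf <- tempfile()
write.csv(data.frame(x = "2018-10-07 19:04:05.005"), tf, row.names = FALSE)

# Pass a full schema instead of the "T" shortcode. skip = 1 drops the header
# row, because the schema already supplies the column names.
tab <- read_csv_arrow(
  tf,
  schema = schema(x = timestamp(unit = "us")),
  skip = 1,
  as_data_frame = FALSE
)

# Inspect the column type that was actually used
tab$schema
{code}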

That said, this is something we should either fix or document.


> [R] can't explicitly convert a column as a sub-seconds timestamp from CSV (or 
> other delimited) file
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15599
>                 URL: https://issues.apache.org/jira/browse/ARROW-15599
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 6.0.1
>         Environment: R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.3 LTS
>            Reporter: SHIMA Tatsuya
>            Priority: Major
>
> I tried to read a CSV column as a timestamp, but it only worked when 
> `col_types` was not specified.
> I'm sorry if I missed something and this is the expected behavior. (It would 
> be great if you could add an example with `col_types` to the documentation.)
> {code:r}
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #>     timestamp
> t_string <- tibble::tibble(
>   x = "2018-10-07 19:04:05.005"
> )
> write_csv_arrow(t_string, "tmp.csv")
> read_csv_arrow(
>   "tmp.csv",
>   as_data_frame = FALSE
> )
> #> Table
> #> 1 rows x 1 columns
> #> $x <timestamp[ns]>
> read_csv_arrow(
>   "tmp.csv",
>   col_names = "x",
>   col_types = "?",
>   skip = 1,
>   as_data_frame = FALSE
> )
> #> Table
> #> 1 rows x 1 columns
> #> $x <timestamp[ns]>
> read_csv_arrow(
>   "tmp.csv",
>   col_names = "x",
>   col_types = "T",
>   skip = 1,
>   as_data_frame = FALSE
> )
> #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005'
> read_csv_arrow(
>   "tmp.csv",
>   col_names = "x",
>   col_types = "T",
>   as_data_frame = FALSE,
>   skip = 1,
>   timestamp_parsers = "%Y-%m-%d %H:%M:%S"
> )
> #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
