paleolimbot commented on issue #1950:
URL: https://github.com/apache/arrow-adbc/issues/1950#issuecomment-2346722372

   > I would recommend upcasting to the smallest integer type that can hold all of the values:
   
   >    uint8 -> int16
   >    uint16 -> int32
   >    uint32 -> int64
   
   This is definitely the safest way to go, and it had already been baked into `PostgresType::FromSchema()`. In #2153 we updated the ingestion behaviour to actually use `PostgresType::FromSchema()`, so a "create" (or an "append" to a table created by ADBC) should now work:
   
   ``` r
   library(adbcdrivermanager)
   library(nanoarrow)
   
   con <- adbc_database_init(
     adbcpostgresql::adbcpostgresql(),
     uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
   ) |> 
     adbc_connection_init()
   
   df <- tibble::tibble(
     uint8_col = 246:255,
     uint16_col = 65526:65535,
     uint32_col = (.Machine$integer.max + 1):(.Machine$integer.max + 10)
   )
   
   array <- df |> 
     nanoarrow::as_nanoarrow_array(
       schema = na_struct(
         list(
           uint8_col = na_uint8(),
           uint16_col = na_uint16(),
           uint32_col = na_uint32()
         )
       )
     )
   
   con |> 
     execute_adbc("DROP TABLE IF EXISTS adbc_test")
   
   array |> 
     write_adbc(con, "adbc_test")
   
   con |> 
     read_adbc("select * from adbc_test") |> 
     tibble::as_tibble()
   #> # A tibble: 10 × 3
   #>    uint8_col uint16_col uint32_col
   #>        <int>      <int>      <dbl>
   #>  1       246      65526 2147483648
   #>  2       247      65527 2147483649
   #>  3       248      65528 2147483650
   #>  4       249      65529 2147483651
   #>  5       250      65530 2147483652
   #>  6       251      65531 2147483653
   #>  7       252      65532 2147483654
   #>  8       253      65533 2147483655
   #>  9       254      65534 2147483656
   #> 10       255      65535 2147483657
   ```
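   
   To double-check what `PostgresType::FromSchema()` picked, one option (a minimal sketch, assuming the table created by the example above is still around on the same connection) is to query `information_schema.columns`; with the upcasting above the reported types should be `smallint`, `integer`, and `bigint`:
   
   ```r
   # Inspect the column types that ADBC generated for adbc_test
   # (sketch; assumes the example above has just been run)
   con |> 
     read_adbc(
       "SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_name = 'adbc_test'"
     ) |> 
     tibble::as_tibble()
   ```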
   
   Unfortunately, the method we're using to insert efficiently (generating COPY data) requires that the types match exactly, so appending Arrow data that happens to have an unsigned integer column to an existing table will still fail:
   
   ```r
   con |> 
     execute_adbc("DROP TABLE IF EXISTS adbc_test")
   con |> 
     execute_adbc("CREATE TABLE adbc_test (uint8_col int2, uint16_col int2, 
uint32_col int4)")
   array |> 
     write_adbc(con, "adbc_test", mode = "append")
   #> Error in adbc_statement_execute_query(stmt): INVALID_ARGUMENT: [libpq] Failed to execute COPY statement: PGRES_FATAL_ERROR ERROR:  incorrect binary data format
   #> CONTEXT:  COPY adbc_test, line 1, column uint16_col
   ```
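   
   Until that is relaxed, one workaround (a sketch, assuming the upcast mapping quoted above) is to pre-create the table using the same Postgres types the driver itself would have generated, so the COPY data matches the target columns exactly:
   
   ```r
   # Pre-create the table with the driver's own upcast mapping
   # (uint8 -> int2, uint16 -> int4, uint32 -> int8) so the append succeeds
   con |> 
     execute_adbc("DROP TABLE IF EXISTS adbc_test")
   con |> 
     execute_adbc("CREATE TABLE adbc_test (uint8_col int2, uint16_col int4, uint32_col int8)")
   array |> 
     write_adbc(con, "adbc_test", mode = "append")
   ```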
   
   > uint64 -> ???
   
   For a fresh insert of Arrow data (i.e., when we are forced to generate a CREATE TABLE statement), this should probably be inferred as `bigint` because it preserves the "integerness", even though it will error at runtime if passed very large values. The workaround here would be to issue your own `CREATE TABLE (bigint_col BYTEA)`, and we will need to support generating COPY for various Postgres types outside the 1:1 mapping that is currently used by the COPY writer.

