zeroshade commented on issue #3480:
URL: https://github.com/apache/arrow-adbc/issues/3480#issuecomment-3378708958
@CurtHagenlocher So this is interesting. I put together a pure Go test using
the NYC Taxi dataset (you mentioned 20M rows, so that seemed like an easy way
to get a comparable volume of data).
I used a subset of 27,982,347 rows to test bulk ingestion into Snowflake with
the ADBC driver:
* With the default settings, the driver uploaded around 6 files and the total
ingestion took under 1 minute.
* When I artificially forced the incoming data into smaller batches (see the
sketch below), it uploaded roughly 90 smaller files and the ingestion took
just over 1.5 minutes.
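For context, this is roughly how I shrank the batches. The `reChunk` helper below is an illustrative sketch rather than the exact code I ran, and it assumes the arrow-go v18 module path:

```go
import (
	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/array"
)

// reChunk splits every record coming out of src into slices of at most
// maxRows rows (maxRows must be > 0) and wraps the result in a new
// RecordReader. Record.NewSlice shares the underlying buffers, so no data
// is copied.
func reChunk(src array.RecordReader, maxRows int64) (array.RecordReader, error) {
	var chunks []arrow.Record
	for src.Next() {
		rec := src.Record()
		for start := int64(0); start < rec.NumRows(); start += maxRows {
			end := start + maxRows
			if end > rec.NumRows() {
				end = rec.NumRows()
			}
			chunks = append(chunks, rec.NewSlice(start, end))
		}
	}
	if err := src.Err(); err != nil {
		return nil, err
	}
	return array.NewRecordReader(src.Schema(), chunks)
}
```

Binding something like `reChunk(reader, 10_000)` instead of `reader` is the sort of change I mean by forcing smaller batches.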
I can keep playing with the settings to move the performance around, but so
far I haven't managed to reproduce anything close to the hour-plus run you
described. Can you share more details about that setup?
* Which driver manager/language were you calling the ADBC driver from?
* What was the batch size of the record reader you were feeding the data
through? (See the snippet after these questions for the knob I mean.)
* Were you using the default options or custom concurrency settings?
* Is 27.9M rows of the NYC Yellow Taxi dataset representative enough of your
data for this to be a comparable test?
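On the batch-size question, here's a hedged sketch of how that is typically controlled with arrow-go's pqarrow reader for a single file (not the exact reader I used); `BatchSize` sets the maximum number of rows per record:

```go
import (
	"context"

	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet/file"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

// openParquetRecords opens one parquet file and returns a record reader whose
// batches hold at most batchSize rows. Illustrative helper only; real code
// would also arrange to close the underlying file reader when done.
func openParquetRecords(ctx context.Context, path string, batchSize int64) (pqarrow.RecordReader, error) {
	rdr, err := file.OpenParquetFile(path, false)
	if err != nil {
		return nil, err
	}
	arrRdr, err := pqarrow.NewFileReader(rdr,
		pqarrow.ArrowReadProperties{BatchSize: batchSize, Parallel: true},
		memory.DefaultAllocator)
	if err != nil {
		return nil, err
	}
	// nil column and row-group selections mean "read everything"
	return arrRdr.GetRecordReader(ctx, nil, nil)
}
```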
For reference, here's the Go code for my little test:
```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	"github.com/apache/arrow-adbc/go/adbc"
	"github.com/apache/arrow-adbc/go/adbc/driver/snowflake"
	// match the Arrow major version to the one your arrow-adbc release uses
	"github.com/apache/arrow-go/v18/arrow/memory"
)

func main() {
	drv := snowflake.NewDriver(memory.DefaultAllocator)
	db, err := drv.NewDatabase(map[string]string{
		"uri": os.Getenv("SNOWFLAKE_URI"),
	})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	ctx := context.Background()
	conn, err := db.Open(ctx)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// collect the parquet files that make up the test dataset
	matches, err := filepath.Glob("*.parquet")
	if err != nil {
		panic(err)
	}
	// ... create a record reader ("reader") over the parquet files in matches

	stmt, err := conn.NewStatement()
	if err != nil {
		panic(err)
	}
	defer stmt.Close()

	// bind the stream and point bulk ingestion at the target table
	if err := stmt.BindStream(ctx, reader); err != nil {
		panic(err)
	}
	if err := stmt.SetOption(adbc.OptionKeyIngestTargetTable, "adbc_slow_ingest"); err != nil {
		panic(err)
	}

	n, err := stmt.ExecuteUpdate(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println("records ingested:", n)
}
```
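If it helps, these are the kinds of statement options I meant by "settings": the Snowflake driver exposes concurrency and target-file-size knobs for bulk ingestion. A minimal sketch, with option keys taken from the Snowflake driver docs (worth double-checking against the driver version you're on) and example values rather than recommendations:

```go
import "github.com/apache/arrow-adbc/go/adbc"

// applyIngestTuning shows how the Snowflake bulk-ingest knobs are set on a
// statement; the values below are examples, not recommendations.
func applyIngestTuning(stmt adbc.Statement) error {
	opts := map[string]string{
		"adbc.snowflake.statement.ingest_writer_concurrency": "8",
		"adbc.snowflake.statement.ingest_upload_concurrency": "8",
		"adbc.snowflake.statement.ingest_copy_concurrency":   "4",
		// target size (in bytes) of each file written to the stage
		"adbc.snowflake.statement.ingest_target_file_size": "10485760",
	}
	for k, v := range opts {
		if err := stmt.SetOption(k, v); err != nil {
			return err
		}
	}
	return nil
}
```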
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]