CurtHagenlocher commented on issue #3480: URL: https://github.com/apache/arrow-adbc/issues/3480#issuecomment-3378924147
Thanks for looking! I'm fairly convinced at this point that there's nothing obviously wrong with the driver itself. While I'm waiting to hear back from Snowflake, I'll try to find the time to make a standalone repro. The existing repro happens in our product, which is implemented in C# and about as far from "standalone" as you can imagine. It uses entirely default settings for ingestion.

The input batches are relatively small, but from what I can tell of the driver source, the input batch size is entirely decoupled from the number of rows in the uploaded Parquet files. It eventually uploads 72 individual files. The first 48 are ~13MB in size with ~354k rows each, while the remaining 24 are ~6MB with ~157k rows. This matches my read of the code, which suggests that records are individually queued to a channel and then picked up by one of N readers that build the Parquet files. The default value of N is `runtime.NumCPU`, and this machine does indeed have 24 logical CPUs.

I was surprised that records are apparently queued to the channel one at a time, as I would expect that to have a lot of overhead, but that doesn't really appear to be the case.
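For anyone following along, here's a minimal sketch of the fan-out shape I'm describing, not the driver's actual code: records are sent individually on a single channel and N workers (defaulting to `runtime.NumCPU()`) drain it, each accumulating rows for its own output file. The names (`record`, `fanOutIngest`, `consume`) are mine, invented for illustration.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// record stands in for a single ingested row; in the real driver the channel
// carries Arrow data, but the shape of the fan-out is the same.
type record struct {
	id int
}

// fanOutIngest queues every record individually on one channel and lets n
// workers drain it; consume is a stand-in for the per-worker logic that
// would build a Parquet file from the rows it receives.
func fanOutIngest(records []record, n int, consume func(workerID int, in <-chan record)) {
	ch := make(chan record) // one hand-off per record

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			consume(id, ch)
		}(i)
	}

	for _, r := range records {
		ch <- r // records are sent one at a time
	}
	close(ch)
	wg.Wait()
}

func main() {
	n := runtime.NumCPU() // default reader count described above
	records := make([]record, 1_000_000)
	for i := range records {
		records[i] = record{id: i}
	}

	var total int64
	fanOutIngest(records, n, func(workerID int, in <-chan record) {
		var rows int64
		for range in {
			rows++ // a real worker would append the row to its Parquet writer here
		}
		atomic.AddInt64(&total, rows)
		fmt.Printf("worker %d consumed %d rows\n", workerID, rows)
	})
	fmt.Printf("total: %d rows across %d workers\n", total, n)
}
```

Running something like this is roughly how I convinced myself that per-record channel sends aren't the bottleneck; the per-send overhead is small relative to the work each worker does with the row.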
