joellubi commented on issue #2094: URL: https://github.com/apache/arrow-adbc/issues/2094#issuecomment-2313516354
I've been able to reproduce this issue in Python and then in Go itself. A batch of files is uploaded via PUT, but sometimes only a subset of them actually gets copied. In my initial repro I was seeing this failure between 5-10% of the time.

I just pushed up a PR that I believe _improves_ this significantly, though I don't think it completely solves it: #2106. It's long been an annoyance to me that bulk ingestion uploads at minimum `NumCPU` Parquet files, even if there are only enough records to fill one of them. The fix here is to prevent empty files from getting uploaded, and instead only upload one file if there's one file's worth of data. If I run the reproduction on my branch with a data profile like yours, @davlee1972 (one file at a time, a few hundred rows per file), I can no longer reproduce the issue (i.e. 0% failure rate).

It seems that Snowflake may have trouble keeping track of large numbers of files on COPY, as the issue does still persist with some larger ingestions that contain many small batches (i.e. lots of small files uploaded via PUT). But reducing the number of unnecessary PUTs does seem to significantly reduce the frequency of occurrences.

I'm still looking into a more definitive solution. Everything appears to happen in the correct order (i.e. no race) even when the failure occurs, so it's possible the issue is on Snowflake's side. We may need to do our own accounting of uploaded vs. staged files within the driver and keep running COPY until they match, instead of relying on Snowflake to handle this. I'll keep this issue and the open PR updated.
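For illustration only, here's a rough sketch of what that accounting could look like. This is not the driver's actual code; the names (`copyUntilComplete`, `tableName`, `stageName`, `uploadedCount`) are hypothetical, and it assumes the ingestion path already knows how many files it PUT to the stage. It simply counts COPY result rows, whereas a real implementation would check each row's `status` column and handle the "0 files processed" summary row.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/snowflakedb/gosnowflake"
)

// copyUntilComplete re-runs COPY INTO until the cumulative number of files
// reported in the COPY results matches the number of files we PUT to the
// stage. Snowflake's load metadata makes COPY idempotent per file, so a
// repeated COPY only picks up files that haven't been loaded yet.
func copyUntilComplete(ctx context.Context, db *sql.DB, tableName, stageName string, uploadedCount int) error {
	copied := 0
	for attempt := 0; attempt < 5 && copied < uploadedCount; attempt++ {
		rows, err := db.QueryContext(ctx,
			fmt.Sprintf("COPY INTO %s FROM @%s", tableName, stageName))
		if err != nil {
			return err
		}
		// Simplification: count result rows; COPY returns one row per file
		// it processed. A production version would inspect the status column.
		for rows.Next() {
			copied++
		}
		if err := rows.Close(); err != nil {
			return err
		}
		if copied < uploadedCount {
			time.Sleep(time.Second) // back off before trying again
		}
	}
	if copied < uploadedCount {
		return fmt.Errorf("only %d of %d uploaded files were copied", copied, uploadedCount)
	}
	return nil
}
```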
