joellubi commented on issue #2094: URL: https://github.com/apache/arrow-adbc/issues/2094#issuecomment-2313516354
I've been able to reproduce this issue in Python and then in Go itself. A batch of files is uploaded via PUT, but sometimes only a subset of them actually gets copied. In my initial repro I was seeing this failure between 5-10% of the time.

I just pushed up a PR that I believe _improves_ this significantly, though I don't think it completely solves it: #2106. It's long been an annoyance to me that bulk ingestion uploads at minimum `NumCPU` Parquet files, even if there are only enough records to fill one of them. The fix here is to prevent empty files from getting uploaded, and instead only upload one file if there's one file's worth of data. If I run the reproduction on my branch with a data profile like yours, @davlee1972 (one file at a time, a few hundred rows per file), I can no longer reproduce the issue (i.e. 0% failure rate).

It seems that Snowflake may have trouble keeping track of large numbers of files on COPY, as the issue does still persist with some larger ingestions that contain many small batches (i.e. lots of small files uploaded via PUT). But reducing the number of unnecessary PUTs does seem to significantly reduce the frequency of occurrences.

I'm still looking into a more definitive solution. Everything appears to happen in the correct order (i.e. no race) even when the failure occurs, so it's possible the issue is on Snowflake's side. We may need to do our own accounting of uploaded vs. staged files within the driver and keep running COPY until they match, instead of relying on Snowflake to handle this. I'll keep this issue and the open PR updated.
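For illustration only, here's a rough sketch of what that accounting could look like. This is not the driver's actual code; the names (`copyUntilComplete`, `tableName`, `stageName`, `uploadedCount`) are hypothetical, and it assumes the ingestion path already knows how many files it PUT to the stage. It simply counts COPY result rows, whereas a real implementation would check each row's `status` column and handle the "0 files processed" summary row.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/snowflakedb/gosnowflake"
)

// copyUntilComplete re-runs COPY INTO until the cumulative number of files
// reported in the COPY results matches the number of files we PUT to the
// stage. Snowflake's load metadata makes COPY idempotent per file, so a
// repeated COPY only picks up files that haven't been loaded yet.
func copyUntilComplete(ctx context.Context, db *sql.DB, tableName, stageName string, uploadedCount int) error {
	copied := 0
	for attempt := 0; attempt < 5 && copied < uploadedCount; attempt++ {
		rows, err := db.QueryContext(ctx,
			fmt.Sprintf("COPY INTO %s FROM @%s", tableName, stageName))
		if err != nil {
			return err
		}
		// Simplification: count result rows; COPY returns one row per file
		// it processed. A production version would inspect the status column.
		for rows.Next() {
			copied++
		}
		if err := rows.Close(); err != nil {
			return err
		}
		if copied < uploadedCount {
			time.Sleep(time.Second) // back off before trying again
		}
	}
	if copied < uploadedCount {
		return fmt.Errorf("only %d of %d uploaded files were copied", copied, uploadedCount)
	}
	return nil
}
```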
