Re: [I] [Python] Parquet write_to_dataset leads to partial write when unsupported datatype is passed in table [arrow]

via GitHub Tue, 09 Jun 2026 09:14:37 -0700


beanscg commented on issue #30859:
URL: https://github.com/apache/arrow/issues/30859#issuecomment-4661716445


   I checked current `main`, and the partial write looks possible because the 
error can surface after dataset writing has already started:
   
   - `python/pyarrow/parquet/core.py:2121-2125` exposes `write_to_dataset(...)` 
as the Parquet wrapper around dataset writing.
   - `cpp/src/arrow/dataset/dataset_writer.cc:654-710` creates or gets a 
directory queue, obtains a writable chunk, and calls 
`dir_queue->StartWrite(next_chunk)`. If `StartWrite` returns an error, that 
status is returned only after the row and open-file accounting has already 
been updated.
   - The reopened repro with `month_day_nano_interval()` is useful because it 
shows an unsupported type can still leave a file behind.
   - I did not see an open PR that appears to cover issue #30859.
   
   The smallest fix might be either a preflight Parquet schema-conversion 
check before opening writer output, or an error cleanup path for files created 
before an unsupported-type failure. A regression test could assert that the 
unsupported interval repro leaves no dataset file on failure.
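
   To make the cleanup option and the regression assertion concrete, here is a 
minimal caller-side sketch in plain Python. It is not the Arrow implementation: 
`write_with_cleanup`, `list_files`, and `failing_writer` are hypothetical names 
I made up, and `failing_writer` only imitates the failure mode (a file is 
created, then the write raises) rather than calling pyarrow.

```python
import os
import tempfile


def list_files(root):
    """Return the set of all file paths currently under root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.join(dirpath, name))
    return found


def write_with_cleanup(write_fn, root):
    """Call write_fn(root); if it raises, delete the files it created.

    Hypothetical helper: write_fn stands in for any dataset write call.
    Files that existed before the call are left untouched.
    """
    os.makedirs(root, exist_ok=True)
    before = list_files(root)
    try:
        return write_fn(root)
    except Exception:
        # Remove only files that appeared during the failed write.
        for path in list_files(root) - before:
            os.remove(path)
        raise


def failing_writer(root):
    # Stand-in for the reported behavior: create a file, then fail.
    part_dir = os.path.join(root, "part=0")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "data-0.parquet"), "wb") as f:
        f.write(b"")
    raise NotImplementedError("unsupported type")


def leftover_files_after_failure():
    """Regression-style check: which files remain after a failed write?"""
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "dataset")
        try:
            write_with_cleanup(failing_writer, root)
        except NotImplementedError:
            pass
        return sorted(list_files(root))
```

   In a real regression test, `write_fn` would presumably wrap 
`pq.write_to_dataset(table, root)` with a table containing a 
`month_day_nano_interval()` column. Note that this sketch leaves empty 
partition directories behind and assumes no other writer is adding files under 
the same root at the same time.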
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
