beanscg commented on issue #30859:
URL: https://github.com/apache/arrow/issues/30859#issuecomment-4661716445

   I checked current `main`, and the partial-write behavior looks like it can 
occur after dataset writing has already started:
   
   - `python/pyarrow/parquet/core.py:2121-2125` exposes `write_to_dataset(...)` 
as the Parquet wrapper around dataset writing.
   - `cpp/src/arrow/dataset/dataset_writer.cc:654-710` creates or gets a 
directory queue, obtains a writable chunk, and calls 
`dir_queue->StartWrite(next_chunk)`. If `StartWrite` returns an error, that 
status is returned only after the row and open-file accounting has already 
been touched.
   - The reopened repro with `month_day_nano_interval()` is useful because it 
shows an unsupported type can still leave a file behind.
   - I did not see an open PR that appears to cover issue #30859.
   
   The smallest fix might be either a preflight Parquet schema-conversion 
check before opening writer output, or an error cleanup path for files created 
before an unsupported-type failure. A regression test could assert that the 
unsupported interval repro leaves no dataset file on failure.
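   
   A rough sketch of the shape such a regression assertion could take, using 
only the Python standard library. `files_left_after_failed_write` and 
`leaky_writer` are hypothetical names for illustration, not Arrow APIs; 
`leaky_writer` merely imitates the reported behavior (file opened first, 
failure afterwards) so the helper can be exercised without pyarrow:

```python
import os
import tempfile


def files_left_after_failed_write(write_fn, base_dir):
    """Call write_fn(base_dir), expecting it to raise; return leftover files.

    write_fn is a hypothetical stand-in for the failing dataset write.
    Returned paths are relative to base_dir and sorted.
    """
    try:
        write_fn(base_dir)
    except Exception:
        pass  # the failure is expected; only the leftovers matter here
    else:
        raise AssertionError("write was expected to fail but succeeded")

    leftovers = []
    # os.walk yields nothing if base_dir was never created.
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            full_path = os.path.join(root, name)
            leftovers.append(os.path.relpath(full_path, base_dir))
    return sorted(leftovers)


def leaky_writer(base_dir):
    # Hypothetical writer imitating the reported behavior: it creates an
    # output file first and only then fails on an unsupported type.
    os.makedirs(base_dir, exist_ok=True)
    with open(os.path.join(base_dir, "part-0.parquet"), "wb"):
        pass
    raise NotImplementedError("unsupported type")


with tempfile.TemporaryDirectory() as tmp:
    leaked = files_left_after_failed_write(leaky_writer, os.path.join(tmp, "ds"))
    # The imitation writer leaves exactly one file behind.
```

   In an actual regression test, `write_fn` would presumably wrap 
`pq.write_to_dataset(table, base_dir)` for a table with a 
`month_day_nano_interval()` column, and the test would assert that the 
returned list is empty.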
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.