beanscg commented on issue #30859: URL: https://github.com/apache/arrow/issues/30859#issuecomment-4661716445
I checked current `main`, and the partial-write behavior looks like it can occur after dataset writing has already started:

- `python/pyarrow/parquet/core.py:2121-2125` exposes `write_to_dataset(...)` as the Parquet wrapper around dataset writing.
- `cpp/src/arrow/dataset/dataset_writer.cc:654-710` creates/gets a directory queue, obtains a writable chunk, and calls `dir_queue->StartWrite(next_chunk)`. If `StartWrite` returns an error, it returns that status after rows/open-file accounting has already been touched.
- The reopened repro with `month_day_nano_interval()` is useful because it shows an unsupported type can still leave a file behind.
- I did not see an open PR that appears to cover issue #30859.

Smallest fix path might be either a preflight Parquet schema-conversion check before opening writer output, or an error cleanup path for files created before an unsupported-type failure. A regression could assert the unsupported interval repro leaves no dataset file on failure.
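
A rough sketch of what the regression assertion could look like, assuming a pytest-style test in Python. The helper name `files_left_after_failed_write` and the writer `write_unsupported_interval` are hypothetical and not part of Arrow; the table construction is only an approximation of the repro in the issue, and whether the write actually raises depends on the Arrow version in use.

```python
import os
import tempfile


def files_left_after_failed_write(write_fn):
    """Call write_fn(root) on a fresh temporary directory.

    Returns the sorted relative paths of files found under root if
    write_fn raised, or None if write_fn completed without error.
    """
    with tempfile.TemporaryDirectory() as root:
        try:
            write_fn(root)
        except Exception:
            leftover = []
            for dirpath, _dirnames, filenames in os.walk(root):
                for name in filenames:
                    full = os.path.join(dirpath, name)
                    leftover.append(os.path.relpath(full, root))
            return sorted(leftover)
        return None


def write_unsupported_interval(root):
    """Hypothetical repro-style writer; needs pyarrow, imported lazily."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Assumption: a single null value is enough to exercise the
    # schema-conversion path for the interval type.
    column = pa.array([None], type=pa.month_day_nano_interval())
    table = pa.table({"interval": column})
    pq.write_to_dataset(table, root)
```

In a test, `assert files_left_after_failed_write(write_unsupported_interval) == []` would fail if the write errors out but leaves a file behind, and a `None` result would signal that the write no longer fails at all.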
