jgehrcke commented on issue #14968: URL: https://github.com/apache/arrow/issues/14968#issuecomment-1355168902
Understood the problem :). Manual bisection has shown that this line is where things crash: https://github.com/apache/arrow/blob/41b57d683ea73329af083540e6f67007bdd1d0c4/python/pyarrow/_dataset.pyx#L778

`sp.get()` above returns whatever `self.format.DefaultWriteOptions()` [here](https://github.com/apache/arrow/blob/41b57d683ea73329af083540e6f67007bdd1d0c4/python/pyarrow/_dataset.pyx#L890) returns. And in case of the ORC file format that's

```cpp
std::shared_ptr<FileWriteOptions> OrcFileFormat::DefaultWriteOptions() {
  // TODO (https://issues.apache.org/jira/browse/ARROW-13796)
  return nullptr;
}
```

as defined in [cpp/src/arrow/dataset/file_orc.cc](https://github.com/apache/arrow/blob/41b57d683ea73329af083540e6f67007bdd1d0c4/cpp/src/arrow/dataset/file_orc.cc#L221). Trying to invoke `type_name()` of `nullptr` explains the segfault.

Quote from JIRA ticket:

> https://github.com/apache/arrow/pull/10991 added basic support for ORC file format in the Datasets API, but didn't yet add support to write datasets to the ORC format.

https://arrow.apache.org/docs/python/dataset.html#dataset currently says:

> Currently, only Parquet, ORC, Feather / Arrow IPC, and CSV files are supported.

That kind of suggests that ORC support is complete. However, the [`write_dataset()` docs](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset) are explicit in the sense that ORC is _not_ documented as being supported:

> Currently supported: “parquet”, “ipc”/”arrow”/”feather”, and “csv”.
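
To illustrate the failure mode and one possible guard, here is a minimal pure-Python sketch. It is not the actual pyarrow code: `FileWriteOptions`, `default_write_options()` and `wrap_write_options()` are hypothetical stand-ins I made up for the C++/Cython pieces linked above, with `None` playing the role of `nullptr`. The idea is simply to check for a missing options object before touching it, so the caller gets an exception instead of a crash.

```python
from typing import Optional


class FileWriteOptions:
    """Hypothetical stand-in for the C++ FileWriteOptions object."""

    def __init__(self, type_name: str) -> None:
        self.type_name = type_name


def default_write_options(fmt: str) -> Optional[FileWriteOptions]:
    """Mimic DefaultWriteOptions(): for ORC there is nothing to return yet."""
    if fmt == "orc":
        return None  # mirrors `return nullptr;` in file_orc.cc
    return FileWriteOptions(fmt)


def wrap_write_options(fmt: str) -> FileWriteOptions:
    """Check for a missing options object before accessing it (assumed guard)."""
    options = default_write_options(fmt)
    if options is None:
        raise NotImplementedError(
            f"Writing datasets in {fmt!r} format is not supported"
        )
    return options


if __name__ == "__main__":
    print(wrap_write_options("parquet").type_name)
    try:
        wrap_write_options("orc")
    except NotImplementedError as exc:
        print(exc)
```

Whether the real check belongs in the Cython wrapper or in the C++ layer is a separate question; the sketch only shows the shape of the guard.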
