[GitHub] [arrow] jgehrcke commented on issue #14968: [Python] `write_dataset(table, format='orc')`: segfault

GitBox Fri, 16 Dec 2022 08:24:18 -0800


jgehrcke commented on issue #14968:
URL: https://github.com/apache/arrow/issues/14968#issuecomment-1355168902


   Understood the problem :).
   
   Manual bisection has shown that this line is where things crash:
   
   
https://github.com/apache/arrow/blob/41b57d683ea73329af083540e6f67007bdd1d0c4/python/pyarrow/_dataset.pyx#L778
   
   `sp.get()` above returns whatever `self.format.DefaultWriteOptions()` [here](https://github.com/apache/arrow/blob/41b57d683ea73329af083540e6f67007bdd1d0c4/python/pyarrow/_dataset.pyx#L890) returns.
   
   In the case of the ORC file format, that is
   
   ```cpp 
   std::shared_ptr<FileWriteOptions> OrcFileFormat::DefaultWriteOptions() {
     // TODO (https://issues.apache.org/jira/browse/ARROW-13796)
     return nullptr;
   }
   ```
   
   as defined in 
[cpp/src/arrow/dataset/file_orc.cc](https://github.com/apache/arrow/blob/41b57d683ea73329af083540e6f67007bdd1d0c4/cpp/src/arrow/dataset/file_orc.cc#L221).
   
   Invoking `type_name()` on that null pointer explains the segfault.
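
A minimal sketch of the failure mode and a defensive guard, for illustration only. `FakeWriteOptions`, `DefaultWriteOptionsUnimplemented()` and `SafeTypeName()` are hypothetical stand-ins, not Arrow's actual classes or the fix that will land upstream; the only thing taken from the source is that the ORC `DefaultWriteOptions()` returns `nullptr` and that `type_name()` is then called on it.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for arrow::dataset::FileWriteOptions (not the real class).
struct FakeWriteOptions {
  std::string type_name() const { return "fake"; }
};

// Mirrors the ORC behaviour quoted above: writing is not implemented yet,
// so no options object is returned.
std::shared_ptr<FakeWriteOptions> DefaultWriteOptionsUnimplemented() {
  return nullptr;
}

// Hypothetical guard: check the pointer before dereferencing it.
// Calling sp->type_name() on a null shared_ptr is undefined behaviour,
// which in practice shows up as the segfault described above.
std::string SafeTypeName(const std::shared_ptr<FakeWriteOptions>& sp) {
  if (sp == nullptr) {
    return "<no write options>";
  }
  return sp->type_name();
}
```

With a guard of this kind, the caller can turn the missing options into a regular error (for example a "not implemented" exception on the Python side) instead of crashing the interpreter.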
   
   Quote from JIRA ticket:
   
   > https://github.com/apache/arrow/pull/10991 added basic support for ORC file format in the Datasets API, but didn't yet add support to write datasets to the ORC format.
   
   https://arrow.apache.org/docs/python/dataset.html#dataset currently says:
   
   > Currently, only Parquet, ORC, Feather / Arrow IPC, and CSV files are supported.
   
   That suggests that ORC support is complete, including writing.
   
   However, the [`write_dataset()` docs](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset) are explicit: ORC is _not_ listed as a supported format:
   
   > Currently supported: “parquet”, “ipc”/“arrow”/“feather”, and “csv”.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
