People ran into similar issues with such all-NA columns with Parquet as well (with the difference that Parquet actually supports a null type, although with a partitioned dataset this could still lead to conflicting schemas). The typical workaround is for the user to provide the schema when writing / converting the data to Arrow. For this reason, dask, for example, added a "schema" keyword to its "to_parquet" function (https://docs.dask.org/en/latest/generated/dask.dataframe.to_parquet.html), which also allows specifying the type for just one column, leaving the others to the normal type inference.
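In dask terms, that could look roughly like the sketch below (a minimal example, assuming the pyarrow engine; the column names are made up): "value" is all-NA, so its type is pinned explicitly while "id" is left to the normal inference.

    import pandas as pd
    import pyarrow as pa
    import dask.dataframe as dd

    # "value" is all-NA; without a schema override it would be inferred
    # as Arrow's null type, so pin it explicitly and infer the rest
    ddf = dd.from_pandas(
        pd.DataFrame({"id": [1, 2], "value": [None, None]}), npartitions=1
    )
    ddf.to_parquet("out/", schema={"value": pa.string()})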
Now, for ORC writing in Arrow itself, I agree it would be good to provide a way to write a column of null type.

On Mon, 22 Nov 2021 at 10:52, Antoine Pitrou <anto...@python.org> wrote:
>
> Le 21/11/2021 à 19:48, Ian Joiner a écrit :
> > I see.
> >
> > Now the question is what we should do about such columns in the ORC writer
> > as well as maybe some other writers since the Null type, as opposed to all
> > Null columns of a numeric or binary type, doesn’t exist in such formats.
>
> We could perhaps add an option to silently turn them into another type,
> but they wouldn't roundtrip properly unless we also serialize the Arrow
> schema as we do in Parquet.

Storing the schema as we do for Parquet might be a good idea in general to improve roundtripping? Not only for the null type, but e.g. also for timestamp resolution and timezones.

> For now, people will have to detect such columns and cast them manually,
> I think.
>
> Regards
>
> Antoine.
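For anyone running into this in the meantime, the manual detect-and-cast that Antoine describes could look roughly like this with pyarrow (a minimal sketch; the choice of pa.string() as the fallback type is arbitrary, and the column names are made up):

    import pyarrow as pa

    # "value" is all-None, so it is inferred as Arrow's null type
    table = pa.table({"id": [1, 2], "value": pa.array([None, None])})

    # Replace every null-typed field with a concrete type (string, arbitrarily)
    fields = [
        pa.field(f.name, pa.string()) if pa.types.is_null(f.type) else f
        for f in table.schema
    ]
    table = table.cast(pa.schema(fields))
    print(table.schema)  # "value" is now string, which the writer can handle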