kskalski opened a new issue, #4903: URL: https://github.com/apache/arrow-rs/issues/4903
**Describe the bug**

When the rows processed by schema inference contain no data (are empty) for a given column, that column is inferred as nullable `DataType::Utf8`. This type is in effect a "catch-all" that permits any value later on, but it is also limiting and, to a degree, incorrect: the user is led to assume the column did contain some data, and that it was a string (or something that forced a string type).

**To Reproduce**

```csv
int_column,null_column,string_column
1,,"a"
2,,"b"
```

**Expected behavior**

Inference should return `int*`, `null`, `utf8`.

**Additional context**

My algorithm uses inference with a limited number of rows as a kind of best-effort / incremental performance improvement: when I read some data and see that the inferred schema contains nulls, I may repeat inference with more rows or without a row limit. If inference wrongly returns a data type that isn't actually there, I end up with an unnecessarily widened `Utf8` data type, while in fact the column later contains just ints or booleans.

Another use case: I have several files of the same shape (or several random offsets into the same file), and I want to infer a schema for each of them, then merge the schemas to see whether any column still contains only nulls. With https://github.com/apache/arrow-rs/issues/4901 and a fix for the behavior described in this issue, I can implement the above strategy correctly.
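To make the requested behavior concrete, here is a minimal self-contained sketch of per-column CSV type inference in which an all-empty column is reported as `Null` instead of falling back to `Utf8`. The names `InferredType`, `infer_value`, `merge`, and `infer_schema` are illustrative, not the arrow-rs API; the real implementation lives in the `arrow-csv` inference code.

```rust
// Sketch only: illustrates the desired inference semantics from this issue,
// not the actual arrow-rs implementation.

#[derive(Debug, PartialEq, Clone, Copy)]
enum InferredType {
    Null,    // no non-empty value seen yet
    Int64,
    Boolean,
    Utf8,    // catch-all for anything else
}

// Classify a single CSV cell (hypothetical helper).
fn infer_value(v: &str) -> InferredType {
    if v.is_empty() {
        InferredType::Null
    } else if v.parse::<i64>().is_ok() {
        InferredType::Int64
    } else if v == "true" || v == "false" {
        InferredType::Boolean
    } else {
        InferredType::Utf8
    }
}

// Widen two inferred types. `Null` is the identity element, so empty
// cells never force a column to `Utf8` -- the point of this issue.
fn merge(a: InferredType, b: InferredType) -> InferredType {
    use InferredType::*;
    match (a, b) {
        (Null, t) | (t, Null) => t,
        (x, y) if x == y => x,
        _ => Utf8,
    }
}

// Fold `merge` over every row; columns with no data stay `Null`.
fn infer_schema(rows: &[Vec<&str>]) -> Vec<InferredType> {
    let cols = rows.first().map_or(0, |r| r.len());
    let mut types = vec![InferredType::Null; cols];
    for row in rows {
        for (i, cell) in row.iter().enumerate() {
            types[i] = merge(types[i], infer_value(cell));
        }
    }
    types
}

fn main() {
    // The rows from the reproduction above: int, all-empty, string.
    let rows = vec![vec!["1", "", "a"], vec!["2", "", "b"]];
    println!("{:?}", infer_schema(&rows)); // [Int64, Null, Utf8]
}
```

Because `Null` is the identity of `merge`, the same function also serves the second use case: merging schemas inferred from different files (or offsets) leaves a column `Null` only if it was empty everywhere.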
