kskalski opened a new issue, #4903:
URL: https://github.com/apache/arrow-rs/issues/4903

   **Describe the bug**
   When the rows processed by schema inference contain no data (are empty) for a given column, that column is inferred as nullable `DataType::Utf8`. This data type is effectively a "catch-all" that permits any values later on, but it is a limiting and, to a degree, incorrect behavior: the user is led to assume the column actually contained data, and that it was a string (or something else that forced the string type).
   
   **To Reproduce**
   ```csv
   int_column,null_column,string_column
   1,,"a"
   2,,"b"
   ```
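   For a self-contained reproduction, something along these lines should show the current behaviour (a minimal sketch assuming the `arrow_csv::reader::Format::infer_schema` API; exact builder method names may differ between `arrow-csv` versions):
   ```rust
   use std::io::Cursor;

   use arrow_csv::reader::Format;

   fn main() -> Result<(), arrow_schema::ArrowError> {
       let data = "int_column,null_column,string_column\n1,,\"a\"\n2,,\"b\"\n";

       // Infer the schema from the in-memory CSV, treating the first row as a header.
       let format = Format::default().with_header(true);
       let (schema, _rows_read) = format.infer_schema(Cursor::new(data), None)?;

       // Currently `null_column` is reported as nullable Utf8 rather than Null.
       for field in schema.fields() {
           println!("{}: {:?}", field.name(), field.data_type());
       }
       Ok(())
   }
   ```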
   
   **Expected behavior**
   Inference should return `int*`, `null`, `utf8` for the three columns, respectively.
   
   **Additional context**
   My algorithm uses inference with a limited number of rows as a kind of best-effort / incremental performance improvement: when I read some data and see that the inferred schema has nulls, I may repeat inference with more rows or without a row limit. If inference wrongly reports a data type that isn't there, I end up with an unnecessarily widened `Utf8` datatype, while later on the column actually contains just ints or booleans.
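   A rough sketch of that incremental strategy, again assuming the `Format::infer_schema` API; the nullable-`Utf8` check is only a stand-in heuristic for "this column may not have been determined", which is exactly the ambiguity this issue is about:
   ```rust
   use std::fs::File;

   use arrow_csv::reader::Format;
   use arrow_schema::{ArrowError, DataType, Schema};

   /// Infer a schema from a bounded prefix of the file; if any column still looks
   /// undetermined (approximated here by the nullable-Utf8 fallback), repeat the
   /// inference without a row limit.
   fn infer_with_fallback(path: &str, limit: usize) -> Result<Schema, ArrowError> {
       let format = Format::default().with_header(true);

       let (schema, _) = format.infer_schema(File::open(path)?, Some(limit))?;
       let maybe_undetermined = schema
           .fields()
           .iter()
           .any(|f| f.is_nullable() && f.data_type() == &DataType::Utf8);
       if !maybe_undetermined {
           return Ok(schema);
       }

       // Fall back to scanning the whole file to resolve the ambiguous columns.
       let (schema, _) = format.infer_schema(File::open(path)?, None)?;
       Ok(schema)
   }
   ```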
   
   Another use case is that I have several files of the same shape (or they could be several random offsets into the same file) and I want to infer a schema for each of them, then merge the schemas to see if any column still contains only nulls.
   With https://github.com/apache/arrow-rs/issues/4901 and a fix for the behavior described in this issue, I can implement the above strategy correctly.
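   As an illustration of that multi-file strategy, here is a hypothetical `infer_merged` helper, assuming `Format::infer_schema` and `Schema::try_merge`; whether the merge can resolve an empty column against a file where it has data depends on the empty column being inferred as `DataType::Null`, which is what this issue asks for:
   ```rust
   use std::fs::File;

   use arrow_csv::reader::Format;
   use arrow_schema::{ArrowError, Schema};

   /// Infer a schema per file and merge the results, so that a column that is
   /// empty in one file can take its type from a file where it has data.
   fn infer_merged(paths: &[&str], limit: Option<usize>) -> Result<Schema, ArrowError> {
       let format = Format::default().with_header(true);
       let mut schemas = Vec::with_capacity(paths.len());
       for path in paths {
           let (schema, _) = format.infer_schema(File::open(path)?, limit)?;
           schemas.push(schema);
       }
       Schema::try_merge(schemas)
   }
   ```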

