[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360323#comment-17360323
 ] 

Nate Clark commented on ARROW-13028:
------------------------------------

Ideally one would pass the column types if they are known, but in my use case I 
rely on the reader's type inference to learn what the column types are. When 
relying on the reader for the types, the only way to get 32-bit values is to 
re-parse the CSV with the type forced to 32 bits, and that second parse fails 
if any value does not fit in 32 bits.
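
To make the cost of that second pass concrete, here is a minimal sketch using 
only the C++ standard library. It does not call Arrow; ParseColumn and 
FitsInInt32 are hypothetical helpers, and the column is assumed to already be 
available as a vector of cell strings.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper, not part of Arrow: parse every cell as integer type T.
// Returns std::nullopt as soon as one cell overflows T or is not an integer.
template <typename T>
std::optional<std::vector<T>> ParseColumn(const std::vector<std::string>& cells) {
  std::vector<T> out;
  out.reserve(cells.size());
  for (const std::string& cell : cells) {
    T value{};
    const char* begin = cell.data();
    const char* end = begin + cell.size();
    auto [ptr, ec] = std::from_chars(begin, end, value);
    // ec is set on overflow or invalid input; ptr != end means trailing junk.
    if (ec != std::errc() || ptr != end) return std::nullopt;
    out.push_back(value);
  }
  return out;
}

// The "second pass": force the column to 32 bits and report whether it worked.
bool FitsInInt32(const std::vector<std::string>& cells) {
  return ParseColumn<std::int32_t>(cells).has_value();
}
```

A caller without a reader-side option would have to run the equivalent of 
FitsInInt32 over the whole file after the first inferred parse, which is the 
extra work being described here.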

 

It is true that a 64-bit number in one of the later blocks would cause a 
parsing error, but the same is true if a column is inferred as int and is in 
fact a float, or if a column is empty 40% of the time and the first block 
happens to have no data for it. This is more a limitation of the schema being 
determined by the first block and not being able to change after that.

 

One of the reasons the default is to not try 32-bit values is to avoid 
potential parse errors on subsequent blocks, so this option should only really 
be used if the caller knows that all numeric columns can be represented in 32 
bits or can handle the parse error.
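
The trade-off can be sketched with the standard library alone. This is an 
illustration, not Arrow's implementation: the try_32bit flag stands in for the 
proposed option (its name is an assumption), IntWidth and the three functions 
are hypothetical, and the column is assumed to hold integers only.

```cpp
#include <charconv>
#include <cstdint>
#include <string>
#include <system_error>
#include <vector>

enum class IntWidth { kInt32, kInt64 };

// True if `cell` parses completely as an integer of type T without overflow.
template <typename T>
bool ParsesAs(const std::string& cell) {
  T value{};
  const char* end = cell.data() + cell.size();
  auto [ptr, ec] = std::from_chars(cell.data(), end, value);
  return ec == std::errc() && ptr == end;
}

// Infer the integer width from the first block only. With try_32bit off
// (the default in the proposal) the result is always 64-bit.
IntWidth InferFromFirstBlock(const std::vector<std::string>& block,
                             bool try_32bit) {
  if (try_32bit) {
    bool all_fit = true;
    for (const std::string& cell : block) {
      all_fit = all_fit && ParsesAs<std::int32_t>(cell);
    }
    if (all_fit) return IntWidth::kInt32;
  }
  return IntWidth::kInt64;
}

// Convert a later block with the width frozen by the first block.
// Returns false where a real reader would raise a parse error.
bool BlockConverts(const std::vector<std::string>& block, IntWidth width) {
  for (const std::string& cell : block) {
    const bool ok = (width == IntWidth::kInt32) ? ParsesAs<std::int32_t>(cell)
                                                : ParsesAs<std::int64_t>(cell);
    if (!ok) return false;
  }
  return true;
}
```

With a first block of small values and a later block containing 5000000000, 
the 64-bit default converts both blocks, while the 32-bit attempt freezes the 
column as 32-bit and then fails on the later block.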

> [C++] CSV add convert option to attempt 32bit number inferences
> ---------------------------------------------------------------
>
>                 Key: ARROW-13028
>                 URL: https://issues.apache.org/jira/browse/ARROW-13028
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nate Clark
>            Assignee: Nate Clark
>            Priority: Major
>
> When types are being inferred by the CSV reader, numbers are always 64-bit. 
> For large data sets it could be better to use 32-bit types to save overall 
> memory. To do this it would be useful to add an option to ConvertOptions to 
> try 32-bit numbers before 64-bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
