[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360301#comment-17360301
 ] 

Weston Pace commented on ARROW-13028:
-------------------------------------

The problem is that the miss may not be detected until some number of blocks 
have already been processed.  The file-based CSV reader handles this by going 
back through all of the already-processed blocks and upcasting them to the 
looser type, so it can be a non-trivial performance hit.  The streaming CSV 
reader (used by the datasets API), however, isn't so lenient.  It infers types 
from the first block (default 1MB) of data alone, and the complexity of doing 
otherwise is pretty significant.  I think that could cause an issue here: if 
the first value that needs more than 32 bits doesn't appear until after the 
first block, you will get parsing errors.
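
To make the difference between the two behaviours concrete, here is a minimal 
pure-Python sketch.  It is not Arrow's implementation and uses no Arrow APIs; 
the function names (narrowest_type, convert, read_streaming, read_with_upcast) 
and the tiny integer "blocks" are hypothetical stand-ins for parsed CSV blocks.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def narrowest_type(values):
    """Return 'int32' if every value fits in 32 bits, else 'int64'."""
    fits = all(INT32_MIN <= v <= INT32_MAX for v in values)
    return "int32" if fits else "int64"


def convert(block, dtype):
    """Check that a block fits the chosen type; raise on overflow."""
    if dtype == "int32":
        for v in block:
            if not INT32_MIN <= v <= INT32_MAX:
                raise ValueError(f"{v} does not fit in int32")
    return list(block)


def read_streaming(blocks):
    """Assumed streaming behaviour: the type is fixed by the first block."""
    dtype = narrowest_type(blocks[0])
    return dtype, [convert(b, dtype) for b in blocks]


def read_with_upcast(blocks):
    """Assumed file-reader behaviour: start narrow, widen on a miss.

    Returns the final type and how many earlier blocks had already been
    converted as int32 and would have to be revisited.
    """
    dtype, reconverted = "int32", 0
    for i, block in enumerate(blocks):
        if dtype == "int32" and narrowest_type(block) == "int64":
            dtype = "int64"
            reconverted = i  # blocks 0..i-1 were converted as int32
    return dtype, reconverted


# The first value too large for 32 bits only shows up in the third block.
blocks = [[1, 2, 3], [4, 5, 6], [7, 2**40, 9]]
```

With this input, read_with_upcast(blocks) ends on int64 after revisiting the 
two earlier blocks, while read_streaming(blocks) settles on int32 from the 
first block and raises ValueError once it reaches 2**40.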

> [C++] CSV add convert option to attempt 32bit number inferences
> ---------------------------------------------------------------
>
>                 Key: ARROW-13028
>                 URL: https://issues.apache.org/jira/browse/ARROW-13028
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nate Clark
>            Assignee: Nate Clark
>            Priority: Major
>
> When the CSV reader infers types, numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to reduce overall memory 
> use. To do this it would be useful to add an option to ConvertOptions to try 
> 32 bit numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)