lidavidm commented on pull request #10060: URL: https://github.com/apache/arrow/pull/10060#issuecomment-821323392
Note that this is somewhat of a regression for CSV files, or if you call dim.Dataset in R: we now have to scan the files instead of immediately returning NA. We do have some options:

- We could add an option to fail if a "cheap" count can't be performed, so that R could fall back to reporting NA.
- We could optimize the CSV case like the IPC and Parquet ones. This should be possible when `newlines_in_values` is not set, and needs some consideration for `ignore_empty_lines`. This may or may not actually be all that much cheaper than loading the data.

Also, I need to refactor this to pass around a ScanOptions instead of an IOContext.
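To make the second option concrete, here is a rough sketch of what a "cheap" CSV count could look like when `newlines_in_values` is not set: count line endings without parsing any values. This is not Arrow's implementation. The function name `count_csv_rows` and its `has_header` parameter are hypothetical, and it assumes `\n` or `\r\n` line endings and that the first counted line is a header.

```python
import io


def count_csv_rows(stream, *, ignore_empty_lines=True, has_header=True):
    """Count data rows in a binary CSV stream by counting lines only.

    Hypothetical helper, not an Arrow API. Only valid when no value
    contains an embedded newline (i.e. `newlines_in_values` is not set);
    otherwise a quoted multi-line value would be counted as several rows.
    """
    rows = 0
    for line in stream:  # binary streams are split on b"\n"
        # A line holding nothing but its terminator is "empty".
        if ignore_empty_lines and not line.rstrip(b"\r\n"):
            continue
        rows += 1
    # Assumption: the first counted line is the header, not data.
    if has_header and rows > 0:
        rows -= 1
    return rows


sample = io.BytesIO(b"a,b\n1,2\n\n3,4\n")
print(count_csv_rows(sample))  # prints 2
```

Even this version still reads every byte of the file, which is why it may not be much cheaper than loading the data; it only skips value parsing and array building.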
