On Monday, 8 December 2014 17:04:10 UTC, John Myles White wrote:
> * This package and the current DataFrames code both support specifying
> the types of all columns before parsing begins. There's no fast path in
> CSVReaders that uses this information to full advantage because the
> functions were designed to never fail -- instead they always enlarge
> types to ensure successful parsing. It would be good to think about how
> the library needs to be restructured to support both use cases. I
> believe the DataFrames parser will fail if the hand-specified types are
> invalidated by the data.

I agree that being permissive by default is probably a good idea, but sometimes it is nice if the parser throws an error when it finds something unexpected. This could also be useful for the "end-of-data" problem below.
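To make the distinction concrete, here is a minimal sketch of what I mean (`parse_column` is a name I made up, not the CSVReaders or DataFrames API, and the real widening logic steps through a lattice of types rather than falling straight back to strings):

```julia
# Sketch only: parse one column's raw fields either permissively
# (fall back to the raw strings on failure, i.e. enlarge the type)
# or strictly (throw as soon as the declared type is invalidated).
function parse_column(fields, T; strict::Bool=false)
    out = T[]
    for f in fields
        v = tryparse(T, f)
        if v === nothing
            strict && throw(ArgumentError("\"$f\" is not a valid $T"))
            return fields  # permissive path: keep the strings
        end
        push!(out, v)
    end
    return out
end

parse_column(["1", "2", "3"], Int)               # Vector{Int}
parse_column(["1", "2", "x"], Int)               # falls back to the strings
parse_column(["1", "2", "x"], Int, strict=true)  # throws ArgumentError
```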
> * Does the CSV standard have anything like END-OF-DATA? It's a very cool
> idea, but it seems that you'd need to introduce an arbitrary predicate
> that occurs per-row to make things work in the absence of existing
> conventions.

Well, there isn't really a standard, just this RFC:

http://www.ietf.org/rfc/rfc4180.txt

which seems to assume end-of-data = end-of-file. When I hit this problem, the files I was reading weren't actually CSV, but this:

http://lsbr.niams.nih.gov/bsoft/bsoft_param.html

which can have multiple tables per file, each ended by a blank line. I think I ended up with a hack that counted the number of lines beforehand.
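For what it's worth, treating the blank line itself as the end-of-data marker would have been cleaner than counting lines up front. A minimal sketch (`read_tables` is a made-up name, and it ignores the header/format lines that real Bsoft parameter files carry):

```julia
# Split a file into tables, treating a blank line as the
# end-of-data marker for the current table.
function read_tables(path)
    tables = Vector{Vector{String}}()  # each table is its block of lines
    current = String[]
    for line in eachline(path)
        if isempty(strip(line))
            isempty(current) || push!(tables, current)  # blank line closes a table
            current = String[]
        else
            push!(current, line)
        end
    end
    isempty(current) || push!(tables, current)  # no trailing blank line at EOF
    return tables
end
```

An arbitrary per-row predicate like the one John mentions would just replace the `isempty(strip(line))` test.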
