OK, we can change things to fail on type misspecification. There's no real standard here, but the rule "does Excel read this in a sane way?" is pretty effective for deciding what you should try to parse and when you should tell people to reformat their data.
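To make "fail on misspecification" concrete, the strict path would be something like this -- a rough sketch in current Julia syntax, not the actual CSVReaders API, and parse_strict_column is a made-up name:

    function parse_strict_column(::Type{T}, fields) where {T}
        out = Vector{T}(undef, length(fields))
        for (i, f) in enumerate(fields)
            v = tryparse(T, f)
            # Throw on the first mismatch instead of widening the column type.
            v === nothing && error("column declared $T, but row $i holds $(repr(f))")
            out[i] = v
        end
        return out
    end

    parse_strict_column(Int, ["1", "2", "3"])    # => [1, 2, 3]
    parse_strict_column(Int, ["1", "2.5", "3"])  # throws rather than widening to Float64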
Given the current infrastructure, I think the easiest way to read that data would be to split it into separate files. There are other hacks that would work, but your problem is harder than just specifying end-of-data (which can be done by reading N rows) -- it's also specifying start-of-data (which can be done by skipping M rows at the start). I've sketched the skip-M/read-N approach after the quoted message below.

 -- John

On Dec 8, 2014, at 9:24 AM, Simon Byrne <[email protected]> wrote:

> On Monday, 8 December 2014 17:04:10 UTC, John Myles White wrote:
>
> * This package and the current DataFrames code both support specifying the
> types of all columns before parsing begins. There's no fast path in
> CSVReaders that uses this information to full advantage, because the
> functions were designed never to fail -- instead they always enlarge types
> to ensure successful parsing. It would be good to think about how the
> library needs to be restructured to support both use cases. I believe the
> DataFrames parser will fail if the hand-specified types are invalidated by
> the data.
>
> I agree that being permissive by default is probably a good idea, but
> sometimes it is nice if the parser throws an error when it finds something
> unexpected. This could also be useful for the "end-of-data" problem below.
>
> * Does the CSV standard have anything like END-OF-DATA? It's a very cool
> idea, but it seems you'd need to introduce an arbitrary per-row predicate
> to make things work in the absence of existing conventions.
>
> Well, there isn't really a standard, just this RFC:
> http://www.ietf.org/rfc/rfc4180.txt
> which seems to assume end-of-data = end-of-file.
>
> When I hit this problem, the files I was reading weren't actually CSV but
> this format:
> http://lsbr.niams.nih.gov/bsoft/bsoft_param.html
> which has multiple tables per file, each ended by a blank line. I think I
> ended up devising a hack that counted the number of lines beforehand.
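Concretely, the skip-M/read-N approach would look something like this -- a sketch, assuming I'm remembering readtable's keyword names (skipstart, nrows) correctly; check ?readtable:

    using DataFrames

    # The table of interest starts after 10 junk rows and spans 25 rows;
    # everything before and after those rows is ignored.
    df = readtable("mixed.csv", skipstart = 10, nrows = 25)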

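P.S. For the multi-table files Simon mentions (each table ended by a blank line), one way to avoid counting lines beforehand is to split the raw text on blank lines and parse each chunk separately. Another rough sketch in current Julia syntax; "multi.star" is a made-up filename, and the parser call at the end is whatever you'd normally use:

    function table_chunks(path)
        text = read(path, String)
        # Blank lines (possibly containing stray whitespace) delimit the tables.
        chunks = split(text, r"\n\s*\n")
        return [String(strip(c)) for c in chunks if !isempty(strip(c))]
    end

    for (i, chunk) in enumerate(table_chunks("multi.star"))
        println("table ", i, " has ", countlines(IOBuffer(chunk)), " lines")
        # hand `chunk` (or IOBuffer(chunk)) to your table parser of choice
    end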