Thanks, Simon. In response to your comments:
* This package and the current DataFrames code both support streaming CSV files
in minibatches. It's a little hard to do this with the current DataFrames
reader, but it is possible; CSVReaders is designed to make this easier.
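To make the minibatch idea concrete, here's a rough sketch using only Base
functions; the names process_in_minibatches and handle_batch are made up for
illustration and aren't part of either API:

    # Rough sketch, not the CSVReaders API: read a file and hand rows to a
    # user-supplied callback in minibatches of a fixed size.
    function process_in_minibatches(path, handle_batch, batchsize)
        open(path) do io
            batch = Any[]
            for line in eachline(io)
                push!(batch, split(chomp(line), ','))  # naive split, ignores quoting
                if length(batch) == batchsize
                    handle_batch(batch)
                    empty!(batch)
                end
            end
            isempty(batch) || handle_batch(batch)
        end
    end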
* This package and the current DataFrames code both support specifying the
types of all columns before parsing begins. There's no fast path in CSVReaders
that uses this information to full advantage, because the functions were
designed to never fail -- instead they always enlarge types to ensure
successful parsing. It would be good to think about how the library needs to be
restructured to support both use cases. I believe the DataFrames parser will
fail if the hand-specified types are invalidated by the data.
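For contrast, here's roughly what I mean by a strict fast path -- a sketch in
0.4-style code, not something that exists in CSVReaders today: parse each
field at its declared type and fail loudly instead of widening the column type
(parse_field_strict is a made-up name):

    # Hypothetical strict parsing of one field at a user-declared type T.
    # tryparse returns a Nullable in 0.4; an empty Nullable means failure.
    function parse_field_strict{T}(::Type{T}, field, row, col)
        v = tryparse(T, field)
        isnull(v) && error("row $row, column $col: '$field' is not a valid $T")
        get(v)
    end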
* I'm hopeful that the String rewrite Stefan is involved with will make it
easier to write parser functions that take in an Array{Uint8} and return values
of type T. There's certainly no reason that CSVReaders couldn't be configured
to use other parser functions, although it might be best not to pass parsers
in as function arguments, since they might not get inlined. At the moment,
I'd prefer to see new parsers added to the default
list and therefore available to everyone. This is particularly relevant to me,
since I want to add support for reading in data from Hive tables -- which
require parsing Array and Map objects from CSV-style files.
One thing that makes parsing tricky is that type inference requires that all
parseable types be placed into a linear order: if parsing as Int fails, the
parser falls back to Float64, then Bool, then UTF8String. Coming up with a
design that handles arbitrary types in a non-linear tree, while still
supporting automatic type inference, seems tricky.
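In 0.4-style code, that linear fallback looks roughly like this (a sketch of
the idea rather than the actual CSVReaders internals; infer_field_type is a
made-up name):

    # Infer the narrowest type in the fixed fallback chain that can
    # represent a field: Int, then Float64, then Bool, then UTF8String.
    function infer_field_type(field)
        isnull(tryparse(Int, field))     || return Int
        isnull(tryparse(Float64, field)) || return Float64
        field in ("true", "false")       && return Bool
        return UTF8String
    end

A Hive Array or Map column doesn't slot naturally into that chain, which is
why a tree-shaped or user-extensible ordering seems hard to combine with
automatic inference.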
* Does the CSV standard have anything like END-OF-DATA? It's a very cool idea,
but it seems you'd need to introduce an arbitrary per-row predicate to make
things work in the absence of existing conventions.
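To illustrate what such a per-row predicate might look like -- purely a
sketch, not an existing option in either library:

    # Hypothetical per-row stopping rule: read rows until the predicate
    # fires, e.g. read_until(io, isempty) stops at the first blank line.
    function read_until(io, isdone)
        rows = Any[]
        for line in eachline(io)
            line = chomp(line)
            isdone(line) && break
            push!(rows, split(line, ','))  # naive split, ignores quoting
        end
        rows
    end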
-- John
On Dec 8, 2014, at 8:51 AM, Simon Byrne <[email protected]> wrote:
> Very nice. I was thinking about this recently when I came across the rust csv
> library:
> http://burntsushi.net/rustdoc/csv/
>
> It had a few neat features that I thought were useful:
> * the ability to iterate by row, without saving the entire table to an object
> first (i.e. like awk)
> * the option to specify the type of each column (to improve performance)
>
> Some other things I've often wished for in CSV libraries:
> * be able to specify arbitrary functions for mapping a string to a data type
> (e.g. strip out currency symbols, fix funny formatting, etc.)
> * be able to specify an "end-of-data" rule, other than end-of-file or number
> of lines (e.g. stop on an empty line)
>
> s
>
> On Monday, 8 December 2014 05:35:02 UTC, John Myles White wrote:
> Over the last month or so, I've been slowly working on a new library that
> defines an abstract toolkit for writing CSV parsers. The goal is to provide
> an abstract interface that users can implement in order to provide functions
> for reading data into their preferred data structures from CSV files. In
> principle, this approach should allow us to unify the code behind Base's
> readcsv and DataFrames's readtable functions.
>
> The library is still very much a work-in-progress, but I wanted to let others
> see what I've done so that I can start getting feedback on the design.
>
> Because the library makes heavy use of Nullables, you can only try it out
> on Julia 0.4. If you're interested, it's available at
> https://github.com/johnmyleswhite/CSVReaders.jl
>
> For now, I've intentionally given very sparse documentation to discourage
> people from seriously using the library before it's officially released. But
> there are some examples in the README that should make clear how the library
> is intended to be used.
>
> -- John
>