Thanks, Simon. In response to your comments:

* This package and the current DataFrames code both support streaming CSV files 
in minibatches. It's a little awkward to do with the current DataFrames reader, 
but it is possible; CSVReaders is designed to make this easier.
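
For concreteness, the sort of minibatch loop I have in mind looks roughly like 
the sketch below in plain Julia. None of these names are the CSVReaders API, 
and converting fields into typed columns is left out entirely:

    # Sketch only: stream a CSV in minibatches of `batchsize` rows using
    # nothing but Base functions.
    function each_minibatch(f, path, batchsize)
        open(path) do io
            batch = Any[]
            for line in eachline(io)
                push!(batch, split(line, ","))
                if length(batch) == batchsize
                    f(batch)            # hand the full minibatch to the caller
                    empty!(batch)
                end
            end
            isempty(batch) || f(batch)  # flush the final partial batch
        end
    end

Calling something like each_minibatch(println, "data.csv", 1000) would then 
visit the file a thousand rows at a time without ever holding the whole table 
in memory.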

* This package and the current DataFrames code both support specifying the 
types of all columns before parsing begins. There's no fast path in CSVReaders 
that uses this information to full advantage because the functions were 
designed to never fail -- instead they always enlarge types to ensure 
successful parsing. It would be good to think about how the library needs to be 
restructured to support both use cases. I believe the DataFrames parser will 
fail if the hand-specified types are invalidated by the data.
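
To make that trade-off concrete, the fast path I have in mind is roughly the 
following sketch. The function name is invented; only parse comes from Base:

    # Sketch of a "trust the declared type" fast path: every field in the
    # column is parsed as the user-specified type T, and parse throws as
    # soon as the data contradicts that declaration.
    function parse_column(fields, T)
        out = T[]
        for field in fields
            push!(out, parse(T, field))
        end
        return out
    end

So parse_column(["1", "2", "3"], Int) gives an Int vector, while 
parse_column(["1", "oops"], Int) throws -- which is roughly the failure mode 
I'd expect from the hand-specified-types path, whereas the never-fail approach 
would instead widen the column to strings.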

* I'm hopeful that the String rewrite Stefan is involved with will make it 
easier to write parser functions that take in an Array{Uint8} and return values 
of type T. There's certainly no reason that CSVReaders couldn't be configured 
to use other parser functions, although it might be best not to pass parsing 
functions in as function arguments since the parsing functions might not get 
inlined. At the moment, I'd prefer to see new parsers added to the default 
list so that they're available to everyone. This is particularly relevant to me, 
since I want to add support for reading in data from Hive tables -- which 
require parsing Array and Map objects from CSV-style files.
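
As a toy example of the extra parsers I mean for the Hive case, field-level 
parsers for array and map columns might look like the sketch below. The '|' 
element delimiter, the ':' key-value separator, and the Int element type are 
all assumptions made purely for illustration:

    # Toy parsers for Hive-style collection fields such as "1|2|3" (array)
    # and "a:1|b:2" (map); delimiters and element types are assumed here.
    parse_array_field(field) = [parse(Int, part) for part in split(field, "|")]

    function parse_map_field(field)
        d = Dict()
        for pair in split(field, "|")
            k, v = split(pair, ":")
            d[k] = parse(Int, v)
        end
        return d
    end

The interesting design question is how parsers like these get registered with 
the reader rather than passed in as function arguments, for the inlining 
reason above.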

One thing that makes parsing tricky is that type inference requires that all 
parseable types be placed into a linear order: if parsing as Int fails, the 
parser falls back to Float64, then Bool, then UTF8String. Coming up with a 
design that handles arbitrary types in a non-linear tree, while still 
supporting automatic type inference, seems tricky.
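
In code, the linear fallback amounts to something like this per-field sketch. 
(The real problem is harder because inference has to widen an entire column 
consistently across rows, but the fixed ordering is the point here.)

    # Sketch of the linear fallback order for a single field: try each type
    # in a fixed sequence and keep the raw string as the last resort.
    function infer_field(field)
        for T in (Int, Float64, Bool)
            try
                return parse(T, field)
            catch
            end
        end
        return field
    end

A non-linear design would have to replace that fixed tuple with a tree or 
partial order over types, which is exactly where it gets tricky.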

* Does the CSV standard have anything like END-OF-DATA? It's a very cool idea, 
but it seems that you'd need to introduce an arbitrary per-row predicate to 
make things work in the absence of existing conventions.
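
Absent an existing convention, the per-row predicate idea would look roughly 
like this (sketch only; the names are invented and a real reader would also 
have to parse the fields it keeps):

    # Sketch of an arbitrary "end of data" rule: keep reading rows until a
    # user-supplied predicate fires, e.g. on the first blank line.
    function read_until(io, isdone)
        rows = Any[]
        for line in eachline(io)
            isdone(line) && break
            push!(rows, split(line, ","))
        end
        return rows
    end

Something like read_until(io, line -> isempty(strip(line))) would then 
implement the stop-on-an-empty-line example.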

 -- John

On Dec 8, 2014, at 8:51 AM, Simon Byrne <[email protected]> wrote:

> Very nice. I was thinking about this recently when I came across the rust csv 
> library:
> http://burntsushi.net/rustdoc/csv/
> 
> It had a few neat features that I thought were useful:
> * the ability to iterate by row, without saving the entire table to an object 
> first (i.e. like awk)
> * the option to specify the type of each column (to improve performance)
> 
> Some other things I've often wished for in CSV libraries:
> * be able to specify arbitrary functions for mapping a string to a data type 
> (e.g. strip out currency symbols, fix funny formatting, etc.)
> * be able to specify a "end of data" rule, other than end-of-file or number 
> of lines (e.g. stop on an empty line)
> 
> s
> 
> On Monday, 8 December 2014 05:35:02 UTC, John Myles White wrote:
> Over the last month or so, I've been slowly working on a new library that 
> defines an abstract toolkit for writing CSV parsers. The goal is to provide 
> an abstract interface that users can implement in order to provide functions 
> for reading data into their preferred data structures from CSV files. In 
> principle, this approach should allow us to unify the code behind Base's 
> readcsv and DataFrames's readtable functions.
> 
> The library is still very much a work-in-progress, but I wanted to let others 
> see what I've done so that I can start getting feedback on the design.
> 
> Because the library makes heavy use of Nullables, you can only try out the 
> library on Julia 0.4. If you're interested, it's available at 
> https://github.com/johnmyleswhite/CSVReaders.jl
> 
> For now, I've intentionally given very sparse documentation to discourage 
> people from seriously using the library before it's officially released. But 
> there are some examples in the README that should make clear how the library 
> is intended to be used.
> 
>  -- John
> 
