Iain, I didn't implement that function because it's pretty wasteful of memory. Instead, there's a non-public function, readnrows(), that I'll make public; it lets you do incremental reading. The thing to keep in mind is that incremental reading is tricky: you need to deal with the header and other edge cases (like what happens if you hit a blank row). That's why I've kept that functionality private so far.
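
To give a feel for it, incremental reading would end up looking roughly like the sketch below. This is only a sketch: the readnrows() signature shown here is a guess at what a public version might look like, and CSVReader/process are placeholder names, not part of the current API.

    # Sketch only: the signatures and type names below are placeholders,
    # not the actual CSVReaders API.
    io = open("data.csv", "r")
    reader = CSVReader(io)        # hypothetical reader state; consumes the header once
    while !eof(io)
        chunk = readnrows(reader, 1000)   # read up to 1000 rows, skipping blank rows
        isempty(chunk) && break
        process(chunk)                    # user-supplied per-chunk function
    end
    close(io)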
-- John

On Dec 8, 2014, at 1:42 PM, Iain Dunning <[email protected]> wrote:

> Tried it out (built Julia 0.4 just to do it!), made a CSV-to-JSON type thing:
>
> https://github.com/johnmyleswhite/CSVReaders.jl/issues/1
>
> Quite excited about this - I find myself writing code that basically mangles
> a row into a type pretty often. In fact, 90% of my needs would be satisfied
> by a variant of readall that takes a type, reads a row, and calls a function
> like
>
>     function readrow(::Type{T}, values::Vector{Any})
>         # ...
>         return T(...)
>     end
>
> and returns a Vector{T}.
>
> Not sure how that fits in with the design of this.
>
> Cheers,
> Iain
>
> On Monday, December 8, 2014 1:29:46 PM UTC-5, Tim Holy wrote:
> Right, indeed I meant to suggest making the conversion to matrix form the
> very last step of the process. But obviously you didn't need that
> suggestion :-).
>
> --Tim
>
> On Monday, December 08, 2014 10:20:00 AM John Myles White wrote:
> > Looking at this again, the problem with doing reshape/transpose is that
> > it's very awkward when trying to read data in a stream, since you need to
> > undo the reshape and transpose before starting to read from the stream
> > again. I think the best solution to getting a row-major matrix of data is
> > to add a wrapper around the readall method from this package that handles
> > the final reshape and transpose operations when you're not reading in
> > streaming data.
> >
> > -- John
> >
> > On Dec 8, 2014, at 9:25 AM, Tim Holy <[email protected]> wrote:
> > > Does the reshape/transpose really take any appreciable time (compared
> > > to the I/O)?
> > >
> > > --Tim
> > >
> > > On Monday, December 08, 2014 09:14:35 AM John Myles White wrote:
> > > > Yes, this is how I've been doing things so far.
> > > >
> > > > -- John
> > > >
> > > > On Dec 8, 2014, at 9:12 AM, Tim Holy <[email protected]> wrote:
> > > > > My suspicion is you should read into a 1d vector (and use
> > > > > `append!`), then at the end do a reshape and finally a transpose.
> > > > > I bet that will be many times faster than any other alternative,
> > > > > because we have a really fast transpose now.
> > > > >
> > > > > The only disadvantage I see is taking twice as much memory as would
> > > > > be minimally needed. (This can be fixed once we have row-major
> > > > > arrays.)
> > > > >
> > > > > --Tim
> > > > >
> > > > > On Monday, December 08, 2014 08:38:06 AM John Myles White wrote:
> > > > > > I believe/hope the proposed solution will work for most cases,
> > > > > > although there's still a bunch of performance work left to be
> > > > > > done. I think the decoupling problem isn't as hard as it might
> > > > > > seem since there are very clearly distinct stages in parsing a
> > > > > > CSV file. But we'll find out if the indirection I've introduced
> > > > > > causes performance problems when things can't be inlined.
> > > > > >
> > > > > > While writing this package, I found the two most challenging
> > > > > > problems to be:
> > > > > >
> > > > > > (A) The disconnect between CSV files providing one row at a time
> > > > > > and Julia's usage of column major arrays, which encourage reading
> > > > > > one column at a time.
> > > > > > (B) The inability to easily resize! a matrix.
> > > > > >
> > > > > > -- John
> > > > > >
> > > > > > On Dec 8, 2014, at 5:16 AM, Stefan Karpinski
> > > > > > <[email protected]> wrote:
> > > > > > Doh. Obfuscate the code quick, before anyone uses it! This is
> > > > > > very nice and something I've always felt like we need for data
> > > > > > formats like CSV – a way of decoupling the parsing of the format
> > > > > > from the populating of a data structure with that data. It's a
> > > > > > tough problem.
> > > > > >
> > > > > > On Mon, Dec 8, 2014 at 8:08 AM, Tom Short <[email protected]>
> > > > > > wrote:
> > > > > > Exciting, John! Although your documentation may be "very sparse",
> > > > > > the code is nicely documented.
> > > > > >
> > > > > > On Mon, Dec 8, 2014 at 12:35 AM, John Myles White
> > > > > > <[email protected]> wrote:
> > > > > > Over the last month or so, I've been slowly working on a new
> > > > > > library that defines an abstract toolkit for writing CSV parsers.
> > > > > > The goal is to provide an abstract interface that users can
> > > > > > implement in order to provide functions for reading data into
> > > > > > their preferred data structures from CSV files. In principle,
> > > > > > this approach should allow us to unify the code behind Base's
> > > > > > readcsv and DataFrames's readtable functions.
> > > > > >
> > > > > > The library is still very much a work-in-progress, but I wanted
> > > > > > to let others see what I've done so that I can start getting
> > > > > > feedback on the design.
> > > > > >
> > > > > > Because the library makes heavy use of Nullables, you can only
> > > > > > try out the library on Julia 0.4. If you're interested, it's
> > > > > > available at https://github.com/johnmyleswhite/CSVReaders.jl
> > > > > >
> > > > > > For now, I've intentionally given very sparse documentation to
> > > > > > discourage people from seriously using the library before it's
> > > > > > officially released. But there are some examples in the README
> > > > > > that should make clear how the library is intended to be used.
> > > > > >
> > > > > > -- John
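
P.S. For anyone skimming the thread above: Tim's append!-then-reshape-then-transpose idea looks, in rough outline, like the sketch below. It's only a sketch under simplifying assumptions (an all-numeric CSV, no quoting, no missing values, no header); parsefield and read_row_major are illustrative names, not part of CSVReaders.

    # Rough sketch (not CSVReaders code): read every row into a flat 1-d
    # buffer, then reshape and transpose once at the end.
    # `parsefield` is a stand-in for real field parsing.
    parsefield(s) = parse(Float64, s)   # assumes an all-numeric file

    function read_row_major(io::IO)
        buffer = Float64[]
        ncols = 0
        for line in eachline(io)
            fields = split(chomp(line), ',')
            ncols = length(fields)
            append!(buffer, map(parsefield, fields))   # grow the flat vector row by row
        end
        ncols == 0 && return zeros(Float64, 0, 0)
        nrows = div(length(buffer), ncols)
        # The buffer was filled row by row, so reshape to (ncols, nrows) and
        # transpose to get the usual nrows-by-ncols matrix. The transpose copy
        # is the factor-of-two memory cost mentioned above.
        return transpose(reshape(buffer, ncols, nrows))
    end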
