Right, indeed I meant to suggest making the conversion to matrix form the very last step of the process. But obviously you didn't need that suggestion :-).
--Tim

On Monday, December 08, 2014 10:20:00 AM John Myles White wrote:
> Looking at this again, the problem with doing reshape/transpose is that
> it's very awkward when trying to read data in a stream, since you need to
> undo the reshape and transpose before starting to read from the stream
> again. I think the best solution to getting a row-major matrix of data is
> to add a wrapper around the readall method from this package that handles
> the final reshape and transpose operations when you're not reading in
> streaming data.
>
> -- John
>
> On Dec 8, 2014, at 9:25 AM, Tim Holy <[email protected]> wrote:
> > Does the reshape/transpose really take any appreciable time (compared to
> > the I/O)?
> >
> > --Tim
> >
> > On Monday, December 08, 2014 09:14:35 AM John Myles White wrote:
> >> Yes, this is how I've been doing things so far.
> >>
> >> -- John
> >>
> >> On Dec 8, 2014, at 9:12 AM, Tim Holy <[email protected]> wrote:
> >>> My suspicion is you should read into a 1d vector (and use `append!`),
> >>> then at the end do a reshape and finally a transpose. I bet that will
> >>> be many times faster than any other alternative, because we have a
> >>> really fast transpose now.
> >>>
> >>> The only disadvantage I see is taking twice as much memory as would be
> >>> minimally needed. (This can be fixed once we have row-major arrays.)
> >>>
> >>> --Tim
> >>>
> >>> On Monday, December 08, 2014 08:38:06 AM John Myles White wrote:
> >>>> I believe/hope the proposed solution will work for most cases,
> >>>> although there's still a bunch of performance work left to be done. I
> >>>> think the decoupling problem isn't as hard as it might seem since
> >>>> there are very clearly distinct stages in parsing a CSV file. But
> >>>> we'll find out if the indirection I've introduced causes performance
> >>>> problems when things can't be inlined.
> >>>>
> >>>> While writing this package, I found the two most challenging problems
> >>>> to be:
> >>>>
> >>>> (A) The disconnect between CSV files providing one row at a time and
> >>>>     Julia's usage of column major arrays, which encourage reading one
> >>>>     column at a time.
> >>>> (B) The inability to easily resize! a matrix.
> >>>>
> >>>> -- John
> >>>>
> >>>> On Dec 8, 2014, at 5:16 AM, Stefan Karpinski <[email protected]> wrote:
> >>>>> Doh. Obfuscate the code quick, before anyone uses it! This is very
> >>>>> nice and something I've always felt like we need for data formats
> >>>>> like CSV – a way of decoupling the parsing of the format from the
> >>>>> populating of a data structure with that data. It's a tough problem.
> >>>>>
> >>>>> On Mon, Dec 8, 2014 at 8:08 AM, Tom Short <[email protected]> wrote:
> >>>>> Exciting, John! Although your documentation may be "very sparse",
> >>>>> the code is nicely documented.
> >>>>>
> >>>>> On Mon, Dec 8, 2014 at 12:35 AM, John Myles White
> >>>>> <[email protected]> wrote:
> >>>>> Over the last month or so, I've been slowly working on a new library
> >>>>> that defines an abstract toolkit for writing CSV parsers. The goal
> >>>>> is to provide an abstract interface that users can implement in
> >>>>> order to provide functions for reading data into their preferred
> >>>>> data structures from CSV files. In principle, this approach should
> >>>>> allow us to unify the code behind Base's readcsv and DataFrames's
> >>>>> readtable functions.
> >>>>>
> >>>>> The library is still very much a work-in-progress, but I wanted to
> >>>>> let others see what I've done so that I can start getting feedback
> >>>>> on the design.
> >>>>>
> >>>>> Because the library makes heavy use of Nullables, you can only try
> >>>>> out the library on Julia 0.4. If you're interested, it's available
> >>>>> at https://github.com/johnmyleswhite/CSVReaders.jl
> >>>>>
> >>>>> For now, I've intentionally given very sparse documentation to
> >>>>> discourage people from seriously using the library before it's
> >>>>> officially released. But there are some examples in the README that
> >>>>> should make clear how the library is intended to be used.
> >>>>>
> >>>>> -- John
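
For reference, the approach discussed in this thread (append each row to a 1d vector, then reshape and transpose once as the very last step) could look roughly like the sketch below. It is only an illustration and is not taken from CSVReaders.jl: the function name `read_numeric_csv` and its `delim` keyword are hypothetical, it assumes purely numeric, unquoted fields, and it is written for current Julia (1.x) rather than the 0.4 of the thread, so it uses `permutedims` to get a materialized transpose.

```julia
# Hypothetical helper (not part of CSVReaders.jl): read numeric CSV text
# row by row into a flat buffer, then convert to a matrix at the very end.
function read_numeric_csv(io::IO; delim::Char = ',')
    values = Float64[]   # flat 1d buffer; grows by one CSV row at a time
    ncols = 0            # number of fields per row, fixed by the first row
    for line in eachline(io)
        isempty(strip(line)) && continue   # skip blank lines
        fields = split(line, delim)
        if ncols == 0
            ncols = length(fields)
        elseif length(fields) != ncols
            error("expected $ncols fields, got $(length(fields))")
        end
        append!(values, parse.(Float64, fields))
    end
    ncols == 0 && return Matrix{Float64}(undef, 0, 0)
    nrows = div(length(values), ncols)
    # After reshape, each CSV row sits in one *column* (column-major order);
    # permutedims makes the transposed copy so rows of the matrix are CSV rows.
    return permutedims(reshape(values, ncols, nrows))
end

data = read_numeric_csv(IOBuffer("1,2,3\n4,5,6\n"))
```

Because the buffer stays a plain vector until the final line, a streaming reader could keep calling `append!` on it between reads and only do the reshape and transpose once it is finished, at the cost of the temporary extra copy mentioned above.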
