Re: [julia-users] [WIP] CSVReaders.jl

John Myles White Mon, 08 Dec 2014 10:20:18 -0800

Looking at this again, the problem with doing reshape/transpose is that it's 
very awkward when trying to read data in a stream, since you need to undo the 
reshape and transpose before starting to read from the stream again. I think 
the best solution to getting a row-major matrix of data is to add a wrapper 
around the readall method from this package that handles the final reshape and 
transpose operations when you're not reading in streaming data.
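
A minimal sketch of that split, written for current Julia rather than the 0.4 of this thread. The names `append_row!`, `finish_matrix` and `read_matrix` are hypothetical and are not part of CSVReaders.jl, and the field parsing is deliberately naive (comma-separated floats, no quoting). While streaming, the buffer stays a flat vector; the reshape and transpose happen once, after the stream is exhausted.

```julia
# Hypothetical helpers for illustration; none of these come from CSVReaders.jl.

# Streaming stage: append the fields of one CSV line to a flat 1-d buffer.
# Returns the number of fields found on that line.
function append_row!(buffer::Vector{Float64}, line::AbstractString)
    fields = split(line, ',')
    for field in fields
        push!(buffer, parse(Float64, field))
    end
    return length(fields)
end

# Final stage: the buffer holds the values row by row, so reshaping to
# (ncols, nrows) puts each CSV row in one column of a column-major array;
# permutedims then yields the usual (nrows, ncols) matrix.
function finish_matrix(buffer::Vector{Float64}, ncols::Integer)
    nrows = div(length(buffer), ncols)
    return permutedims(reshape(buffer, ncols, nrows))
end

# Non-streaming wrapper: consume the whole stream, then reshape and
# transpose exactly once at the end.
function read_matrix(io::IO)
    buffer = Float64[]
    ncols = 0
    for line in eachline(io)
        isempty(line) && continue
        ncols = append_row!(buffer, line)
    end
    ncols == 0 && return Matrix{Float64}(undef, 0, 0)
    return finish_matrix(buffer, ncols)
end

m = read_matrix(IOBuffer("1,2,3\n4,5,6\n"))
# m == [1.0 2.0 3.0; 4.0 5.0 6.0]
```

A streaming caller would use only `append_row!` and keep the flat buffer, so there is nothing to undo between reads.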


 -- John

On Dec 8, 2014, at 9:25 AM, Tim Holy <[email protected]> wrote:

> Does the reshape/transpose really take any appreciable time (compared to the 
> I/O)?
> 
> --Tim
> 
> On Monday, December 08, 2014 09:14:35 AM John Myles White wrote:
>> Yes, this is how I've been doing things so far.
>> 
>> -- John
>> 
>> On Dec 8, 2014, at 9:12 AM, Tim Holy <[email protected]> wrote:
>>> My suspicion is you should read into a 1d vector (and use `append!`), then
>>> at the end do a reshape and finally a transpose. I bet that will be many
>>> times faster than any other alternative, because we have a really fast
>>> transpose now.
>>> 
>>> The only disadvantage I see is taking twice as much memory as would be
>>> minimally needed. (This can be fixed once we have row-major arrays.)
>>> 
>>> --Tim
>>> 
>>> On Monday, December 08, 2014 08:38:06 AM John Myles White wrote:
>>>> I believe/hope the proposed solution will work for most cases, although
>>>> there's still a bunch of performance work left to be done. I think the
>>>> decoupling problem isn't as hard as it might seem since there are very
>>>> clearly distinct stages in parsing a CSV file. But we'll find out if the
>>>> indirection I've introduced causes performance problems when things can't
>>>> be inlined.
>>>> 
>>>> While writing this package, I found the two most challenging problems to
>>>> be:
>>>> 
>>>> (A) The disconnect between CSV files providing one row at a time and
>>>> Julia's usage of column major arrays, which encourage reading one column
>>>> at a time.
>>>> (B) The inability to easily resize! a matrix.
>>>> 
>>>> -- John
>>>> 
>>>> On Dec 8, 2014, at 5:16 AM, Stefan Karpinski <[email protected]> wrote:
>>>>> Doh. Obfuscate the code quick, before anyone uses it! This is very nice
>>>>> and something I've always felt like we need for data formats like CSV –
>>>>> a way of decoupling the parsing of the format from the populating of a
>>>>> data structure with that data. It's a tough problem.
>>>>> 
>>>>> On Mon, Dec 8, 2014 at 8:08 AM, Tom Short <[email protected]> wrote:
>>>>> Exciting, John! Although your documentation may be "very sparse", the
>>>>> code is nicely documented.
>>>>> 
>>>>> On Mon, Dec 8, 2014 at 12:35 AM, John Myles White <[email protected]> wrote:
>>>>> Over the last month or so, I've been slowly working on a new library
>>>>> that defines an abstract toolkit for writing CSV parsers. The goal is
>>>>> to provide an abstract interface that users can implement in order to
>>>>> provide functions for reading data into their preferred data structures
>>>>> from CSV files. In principle, this approach should allow us to unify
>>>>> the code behind Base's readcsv and DataFrames's readtable functions.
>>>>> 
>>>>> The library is still very much a work-in-progress, but I wanted to let
>>>>> others see what I've done so that I can start getting feedback on the
>>>>> design.
>>>>> 
>>>>> Because the library makes heavy use of Nullables, you can only try out
>>>>> the library on Julia 0.4. If you're interested, it's available at
>>>>> https://github.com/johnmyleswhite/CSVReaders.jl
>>>>> 
>>>>> For now, I've intentionally given very sparse documentation to
>>>>> discourage people from seriously using the library before it's
>>>>> officially released. But there are some examples in the README that
>>>>> should make clear how the library is intended to be used.
>>>>> 
>>>>> -- John