Re: [julia-users] Dataframe readtable change?

John Myles White Thu, 22 May 2014 13:17:01 -0700

The original change that summarized large DataFrames was introduced by Julia 
Evans and brought us closer into sync with pandas. I've been really happy with 
it.


Regarding the old way of doing things, I think you should revert to the old 
display rules for a while and try them again before making up your mind about 
your preferences. The old display rule was completely illegible for almost 
every data set that is currently being summarized. And I mean completely 
illegible, not just ugly.

One change to formatting that I'd be happy with would be to default to showing 
the output of show(df, true) for all tables and never showing the column 
summaries unless explicitly requested. It seems like this default is the thing 
people most strongly dislike.

We could remove the ASCII chrome, but I think the chrome is a good idea. MySQL, 
Hive and Presto all use the same kind of explicit tabular structure when 
rendering tables. I think making DataFrames behave more like traditional 
databases is a good thing since it encourages people not to think of them as if 
they were matrices.

The padding also makes it much easier to copy-and-paste tables since they're 
valid Markdown tables that any Markdown renderer can easily convert into TeX, 
HTML, etc.
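
To make the Markdown point concrete, here is a rough sketch of how a padded 
pipe table can be produced from a header and a list of rows. It uses only Base 
Julia. The name markdown_table is a hypothetical helper for illustration; it is 
not part of DataFrames and is not how the package's show method is implemented.

```julia
# Hypothetical helper (not a DataFrames function): render a header and rows
# of strings as a padded Markdown pipe table.
function markdown_table(header, rows)
    # Each column is as wide as its longest cell, and at least 3 characters
    # so the separator rule always has room for "---".
    widths = [max(3, length(h)) for h in header]
    for row in rows, j in eachindex(header)
        widths[j] = max(widths[j], length(row[j]))
    end

    # Pad every cell to its column width and wrap the row in pipes.
    fmt(cells) = "| " * join([rpad(cells[j], widths[j]) for j in eachindex(header)], " | ") * " |"

    # Markdown expects the dashed rule directly below the header row.
    rule = "|" * join(["-"^(w + 2) for w in widths], "|") * "|"

    return join(vcat([fmt(header), rule], [fmt(r) for r in rows]), "\n")
end

println(markdown_table(["Name", "Eltype"],
                       [["lp__", "Float64"], ["treedepth__", "Int64"]]))
```

Because every row is padded to the same width, the output lines up in a 
terminal and is still a table that a Markdown renderer will accept.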

 -- John

On May 22, 2014, at 1:02 PM, Stefan Karpinski <[email protected]> wrote:

> For what it's worth, I was much happier when dataframes showed their contents 
> rather than a summary. I must have missed the discussion where that decision 
> was made (ditto for all the extra ASCII chrome when displaying data frames 
> these days).
> 
> 
> On Thu, May 22, 2014 at 3:01 PM, John Myles White <[email protected]> 
> wrote:
> Nobody had time to integrate it anywhere. A pull request would help move 
> things forward.
> 
>  -- John
> 
> On May 22, 2014, at 11:57 AM, Bob Nnamtrop <[email protected]> wrote:
> 
>> OK. Thanks. That is helpful.
>> 
>> Any reason why that page is not shown in the documentation given in the link 
>> on the front page?
>> 
>> 
>> On Thu, May 22, 2014 at 11:46 AM, John Myles White 
>> <[email protected]> wrote:
>> head and tail don't actually print anything: they just give you a subset of 
>> a DataFrame. So you're seeing the usual show method's output, which can be 
>> overridden by explicitly requesting that you see the whole DataFrame. See
>> 
>> https://github.com/JuliaStats/DataFrames.jl/blob/master/spec/show.md
>> 
>>  -- John
>> 
>> On May 22, 2014, at 10:44 AM, Bob Nnamtrop <[email protected]> wrote:
>> 
>>> An issue I noticed with DataFrames recently is that head(df) and tail(df) 
>>> both print the show(df) summary (like those above) instead of listing the 
>>> top and bottom of the DataFrame. I just started using DataFrames so I have 
>>> no idea what they did in the past, but it seems they should list the df and 
>>> not the summary.
>>> 
>>> Also, are there any other handy ways to list the df in the repl?
>>> 
>>> Bob
>>> 
>>> 
>>> On Thu, May 22, 2014 at 11:39 AM, Rob J. Goedman <[email protected]> wrote:
>>> Thanks John.
>>> 
>>> I should have filed it as an issue on DataFrames.jl but initially thought 
>>> it could be deeper than that.
>>> 
>>> For now in Stan.jl I've included a 'small' cleanup step. Small for say 1000 
>>> samples, a bit bigger for 100000 samples.
>>> 
>>> Like you mentioned earlier, for years I've been using 
>>> file-out-file-in-communication for Jags and other programs (Finite 
>>> Elements) and was quite ok with it because sampling and FE iterations 
>>> dominated the time to complete.
>>> 
>>> FOFI really only became an issue when I had to adjust values in between 
>>> each of hundreds of runs (e.g. a stiffness matrix in FEM when dealing with 
>>> buckling).
>>> 
>>> Rob J. Goedman
>>> [email protected]
>>> 
>>> 
>>> 
>>> 
>>> On May 22, 2014, at 10:16 AM, John Myles White <[email protected]> 
>>> wrote:
>>> 
>>>> I need to find time to look into this, but could someone try a git bisect 
>>>> and see if some of the metaprogramming changes we made to readtable caused 
>>>> this? It might be that this file would have never worked, but if it once 
>>>> did, it would be good to point out the problematic code.
>>>> 
>>>>  — John
>>>> 
>>>> On May 20, 2014, at 7:53 PM, Rob J. Goedman <[email protected]> wrote:
>>>> 
>>>>> Actually, another way to make it work is to remove the blank line. The 
>>>>> little program below shows that readtable() accepts test_df1 and test_df2, 
>>>>> but fails on test_df3.
>>>>> 
>>>>> Also, the fact that it started to happen today had nothing to do with 
>>>>> Julia or DataFrames updates. The file is created by Stan and the latest 
>>>>> version inserts that blank line.
>>>>> 
>>>>> Of course I could clean up the file, but maybe this is an issue in 
>>>>> the DataFrames readtable function?
>>>>> 
>>>>> Apologies for the earlier incomplete report.
>>>>> 
>>>>> Rob J. Goedman
>>>>> [email protected]
>>>>> 
>>>>> 
>>>>> <test_df.jl><test_df1.csv>
>>>>> <test_df2.csv>
>>>>> <test_df3.csv>
>>>>> 
>>>>> 
>>>>> julia> include("/Users/rob/.julia/v0.3/MCMCExampleRepository/test/test_df.jl")
>>>>> 4x10 DataFrame
>>>>> |-------|---------------|---------|---------|
>>>>> | Col # | Name          | Eltype  | Missing |
>>>>> | 1     | lp__          | Float64 | 0       |
>>>>> | 2     | accept_stat__ | Float64 | 0       |
>>>>> | 3     | stepsize__    | Float64 | 0       |
>>>>> | 4     | treedepth__   | Int64   | 0       |
>>>>> | 5     | n_leapfrog__  | Int64   | 0       |
>>>>> | 6     | n_divergent__ | Int64   | 0       |
>>>>> | 7     | beta_1        | Float64 | 0       |
>>>>> | 8     | beta_2        | Float64 | 0       |
>>>>> | 9     | beta_3        | Float64 | 0       |
>>>>> | 10    | sigma         | Float64 | 0       |
>>>>> 
>>>>> 4x10 DataFrame
>>>>> |-------|---------------|---------|---------|
>>>>> | Col # | Name          | Eltype  | Missing |
>>>>> | 1     | lp__          | Float64 | 0       |
>>>>> | 2     | accept_stat__ | Float64 | 0       |
>>>>> | 3     | stepsize__    | Float64 | 0       |
>>>>> | 4     | treedepth__   | Int64   | 0       |
>>>>> | 5     | n_leapfrog__  | Int64   | 0       |
>>>>> | 6     | n_divergent__ | Int64   | 0       |
>>>>> | 7     | beta_1        | Float64 | 0       |
>>>>> | 8     | beta_2        | Float64 | 0       |
>>>>> | 9     | beta_3        | Float64 | 0       |
>>>>> | 10    | sigma         | Float64 | 0       |
>>>>> 
>>>>> ERROR: BoundsError()
>>>>>  in findcorruption at /Users/rob/.julia/v0.3/DataFrames/src/dataframe/io.jl:663
>>>>>  in readtable! at /Users/rob/.julia/v0.3/DataFrames/src/dataframe/io.jl:731
>>>>>  in readtable at /Users/rob/.julia/v0.3/DataFrames/src/dataframe/io.jl:812
>>>>>  in readtable at /Users/rob/.julia/v0.3/DataFrames/src/dataframe/io.jl:879
>>>>>  in include at boot.jl:244
>>>>> while loading /Users/rob/.julia/v0.3/MCMCExampleRepository/test/test_df.jl, in expression starting on line 11
>>>>> 
>>>>> julia> 
>>>>> 
>>>>> 
>>>>> On May 20, 2014, at 6:36 PM, Rob J. Goedman <[email protected]> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Using a freshly updated Version 0.3.0-prerelease+3251 (2014-05-20 23:18 
>>>>>> UTC) of Julia I think I noticed a different behavior of readtable(), 
>>>>>> which I hope is not intended.
>>>>>> 
>>>>>> I have a small test file with data as shown below (and attached as a 
>>>>>> file at the end of the email):
>>>>>> 
>>>>>> lp__,accept_stat__,stepsize__,treedepth__,n_leapfrog__,n_divergent__,mu
>>>>>> # Adaptation terminated
>>>>>> 
>>>>>> -19.8871,0.975123,0.303529,4,15,0,4.25051
>>>>>> -22.1208,0.971631,0.303529,3,7,0,8.55276
>>>>>> -23.8336,0.857954,0.303529,4,15,0,4.41087
>>>>>> 
>>>>>> If I remove the commented line ("# Adaptation terminated"), readtable() 
>>>>>> has no problem, but if it's there readtable() seems to ignore the 
>>>>>> 'allowcomments=true'.
>>>>>> 
>>>>>> I didn't update DataFrames as far as I am aware, but once or twice today 
>>>>>> I did pull Julia's master from github.
>>>>>> 
>>>>>> I wonder if someone could try this example. Thanks a lot.
>>>>>> 
>>>>>> Rob J. Goedman
>>>>>> [email protected]
>>>>>> 
>>>>>> 
>>>>>> julia> df = readtable("schools8_samples.csv", allowcomments=true)
>>>>>> ERROR: Saw 4 rows, 5 columns and 22 fields
>>>>>>  * Line 1 has 3 columns
>>>>>> 
>>>>>>  in error at error.jl:21
>>>>>>  in findcorruption at /Users/rob/.julia/v0.3/DataFrames/src/dataframe/io.jl:680
>>>>>>  in readtable! at /Users/rob/.julia/v0.3/DataFrames/src/dataframe/io.jl:731
>>>>>>  in readtable at /Users/rob/.julia/v0.3/DataFrames/src/dataframe/io.jl:812
>>>>>>  in readtable at /Users/rob/.julia/v0.3/DataFrames/src/dataframe/io.jl:879
>>>>>> 
>>>>>> julia> df = readtable("schools8_samples.csv", allowcomments=true)
>>>>>> 3x7 DataFrame
>>>>>> |-------|---------------|---------|---------|
>>>>>> | Col # | Name          | Eltype  | Missing |
>>>>>> | 1     | lp__          | Float64 | 0       |
>>>>>> | 2     | accept_stat__ | Float64 | 0       |
>>>>>> | 3     | stepsize__    | Float64 | 0       |
>>>>>> | 4     | treedepth__   | Int64   | 0       |
>>>>>> | 5     | n_leapfrog__  | Int64   | 0       |
>>>>>> | 6     | n_divergent__ | Int64   | 0       |
>>>>>> | 7     | mu            | Float64 | 0       |
>>>>>> 
>>>>>> 
>>>>>> <schools8_samples.csv>
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>
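
For reference, the kind of cleanup step Rob describes in the quoted thread 
(dropping the "# Adaptation terminated" comment line and the blank line after 
the header) might look roughly like this. It is only a sketch in Base Julia. 
The name strip_comments_and_blanks is made up for illustration, and this is 
not the code that is in Stan.jl.

```julia
# Hypothetical helper, not the actual Stan.jl cleanup code: drop blank lines
# and comment lines from delimited text before handing it to a table reader.
function strip_comments_and_blanks(lines; comment_char::Char='#')
    kept = String[]
    for line in lines
        stripped = strip(line)
        # Skip lines that are empty or whose first non-blank character
        # is the comment marker.
        if isempty(stripped) || first(stripped) == comment_char
            continue
        end
        push!(kept, String(line))
    end
    return kept
end

# The sample data from the original report, including the comment line and
# the blank line that trip up the reader.
sample = [
    "lp__,accept_stat__,stepsize__,treedepth__,n_leapfrog__,n_divergent__,mu",
    "# Adaptation terminated",
    "",
    "-19.8871,0.975123,0.303529,4,15,0,4.25051",
    "-22.1208,0.971631,0.303529,3,7,0,8.55276",
    "-23.8336,0.857954,0.303529,4,15,0,4.41087",
]

cleaned = strip_comments_and_blanks(sample)
println(join(cleaned, "\n"))
```

The same function should also accept eachline("schools8_samples.csv") instead 
of an array, so the cleaned lines can be written to a temporary file and read 
from there.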
