Have you looked at https://github.com/JuliaStats/DataFrames.jl/pull/870? Many (but not all, I think) of these ideas are incorporated there.

On Wednesday, July 13, 2016 at 12:48:55 PM UTC-4, Douglas Bates wrote:
>
> I opened https://github.com/JuliaStats/DataFrames.jl/issues/1012 because
> the ModelFrame/ModelMatrix code badly needs refactoring. It was written
> early in the development of Julia when I was still thinking R while writing
> Julia.
>
> Part of the motivation is to fix problems with ModelMatrix in Julia
> v0.5.0-dev using the new DataFrames formulation using NullableArrays and
> CategoricalArrays.
>
> One issue I encountered is expanding columns from terms involving a
> CategoricalArray or a PooledDataArray. If you have a NominalArray in a
> model with an intercept, that term should generate k - 1 columns in the
> model matrix. In R the reduced set of columns is called the contrasts.
> Some will argue with that name (technically contrasts columns are defined
> as being orthogonal to a constant column but that is no longer important).
>
> One way of generating contrasts is first to generate the matrix of
> indicators then generate the desired contrasts. Sometimes it is simpler to
> generate the contrasts matrix directly.
>
> Contrasts can be defined by a k by k-1 matrix. The default in R for
> nominal arrays is the "treatment contrasts". The matrix defining these is
> obtained by dropping the first column of an identity of size k. To
> reproduce the parameterization used in SAS the last column of the identity
> is dropped. For ordinal arrays polynomial contrasts are sometimes used.
>
> Currently there is an indicatormat generic in StatsBase that creates a
> Matrix{Bool}, either sparse or dense, that is the transpose of the matrix
> of indicator columns. That is, it is the Matrix{Bool} of indicator rows,
> not columns.
>
>     julia> indicatormat(repeat([1,2,3], inner=2))
>     3×6 Array{Bool,2}:
>       true   true  false  false  false  false
>      false  false   true   true  false  false
>      false  false  false  false   true   true
>
> I suggest that indicatormat methods be defined for PooledDataArray and
> CategoricalArray types too.
>
> Regarding contrasts, I think the contrasts generic should also be defined
> in StatsBase. Methods would be defined in packages like DataArrays and
> CategoricalArrays because they depend on the internal representation of the
> array type. The primary method would be like
>
>     function contrasts{T <: Number}(m::Matrix{T}, a::NominalArray)
>         km1, k = size(m)
>         nlev = length(levels(a))
>         if k ≠ nlev || km1 ≠ k - 1
>             throw(DimensionMismatch("m of size $(size(m)) should be $(nlev - 1) × $nlev"))
>         end
>         m[:, a.refs]
>     end
>
> Contrast types can be expressed as functions that map k to a k - 1 x k
> matrix.
>
>     contrasts(f::Function, a::NominalArray) = contrasts(f(length(levels(a))), a)
>
>     contrTreatment(k) = eye(k)[2:end, :]
>
>     contrasts(a::NominalArray) = contrasts(contrTreatment, a)
>
> Most uses require the contrasts columns so we could consider whether to
> stick with the indicatormat convention or to return contrast columns rather
> than contrast rows.
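
For anyone reading along, here is a minimal sketch of the row convention described in the quoted message. It is written for current Julia (1.x), where `eye` is no longer available, so the identity comes from `LinearAlgebra.I`. It does not use CategoricalArrays: `refs` is a plain vector of integer level codes standing in for `a.refs`, and `contr_treatment` and `contrast_rows` are hypothetical names chosen for illustration, not functions from StatsBase or DataFrames.

```julia
using LinearAlgebra  # provides the identity object `I`

# Hypothetical helper: treatment contrasts as a (k - 1) × k matrix,
# i.e. a k × k identity with its first row dropped.
contr_treatment(k::Integer) = Matrix{Float64}(I, k, k)[2:end, :]

# Hypothetical stand-in for the proposed `contrasts` method.
# `refs` holds integer level codes in 1:k (the role played by `a.refs`).
function contrast_rows(m::AbstractMatrix{<:Number},
                       refs::AbstractVector{<:Integer},
                       k::Integer)
    size(m) == (k - 1, k) ||
        throw(DimensionMismatch("m of size $(size(m)) should be $(k - 1) × $k"))
    return m[:, refs]  # one column per observation
end

refs = repeat([1, 2, 3], inner = 2)                  # same data as the indicatormat example
rows = contrast_rows(contr_treatment(3), refs, 3)    # 2 × 6: contrast rows
cols = permutedims(rows)                             # 6 × 2: contrast columns
```

The final `permutedims` call illustrates the alternative raised at the end of the message: returning contrast columns (one row per observation) instead of contrast rows.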
