[julia-stats] RFC: Generics for indicators, constrasts, etc. in StatsBase?

Douglas Bates Wed, 13 Jul 2016 09:49:23 -0700

I opened https://github.com/JuliaStats/DataFrames.jl/issues/1012 because 
the ModelFrame/ModelMatrix code badly needs refactoring.  It was written 
early in the development of Julia when I was still thinking R while writing 
Julia.


 Part of the motivation is to fix problems with ModelMatrix in Julia 
v0.5.0-dev using the new DataFrames forumlation using NullableArrays and 
CategoricalArrays.

One issue I encountered is expanding columns from terms involivng a 
CategoricalArray or a PooledDataArray.  If you have a NominalArray in a 
model with an intercept, that term should generate k - 1 columns in the 
model matrix.  In R the reduced set of columns are called the contrasts. 
 Some will argue with that name (technically contrasts columns are defined 
as being orthogonal to a constant column but that is no longer important).

One way of generating contrasts is first to generate the matrix of 
indicators then generate the desired contrasts.  Sometimes it is simpler to 
generate the contrasts matrix directly.

Contrasts can be defined by a k by k-1 matrix.  The default in R for 
nominal arrays are the "treatment contrasts".  The matrix defining these is 
obtained by dropping the first column of an identity of size k.  To 
reproduce the parameterization used in SAS the last column of the identity 
is dropped.  For ordinal arrays polynomial contrasts are sometimes used.

Currently there is an indicatormat generic in StatsBase that creates a 
Matrix{Bool}, either sparse or dense, that is the transpose of the matrix 
of indicator columns.  That is, it is the Matrix{Bool} of indicator rows, 
not columns.

julia> indicatormat(repeat([1,2,3], inner=2))
3×6 Array{Bool,2}:
  true   true  false  false  false  false
 false  false   true   true  false  false
 false  false  false  false   true   true

I suggest that indicatormat methods be defined for PooledDataArray and 
CategoricalArray types too.

Regarding contrasts, I think the contrasts generic should also be defined 
in StatsBase.  Methods would be defined in packages like DataArrays and 
CategoricalArrays because they depend on the internal representation of the 
array type.  The primary method would be like

function contrasts{T <: Number}(m::Matrix{T}, a::NominalArray)
    km1, k = size(m)
    nlev = length(levels(a))
    if k ≠ nlev || km1 ≠ k - 1
        throw(DimensionMismatch("m of size $(size(m)) should be $(nlev - 1) 
× $nlev"))
    end
    m[:, a.refs]
end

Contrast types can be expressed as functions that map k to a k - 1 x k 
matrix.

contrasts(f::Function, a::NominalArray) = contrasts(f(length(levels(a))), a)

contrTreatment(k) = eye(k)[2:end, :]

contrasts(a::NominalArray) = contrasts(contrTreatment, a)

Most uses require the contrasts columns so we could consider whether to 
stick with the indicatormat convention or to return contrast columns rather 
than contrast rows. 

-- 
You received this message because you are subscribed to the Google Groups 
"julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

[julia-stats] RFC: Generics for indicators, constrasts, etc. in StatsBase?

Reply via email to