I opened https://github.com/JuliaStats/DataFrames.jl/issues/1012 because
the ModelFrame/ModelMatrix code badly needs refactoring. It was written
early in the development of Julia when I was still thinking R while writing
Julia.
Part of the motivation is to fix problems with ModelMatrix in Julia
v0.5.0-dev using the new DataFrames forumlation using NullableArrays and
CategoricalArrays.
One issue I encountered is expanding columns from terms involivng a
CategoricalArray or a PooledDataArray. If you have a NominalArray in a
model with an intercept, that term should generate k - 1 columns in the
model matrix. In R the reduced set of columns are called the contrasts.
Some will argue with that name (technically contrasts columns are defined
as being orthogonal to a constant column but that is no longer important).
One way of generating contrasts is first to generate the matrix of
indicators then generate the desired contrasts. Sometimes it is simpler to
generate the contrasts matrix directly.
Contrasts can be defined by a k by k-1 matrix. The default in R for
nominal arrays are the "treatment contrasts". The matrix defining these is
obtained by dropping the first column of an identity of size k. To
reproduce the parameterization used in SAS the last column of the identity
is dropped. For ordinal arrays polynomial contrasts are sometimes used.
Currently there is an indicatormat generic in StatsBase that creates a
Matrix{Bool}, either sparse or dense, that is the transpose of the matrix
of indicator columns. That is, it is the Matrix{Bool} of indicator rows,
not columns.
julia> indicatormat(repeat([1,2,3], inner=2))
3×6 Array{Bool,2}:
true true false false false false
false false true true false false
false false false false true true
I suggest that indicatormat methods be defined for PooledDataArray and
CategoricalArray types too.
Regarding contrasts, I think the contrasts generic should also be defined
in StatsBase. Methods would be defined in packages like DataArrays and
CategoricalArrays because they depend on the internal representation of the
array type. The primary method would be like
function contrasts{T <: Number}(m::Matrix{T}, a::NominalArray)
km1, k = size(m)
nlev = length(levels(a))
if k ≠ nlev || km1 ≠ k - 1
throw(DimensionMismatch("m of size $(size(m)) should be $(nlev - 1)
× $nlev"))
end
m[:, a.refs]
end
Contrast types can be expressed as functions that map k to a k - 1 x k
matrix.
contrasts(f::Function, a::NominalArray) = contrasts(f(length(levels(a))), a)
contrTreatment(k) = eye(k)[2:end, :]
contrasts(a::NominalArray) = contrasts(contrTreatment, a)
Most uses require the contrasts columns so we could consider whether to
stick with the indicatormat convention or to return contrast columns rather
than contrast rows.
--
You received this message because you are subscribed to the Google Groups
"julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.