Have you looked at https://github.com/JuliaStats/DataFrames.jl/pull/870? 
Many (but not all, I think) of these ideas are incorporated there.

On Wednesday, July 13, 2016 at 12:48:55 PM UTC-4, Douglas Bates wrote:
>
> I opened https://github.com/JuliaStats/DataFrames.jl/issues/1012 because 
> the ModelFrame/ModelMatrix code badly needs refactoring.  It was written 
> early in the development of Julia when I was still thinking R while writing 
> Julia.
>
>  Part of the motivation is to fix problems with ModelMatrix in Julia 
> v0.5.0-dev using the new DataFrames forumlation using NullableArrays and 
> CategoricalArrays.
>
> One issue I encountered is expanding columns from terms involivng a 
> CategoricalArray or a PooledDataArray.  If you have a NominalArray in a 
> model with an intercept, that term should generate k - 1 columns in the 
> model matrix.  In R the reduced set of columns are called the contrasts. 
>  Some will argue with that name (technically contrasts columns are defined 
> as being orthogonal to a constant column but that is no longer important).
>
> One way of generating contrasts is first to generate the matrix of 
> indicators then generate the desired contrasts.  Sometimes it is simpler to 
> generate the contrasts matrix directly.
>
> Contrasts can be defined by a k by k-1 matrix.  The default in R for 
> nominal arrays are the "treatment contrasts".  The matrix defining these is 
> obtained by dropping the first column of an identity of size k.  To 
> reproduce the parameterization used in SAS the last column of the identity 
> is dropped.  For ordinal arrays polynomial contrasts are sometimes used.
>
> Currently there is an indicatormat generic in StatsBase that creates a 
> Matrix{Bool}, either sparse or dense, that is the transpose of the matrix 
> of indicator columns.  That is, it is the Matrix{Bool} of indicator rows, 
> not columns.
>
> julia> indicatormat(repeat([1,2,3], inner=2))
> 3×6 Array{Bool,2}:
>   true   true  false  false  false  false
>  false  false   true   true  false  false
>  false  false  false  false   true   true
>
> I suggest that indicatormat methods be defined for PooledDataArray and 
> CategoricalArray types too.
>
> Regarding contrasts, I think the contrasts generic should also be defined 
> in StatsBase.  Methods would be defined in packages like DataArrays and 
> CategoricalArrays because they depend on the internal representation of the 
> array type.  The primary method would be like
>
> function contrasts{T <: Number}(m::Matrix{T}, a::NominalArray)
>     km1, k = size(m)
>     nlev = length(levels(a))
>     if k ≠ nlev || km1 ≠ k - 1
>         throw(DimensionMismatch("m of size $(size(m)) should be $(nlev - 
> 1) × $nlev"))
>     end
>     m[:, a.refs]
> end
>
> Contrast types can be expressed as functions that map k to a k - 1 x k 
> matrix.
>
> contrasts(f::Function, a::NominalArray) = contrasts(f(length(levels(a))), 
> a)
>
> contrTreatment(k) = eye(k)[2:end, :]
>
> contrasts(a::NominalArray) = contrasts(contrTreatment, a)
>
> Most uses require the contrasts columns so we could consider whether to 
> stick with the indicatormat convention or to return contrast columns rather 
> than contrast rows. 
>

-- 
You received this message because you are subscribed to the Google Groups 
"julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to