Re: [Jchat] Analyzing differences between periods with multiple variables changing

Joe Bogner Mon, 31 Mar 2014 04:03:32 -0700

Thank you, as always, for the thoughtful and detailed reply. I'm still
working through the examples.


On Sun, Mar 30, 2014 at 8:21 PM, Raul Miller <[email protected]> wrote:

>
> You've a 9% reduction in cost and a 25% reduction in cars, that means
> you've a 21% increase in cost per car, which implies a significant increase
> from your gas guzzler which was compensated for by removing a car.
>
Yes, but I don't think it's correct to spread the cost across all the cars
for what I'm trying to achieve.
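
For reference, the 21% figure is just the ratio of the two changes. A quick
sketch of the arithmetic in Python (the variable names are mine, and the 9%
and 25% inputs are the rounded figures quoted above):

```python
# Fractions remaining after the changes quoted above.
cost_ratio = 1 - 0.09   # total cost fell by 9%
cars_ratio = 1 - 0.25   # number of cars fell by 25%

# Cost per car scales by the ratio of the two.
per_car_ratio = cost_ratio / cars_ratio
per_car_change = per_car_ratio - 1

print(round(per_car_change * 100, 1))  # 21.3
```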


> If you factored in the costs of the cars themselves you'd probably have a
> very different picture (you'd have a penalty for disposal costs or a gain
> from sales or some mixed bag from marketing and accounting plans - but in
> any event your costs change based on how you judge them).
>
>
I agree, that could be an interesting way to look at it.


> colN=:3 :0
>   {.y&{"1`''
> )
> '`Period Car Hours Miles TotalCost'=: colN"0 i.5
>

In the back of my head I was hoping there was a way to select columns by
name without having to define a verb for each. This is a really neat
approach. Thank you for sharing.
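
As I understand it, colN builds a verb that picks one column from each row,
and the assignment then gives each of those verbs a name in one step. A loose
analogy in Python, only to check my own understanding (it does not mirror the
J gerund mechanics; col_n and select are names I made up, and the two rows are
the first two rows of the data frame given further down in this message):

```python
names = ["Period", "Car", "Hours", "Miles", "TotalCost"]

def col_n(i):
    """Return a function that picks column i from every row of a table."""
    return lambda rows: [row[i] for row in rows]

# One selector per name, built in a single step instead of one def per column.
select = {name: col_n(i) for i, name in enumerate(names)}

rows = [
    (1, 0, 0.5, 30, 2.75),
    (1, 1, 0.5, 30, 2.75),
]

print(select["Miles"](rows))      # [30, 30]
print(select["TotalCost"](rows))  # [2.75, 2.75]
```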


>    ((constant,.speed,.sqspeed,.mpg,.distance)%.&normalize TotalCost) aggregate
> 0.997644 1.00445 1.01226 1.00182 1.00445
>
>
It took me a little while to figure out what this was doing. There are many
ways to describe it, but my simple explanation is that it tests the
relationship between each variable's share of its column total and the
corresponding share of total cost. It runs an ordinary least squares
regression (with no intercept) of each input on TotalCost, after every column
has been rescaled to fractions of its own total.

For example, the speed column above was 1.00445, which can be calculated on
its own as:

] (normalize speed aggregate)%.(normalize TotalCost aggregate)

1.00445


In R terms, that would be:

> coefficients(lm(z$Speed~z$TotalCost+0))
z$TotalCost
   1.004446

Where z was defined as:
z<-lapply(df, function(x) { x/sum(x) })

And df is a data.frame of the values

df <- data.frame(Period=c(rep(1,4), rep(2,3)), Car=c(0,1,2,3,0,1,99),
Hours=rep(0.5,7), Miles=c(30,30,30,15,30,25,40),
TotalCost=c(rep(2.75,4),2.75,2.5, 3) )
df$Speed <- df$Miles / df$Hours
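
For anyone without R handy, here is a rough Python sketch of the same
calculation on the same values. It assumes normalize means "divide by the
column sum" (as in the lapply above) and uses the closed form for a
no-intercept least-squares slope; the function names are mine:

```python
miles = [30, 30, 30, 15, 30, 25, 40]
hours = [0.5] * 7
total_cost = [2.75, 2.75, 2.75, 2.75, 2.75, 2.5, 3]
speed = [m / h for m, h in zip(miles, hours)]

def normalize(xs):
    """Rescale a column so that it sums to 1."""
    total = sum(xs)
    return [x / total for x in xs]

def slope_through_origin(y, x):
    """Least-squares b minimizing sum((y - b*x)**2), with no intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Speed share as the response, TotalCost share as the predictor,
# matching lm(z$Speed ~ z$TotalCost + 0).
b = slope_through_origin(normalize(speed), normalize(total_cost))
print(round(b, 6))  # 1.004446
```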

I am still getting my head wrapped around why it's
] (normalize speed aggregate)%.(normalize TotalCost aggregate)
rather than
] (normalize TotalCost aggregate)%.(normalize speed aggregate)
I would have thought that it was the ratio of TotalCost as a function of
the ratio of Speed, with TotalCost as the dependent variable. I will have
to think this through more.
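
To convince myself that the direction matters, here is a rough Python check of
both orderings on the same data (same assumptions as before: normalize divides
by the column sum, and the fit has no intercept; the function names are mine).
My understanding is that x %. y with list arguments fits x as a multiple of y,
which would explain why the inputs sit on the left and TotalCost on the right,
but I have not verified that beyond matching the 1.00445 above.

```python
speed = [60, 60, 60, 30, 60, 50, 80]   # Miles / Hours from the df values
total_cost = [2.75, 2.75, 2.75, 2.75, 2.75, 2.5, 3]

def normalize(xs):
    """Rescale a column so that it sums to 1."""
    total = sum(xs)
    return [x / total for x in xs]

def slope_through_origin(y, x):
    """Least-squares b minimizing sum((y - b*x)**2), with no intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

s, c = normalize(speed), normalize(total_cost)

speed_on_cost = slope_through_origin(s, c)  # like lm(Speed ~ TotalCost + 0)
cost_on_speed = slope_through_origin(c, s)  # like lm(TotalCost ~ Speed + 0)

print(round(speed_on_cost, 6))  # 1.004446
print(round(cost_on_speed, 6))  # 0.95095
```

Only the first ordering reproduces the number in the quoted output.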


> A perfect match would give us a 1. Smaller than 1 indicates an element of
> negative correlation while greater than 1 indicates an element of positive
> correlation. So these are all pretty close. So let's go with occam's razor
> (aka "pick the stupidest er... I mean simplest... thing that could possibly
> work") and say that the constant contribution is something a given and we
> want to focus on the changes which remain after removing that.
>
>
Makes sense


>    ((constant,.speed,.sqspeed,.mpg,.distance)%.&normalize TotalCost -&normalize constant) aggregate
> |NaN error
>
> Ouch.
>
> Looking at the underlying data:
>
>    (TotalCost -&normalize constant) aggregate
> 0 0 0 0 0 _0.012987 0.012987
>
>
Now it makes sense why constant is included: it makes it simple to remove
the constant contribution. Said a different way, I think it lets us test
whether each variable changes at the same rate relative to Total Cost.
More thought needed here.
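
Here is a rough Python sketch of that residual, to check that I read
-&normalize correctly as "normalize both sides, then subtract". It assumes
constant is a column of ones (my guess from the name; its definition is not in
this message) and that normalize divides by the column sum:

```python
total_cost = [2.75, 2.75, 2.75, 2.75, 2.75, 2.5, 3]
constant = [1] * 7  # assumption: a column of ones

def normalize(xs):
    """Rescale a column so that it sums to 1."""
    total = sum(xs)
    return [x / total for x in xs]

# Share of total cost minus share of the constant, row by row.
residual = [c - k for c, k in zip(normalize(total_cost), normalize(constant))]

print([round(r, 6) for r in residual])
# [0.0, 0.0, 0.0, 0.0, 0.0, -0.012987, 0.012987]

# Both normalized columns sum to 1, so the residual sums to zero
# (up to floating-point rounding).
print(abs(sum(residual)) < 1e-12)  # True
```

One thing I notice: because the residual sums to (almost exactly) zero,
normalizing it again would mean dividing by something at or near zero. I
suspect that is related to the NaN error, but I have not confirmed it.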


> Our total cost is almost constant.
>
> Let's try blaming the square of the speed instead, just for comparison
> purposes:
>
>    ((constant,.speed,.sqspeed,.mpg,.distance)%.&normalize TotalCost -&normalize sqspeed) aggregate
> 4.31408e_32 4.90329e_17 8.94067e_17 4.17142e_17 4.90329e_17
>
> Almost nothing left, but at least it's not so close to the kernel that we
> get an error.
>
> Basically there's very little variation in this data, and almost any
> decision we make about assigning overall blame seems equally good (or
> almost equally bad). But maybe we knew that already when we noticed we had
> more models than months.
>
>
I'll work up a few more examples and test out this concept more. It looks
like it has some potential.

Thanks
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
