Re: [Jchat] Analyzing differences between periods with multiple variables changing

Raul Miller Sun, 30 Mar 2014 17:22:27 -0700

On Sun, Mar 30, 2014 at 5:12 PM, Joe Bogner <[email protected]> wrote:


> > Not to mention a 25% reduction in cars...
>
> Yes, I don't think the reduction in cars matters here though


Oh?

You've a 9% reduction in cost and a 25% reduction in cars. That means
you've a 21% increase in cost per car, which implies a significant increase
from your gas guzzler, one that was compensated for by removing a car.
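
The arithmetic behind that 21% figure can be checked in a couple of lines. This is only a sketch; the names costRatio and carRatio are made up for illustration:

```j
costRatio=: 1 - 0.09        NB. total cost fell by 9%
carRatio=: 1 - 0.25         NB. number of cars fell by 25%
echo costRatio % carRatio   NB. about 1.21333, roughly a 21% rise in cost per car
```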

If you factored in the costs of the cars themselves you'd probably have a
very different picture (you'd have a penalty for disposal costs or a gain
from sales or some mixed bag from marketing and accounting plans - but in
any event your costs change based on how you judge them).


> > I'd be tempted to set up a variety of simple models for cost, assume a
> > linear correlation and then use %. to see what kinds of numbers I get
> > from those.
> >
> > Models might be:
> >
> > constant cost
> > linear cost based on speed
> > cost based on square of speed
> > cost based on distance driven
> > cost based on mpg
> >
>
> This is somewhat similar to the path I was starting to go down. I'm not
> exactly sure how to make your suggestions actionable yet in terms of a
> model. If it's relatively simple to explain, I would be very interested
> (and others may be too).
>
>
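NB. colN n gives the atomic representation of a verb selecting column n of each row
colN=:3 :0
  {.y&{"1`''
)
NB. gerund assignment: one column-selector verb per name
'`Period Car Hours Miles TotalCost'=: colN"0 i.5
constant=: 1"1 NB. a 1 for every row
speed=: Miles % Hours
sqspeed=: speed ^ 2:
mpg=: Miles % TotalCost NB. assume cost is proportional to fuel consumed
distance=: Miles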

Now, I've got a problem: I've got five models and I've only two months. So
I'm not going to be able to do a good correlation of models to underlying
data if I limit myself to monthly averages or totals or something like
that. Instead, I'll have to consider the monthly data in aggregate.

Month1=.(4 # 1),.(i. 4),.(4 # 0.5),.(30 30 30 15),.(4 # 2.75)

Month2=.(3 # 2),.(0 1 99),.(3 # 0.5),.(30, 25, 40),.(2.75, 2.5,3.0)

aggregate=: Month1,Month2
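
To make the shape of the data concrete, here is a small self-contained sketch that rebuilds the same table and pulls out two columns. The lowercase names (m1, m2, agg, hours, miles) are illustrative only, and the selectors are written out directly rather than through colN:

```j
NB. same numbers as Month1 and Month2 above, rebuilt so this runs on its own
m1=: (4 # 1),.(i. 4),.(4 # 0.5),.(30 30 30 15),.(4 # 2.75)
m2=: (3 # 2),.(0 1 99),.(3 # 0.5),.(30 25 40),.(2.75 2.5 3)
agg=: m1,m2

hours=: 2&{"1               NB. select column 2 of every row
miles=: 3&{"1               NB. select column 3 of every row

echo $ agg                  NB. 7 5 : seven car-periods, five columns
echo miles agg              NB. 30 30 30 15 30 25 40
echo (miles % hours) agg    NB. 60 60 60 30 60 50 80 : the speed model
```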

Now, let's say I want to see how total cost correlates to each of these
models. I think we're just concerned with trends here, so we should
probably normalize magnitudes.

normalize=: %"1 +/
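
As a quick check of what normalize does: it is a hook, so each item is divided by the column total, and every column ends up summing to 1. The table t here is a made-up example, not part of the car data:

```j
normalize=: %"1 +/          NB. hook: y %"1 (+/ y)
echo normalize 1 1 2        NB. 0.25 0.25 0.5
t=: 2 2 $ 1 3 1 1           NB. hypothetical 2 by 2 table
echo +/ normalize t         NB. 1 1 : each column now sums to 1
```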

Now let's look at the data.

   ((constant,.speed,.sqspeed,.mpg,.distance)%.&normalize TotalCost) aggregate
0.997644 1.00445 1.01226 1.00182 1.00445

A perfect match would give us a 1. Smaller than 1 indicates an element of
negative correlation while greater than 1 indicates an element of positive
correlation. So these are all pretty close. So let's go with Occam's razor
(aka "pick the stupidest er... I mean simplest... thing that could possibly
work") and say that the constant contribution is a given and we
want to focus on the changes which remain after removing that.
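
For readers unsure where the "1" comes from: with a vector on the right, x %. y works out to the least-squares coefficient of x on y, that is (sum of x*y) divided by (sum of y*y). A minimal sketch, where proj is a made-up name spelling out that formula:

```j
proj=: +/@:* % +/@:*:@]     NB. (sum of x*y) % (sum of y*y)
echo 1 2 3 %. 1 2 3         NB. 1 : identical vectors are a perfect match
echo 2 4 6 %. 1 2 3         NB. 2 : x is exactly twice y
echo 2 4 6 proj 1 2 3       NB. 2 : same coefficient from the explicit formula
```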

   ((constant,.speed,.sqspeed,.mpg,.distance)%.&normalize TotalCost -&normalize constant) aggregate
|NaN error

Ouch.

Looking at the underlying data:

   (TotalCost -&normalize constant) aggregate
0 0 0 0 0 _0.012987 0.012987

Our total cost is almost constant.
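
One plausible reading of the NaN error (an assumption, not something verified against the interpreter's internals): the differences above sum to zero, so normalizing them again divides by zero and hands %. infinities. A sketch with a made-up residual d:

```j
normalize=: %"1 +/
d=: 0 _1 1                  NB. hypothetical residual that sums to exactly 0
echo +/ d                   NB. 0
echo normalize d            NB. 0 __ _ : J takes 0%0 as 0, the rest are infinite
```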

Let's try blaming the square of the speed instead, just for comparison
purposes:

   ((constant,.speed,.sqspeed,.mpg,.distance)%.&normalize TotalCost -&normalize sqspeed) aggregate
4.31408e_32 4.90329e_17 8.94067e_17 4.17142e_17 4.90329e_17

Almost nothing left, but at least it's not so close to the kernel that we
get an error.

Basically there's very little variation in this data, and almost any
decision we make about assigning overall blame seems equally good (or
almost equally bad). But maybe we knew that already when we noticed we had
more models than months.

I sort of cooked up this analysis mechanism on the fly - it's a bit
opportunistic in character (you have to stumble across a well fitting
model, it really only tells you whether models or maybe some combination of
models are better or worse than each other). But hopefully it at least
shows you what I was trying to put into words.

> I was either going to:
> 1. Calculate the would-be cost by holding each variable constant. Example:
> calculate the cost if the miles were the same and the speed were the
> same and then changing one at a time.
>

For that you need a model that takes you from miles to cost. But if you had
that you might not need to do any analysis to see where it's going.

That said, if you had enough data you might be able to approximate by
selecting subsets of the data (based on one column) and pretending that
your selection function is your model function.
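
A rough sketch of that subset idea, using only the Miles and TotalCost columns from the table above. The names mc, long and short are made up, and the 30 mile cutoff is an arbitrary choice for illustration:

```j
NB. Miles and TotalCost columns of the aggregate data, side by side
mc=: 30 30 30 15 30 25 40 ,. 2.75 2.75 2.75 2.75 2.75 2.5 3
miles=: 0&{"1
cost=: 1&{"1
mean=: +/ % #
long=: #~ (30 <: miles)     NB. keep rows with at least 30 miles
short=: #~ (30 > miles)     NB. keep rows with fewer than 30 miles

echo mean cost long mc      NB. 2.8
echo mean cost short mc     NB. 2.625
```

With only seven rows the two averages say very little, which is the "if you had enough data" caveat in practice.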

> 2. Calculate the impact by the ratio of each change -- assuming each is
> linear and on the same scale. 10% reduction in miles should be a 10%
> reduction in cost assuming MPH is held constant... Something like that
>

Same issue here, I imagine.
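
To illustrate with car 1 from the tables above (30 miles at 2.75 in period 1, 25 miles at 2.5 in period 2), and assuming purely for the sake of argument that cost scales linearly with miles; the names cost1 and predicted are made up:

```j
cost1=: 2.75                    NB. car 1, period 1 cost at 30 miles
predicted=: cost1 * 25 % 30     NB. scale by the change in miles
echo predicted                  NB. about 2.29167, against an actual 2.5
```

The gap between the prediction and the actual figure is exactly what a model of cost would have to account for.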

> I'll keep thinking and welcome all other ideas
>
> I have some crude code started here too that's using inverted tables:
> https://gist.github.com/joebo/fd61043076beafeace30 , just to make it more
> concrete
>

Keep in mind that ultimately you'll have to verify your math by converting
what it says back into something more concrete. Ultimately it's someone's
understanding and effort which makes the difference, and math is just a
tool to give you alternate perspectives.

Thanks,

-- 
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
