Inline comments.
On Sun, Mar 30, 2014 at 4:14 PM, Joe Bogner <[email protected]> wrote:
> I am analyzing some data and would like ideas on how to explain differences
> between periods. The suggestions can either be in J or completely outside
> of J. For example, perhaps I need to do some sort of weighted poisson
> regression. Or maybe the answer is that there's not enough information to
> draw any inferences.
>
> I would like to answer:
>
> 1. How much of the overall change in Total Cost Per mile can be attributed
> to different cars being driven?
> 2. How much can be attributed to less/more fuel efficient cars being driven
> farther?
> 3. How much of the change can be attributed to fuel cost?
> 4. How much of the change can be attributed to change in speed?
>
These are fuzzy questions, because they assume an attribution model.
It looks to me like questions 1 and 2 are not independent of each other.
Also, logically speaking, question 4 should depend on 1, 2 and 3 as well as
on independent issues. And there is going to be a certain amount of noise
in the data.
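To make the overlap between questions 1 and 2 concrete, here is a small
sketch. It is my own illustration, using the Month1 miles and cost figures
quoted further down; the names m1, c1 and w are just placeholders. Total
cost per mile is a mileage-weighted average of each car's cost per mile, so
which cars are present and how far each one is driven both enter through
the same weights.

```j
m1=: 30 30 30 15          NB. miles per car (Month1, with the gas guzzler)
c1=: 4 # 2.75             NB. cost per car
w=: m1 % +/ m1            NB. each car's share of total miles
(+/ c1) % +/ m1           NB. total cost per mile: 0.104762
+/ w * c1 % m1            NB. same number, as a weighted average
```

Swapping a car changes both the set of per-car rates and the weights, which
is why any split between "different cars" and "driven farther" depends on
the attribution model you pick.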
> My first cut is to look at what's the same and explain the differences
> between matching records.
>
> Here is fake data for 4 cars in period 1. All cars have traveled for 1/2
> hour, for 30 minutes at a cost of $2.75.
>
> ]Month1=.(4 # 1),.(i. 4),.(4 # 0.5),.(4 # 30),.(4 # 2.75)
> 1 0 0.5 30 2.75
> 1 1 0.5 30 2.75
> 1 2 0.5 30 2.75
> 1 3 0.5 30 2.75
>
Period, car id, time driven (expressed in hours), time driven (expressed in
minutes), cost of gas.
Sure you didn't mean for one of those to be distance driven?
> Let's replace one of the cars with a gas guzzler
>
> ]Month1=.(4 # 1),.(i. 4),.(4 # 0.5),.(30 30 30 15),.(4 # 2.75)
> 1 0 0.5 30 2.75
> 1 1 0.5 30 2.75
> 1 2 0.5 30 2.75
> 1 3 0.5 15 2.75
>
> The 4th car only went 15 miles on $2.75 of gas
>
Looks like column 3 should really be distance driven (expressed in miles).
>
> ] 4{"1 Month1 % 3{"1 Month1
> 0.0916667 0.0916667 0.0916667 0.183333
>
I'd be more comfortable if you put parentheses in the middle there:
] (4{"1 Month1) % 3{"1 Month1
0.0916667 0.0916667 0.0916667 0.183333
This does not change the answer but it saves me from wondering about
things like car id divided by dollars.
> Here is fake data for 3 cars in period 2. There are 3 cars that are similar
> and one new car.
>
> ]Month2=.(3 # 2),.(0 1 99),.(3 # 0.5),.(30, 25, 40),.(2.75, 2.5,3.0)
> 2 0 0.5 30 2.75
> 2 1 0.5 25 2.5
> 2 99 0.5 40 3
>
That would be four cars, no? Or am I harping on inconsequential grammar?
> I can calculate cost per mile:
>
> cPM=: [: +/ 4{"1 ] % [: +/ 3{"1 ]
>
This works, though I am again bemused by intermediate results.
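As written, the train divides the whole table by the total miles, then
picks column 4, then sums. A sketch of an alternative that keeps the
intermediate results meaningful (the names milesOf, costOf and cPM2 are
mine, and I am assuming column 3 is miles and column 4 is cost):

```j
Month1=: (4 # 1),.(i. 4),.(4 # 0.5),.(30 30 30 15),.(4 # 2.75)

milesOf=: 3 {"1 ]         NB. column 3: miles driven (assumed)
costOf=: 4 {"1 ]          NB. column 4: cost of gas (assumed)
cPM2=: +/@:costOf % +/@:milesOf   NB. fork: total cost % total miles

cPM2 Month1               NB. 0.104762, same as cPM Month1
```

Here the only things ever divided are a sum of costs and a sum of miles.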
> ] cPM Month1
> 0.104762
>
> ] cPM Month2
> 0.0868421
>
> The total cost per mile went down in Month 2 because the gas guzzler
> (hummer? -- car #3) is gone and car 99 was added with better fuel economy,
> and car #2's cost per mile got slightly worse.
>
> We can look at these matches to better understand:
>
> ]Month1Matches=:((1{"1 Month1) e. (1{"1 Month2)) # Month1
> 1 0 0.5 30 2.75
> 1 1 0.5 30 2.75
>
> ]Month2Matches=:((1{"1 Month2) e. (1{"1 Month1)) # Month2
> 2 0 0.5 30 2.75
> 2 1 0.5 25 2.5
>
> Car #1 traveled 5 fewer miles and the total cost changed by 25 cents
>
Also:
2.75 2.5%30 25
0.0916667 0.1
The 25 mile drive costs slightly more per mile, so its economy is slightly
worse.
> 16% reduction in miles
>
> (3{"1 Month2Matches % 3{"1 Month1Matches)-1
> 0 _0.166667
>
Not to mention a 25% reduction in cars...
> And only a 9% reduction in cost
>
> (4{"1 Month2Matches % 4{"1 Month1Matches)-1
> 0 _0.0909091
>
> One possible explanation is the fuel cost went up.
>
Yes.
> Using the 30 minutes of drive time, I can conclude, the car also slowed
> down to an average speed of 50 mph.
>
> At this point I feel like I'm starting to get stuck and making fuzzy
> assumptions. There seems like there should be a better way. If there were
> hundreds or thousands of records, would that change things?
>
Yes.
> Thanks for any ideas
>
I'd be tempted to set up a variety of simple models for cost, assume a
linear correlation and then use %. to see what kinds of numbers I get from
those.
Models might be:
constant cost
linear cost based on speed
cost based on square of speed
cost based on distance driven
cost based on mpg
...
If two models are closely related, that will mess up this kind of analysis,
but you can see that rather easily by removing one model from a related
pair and seeing how that alters the results (for example, if you had square
of distance driven and square of speed, or if "depends entirely on
model 0" vs. "depends entirely on model 1" were in there).
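A minimal sketch of what I mean, fitting just "constant cost" plus "cost
based on distance driven" to the seven fake records. The names data, miles,
cost, speed, X, coef and resid are mine, and I am again assuming column 3
is miles and column 2 is hours:

```j
Month1=: (4 # 1),.(i. 4),.(4 # 0.5),.(30 30 30 15),.(4 # 2.75)
Month2=: (3 # 2),.(0 1 99),.(3 # 0.5),.(30 25 40),.(2.75 2.5 3.0)
data=: Month1 , Month2

miles=: 3 {"1 data
cost=: 4 {"1 data
speed=: miles % 2 {"1 data     NB. mph, assuming column 2 is hours

X=: 1 ,. miles                 NB. one column per model: constant, miles
coef=: cost %. X               NB. least-squares coefficients
resid=: cost - X +/ . * coef   NB. what the two models leave unexplained
+/ *: resid                    NB. residual sum of squares

NB. Every record here has 0.5 hours, so speed is exactly 2 * miles.
NB. Putting both speed and miles in X would give two columns that are
NB. multiples of each other: the "closely related models" problem.
```

With only seven rows and a constant drive time this says very little, but
the same few lines work unchanged on thousands of records.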
You can get fancier, of course, but sometimes a good approximation is
better than a more precise result.
Thanks,
--
Raul