Re: Efficient counting of results

Israel Brewster Fri, 20 Oct 2017 10:08:53 -0700

On Oct 19, 2017, at 5:18 PM, Steve D'Aprano <steve+pyt...@pearwood.info> wrote:
> What t1 and t2 are, I have no idea. Your code there suggests that they are
> fields in your data records, but the contents of the fields, who knows?


t1 and t2 are *independent* timestamp fields. My apologies - I made the 
obviously false assumption that it was clear they were timestamps, or at least 
times based on the fact I was calculating "minutes late" based on them.

> 
> 
>> d10, w10, m10, y10, d25, w25, m25 AND y25
> 
> Try using descriptive variable names rather than these cryptic codes.

I did. In my original post I showed the table with names like "t1 1-5min". 
Granted, that's for illustration purposes, not actual code, but still, more 
descriptive. These codes were just to keep consistent with the alternative data 
format suggested :-)

> 
> I don't understand what is *actually* being computed here -- you say that t1
> is "on time" and t2 is "5 minutes late", but that's a contradiction: how can
> a single record be both on time and 5 minutes late?

Easily: because the record contains two DIFFERENT times. Since you want more 
concrete, we're talking departure and arrival times here. Quite easy to depart 
on-time, but arrive late, or depart late but arrive on-time.

> It also contradicts your statement that it is *date* and *key* that determines
> which late bin to use.

I never made such a statement. I said they are used to determine "WHAT on-time 
IS for the record", not WHETHER the record is on-time or not, and certainly not 
which late bin to use. To put it a different way, those are the key to a lookup 
table that tells me what T1 and T2 are *supposed* to be in order for *each one* 
to be on time.

So, for example, to completely make up some data (since it doesn't matter in 
the slightest), date could be 10/5/17 with a key of 42 (Let's say that is a 
driver ID to keep things concrete for you), and using those values tells me 
(via the lookup table) that on 10/5/17, 42 should have a T1 of 10:15 and a T2 
of 11:30. As we said, those would be departure and arrival times, so what we're 
saying is that on 10/5, driver #42 was *scheduled* to depart at 10:15 and 
arrive at their destination at 11:30. So if T1 was *actually* 10:14, and T2 
was, say 11:35, then I could say that T1 was on-time (actually, a minute early, 
but that doesn't matter), while T2 was 5 minutes late. Maybe traffic was 
horrible, or he had a flat. 

However, if the date changed to 9/1/17 (with the key still being 42), there 
could be a completely different schedule, with completely different "late" 
results, even if the *actual* values of t1 and t2 don't change. Maybe he was 
supposed to make the run early on that day, say 10:00-11:15, but forgot and 
left at the same time as he was used to, thereby making him 14 minutes late 
departing. and really late arriving, or something.

> Rather, it seems that date and key are irrelevant and
> can be ignored, it is only t1 and t2 which determine which late bins to
> update.

Except that then we have no way to know what t1 and t2 *should* be. You 
apparently made the assumption that t1 and t2 should always be some fixed 
value. In fact, what t1 and t2 should be varies based on date and key (see the 
driver example above, or Chris Angelico's pizza example also works well). For 
any given date, there are dozens of different keys with different expected 
values of t1 and t2 (in the pizza example Chris gave the key might be order 
number), and for any given key, the expected value of t1 and t2 could vary 
based on what date it is (say we restart order numbers from 1 each day to make 
it easy to know how many orders we've done that day, or, of course, same driver 
different day, depending on which example you prefer).

> 
> Another question: you're talking about *dates*, which implies a resolution of
> 1 day, but then you talk about records being "five minutes late" which
> implies a resolution of at least five minutes and probably one minute, if not
> seconds or milliseconds. Which is it? My guess is that you're probably
> talking about *timestamps* (datetimes) rather than *dates*.

As stated, the data has two timestamp fields T1 and T2. So yes, the resolution 
of the data is "one minute" (we ignore sub-minute timings). However (and this 
addresses your understanding below as well), I am trying to get data for the 
date, week-to-date, month-to-date, and year-to-date. So there is four different 
"date" resolution bins in addition to the "minute" resolution bins.

Perhaps a better approach to explaining is to pose the question the report is 
trying to answer:

For the given date, how many departures were on time? How many were 1-5 minutes 
late? 6-15 minutes late? What about this week: how many on-time, 1-5 minutes 
late, etc? What about this entire month (including the given date)? What about 
this year (again, including the given date and month)? How about arrivals - 
same questions. 

As you can hopefully see now, if a departure happened this week, it probably 
also happened this month (although that is not necessarily the case, since 
weeks can cross month boundaries), and if it happened this date or this month, 
it *definitely* happened this year. As such, a given departure *likely* will be 
counted in multiple date "groups", if you will.

The end result should be a table like the one I posted in the original 
question: time frame covered on the horizontal axis (YTD, MTD etc.), and "late" 
groups for T1 and T2 on the vertical.

> You want to process all your records, and decide "as of
> now, how late is each record", and then report *cumulative* subtotals for a
> number of arbitrary groups: not late yet, five minutes late, one day late,
> one year late, etc.

Just to clarify, as stated, the late groups are not-late, 1-5 minutes late, and 
6-15 minutes late. Also as stated in the original message, anything over 15 
minutes late is dealt with separately, and therefore ignored for the purposes 
of this report.

> 
> Suggestion:
> 
> Start with just the "activation time" and "now", and calculate the difference.
> If they are both given in seconds, you can just subtract:
> 
>    lateness = now - activation_time
> 
> to determine how late that record is. If they are Datetime objects, use a
> Timedelta object.
> 
> That *single* computed field, the lateness, is enough to determine which
> subtotals need to be incremented.

Well, *two* computed fields, one for T1 and one for T2, which are counted 
separately.

> Start by dividing all of time into named
> buckets, in numeric order:
> 
> ...
> for record in records:
>    lateness = now - record.activation_date
>    for end, bucket in buckets:
>        if lateness <= end:
>            bucket.append(record)
>        else:
>            break
> 
> And you're done!
> 
> If you want the *number of records* in a particular bucket, you say:
> 
> len(bucket)
> 
> If you want the total record amount, you say:
> 
> sum(record.total for record in bucket)
> 
> 
> (assuming your records also have a "total" field, if they're invoices say).
> 
> 
> I hope that's even vaguely helpful.

In a sense, in that it supports my initial approach. 

As Stefan Ram pointed out, there is nothing wrong with the solution I have: 
simply using if statements around the calculated lateness of t1 and t2 to 
increment the appropriate counters. I was just thinking there might be tools to 
make the job easier/cleaner/more efficient. From the responses I have gotten, 
it would seem that that is likely not the case, so I'll just say "thank you all 
for your time", and let the matter rest.

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------

> 
> 
> 
> 
>> Maybe that will help clear things up. Or not. :-)
> 
> 
> Not even a tiny bit :-(
> 
> 
> 
> 
> 
> -- 
> Steve
> “Cheer up,” they said, “things could be worse.” So I cheered up, and sure
> enough, things got worse.
> 
> -- 
> https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Efficient counting of results

Reply via email to