Upping work_mem did roughly halve the time, but after thinking about
Shaun's suggestion, I figured it's better to calculate this stuff once and
then store it. So here is how the table looks now:

                                          Table "public.stop_event"
       Column        |            Type             |                        Modifiers
---------------------+-----------------------------+----------------------------------------------------------
 stop_time           | timestamp without time zone | not null
 stop                | integer                     | not null
 bus                 | integer                     | not null
 direction           | integer                     | not null
 route               | integer                     | not null
 id                  | bigint                      | not null default nextval('stop_event_id_seq'::regclass)
 previous_stop_event | bigint                      |
Indexes:
    "stop_event_pkey" PRIMARY KEY, btree (id)
    "stop_event_previous_stop_event_idx" btree (previous_stop_event)
Foreign-key constraints:
    "stop_event_direction_id_fkey" FOREIGN KEY (direction) REFERENCES
direction(id)
    "stop_event_previous_stop_event_fkey" FOREIGN KEY (previous_stop_event)
REFERENCES stop_event(id)
    "stop_event_route_fkey" FOREIGN KEY (route) REFERENCES route(id)
    "stop_event_stop" FOREIGN KEY (stop) REFERENCES stop(id)
Referenced by:
    TABLE "stop_event" CONSTRAINT "stop_event_previous_stop_event_fkey"
FOREIGN KEY (previous_stop_event) REFERENCES stop_event(id)

previous_stop_event simply references the previous (by stop_time) stop
event for the combination of stop, route, and direction. I have
successfully populated this column for my existing test data. However, when
I try to do a test self-join using it, Postgres does two seq scans across
the whole table, even though I have indexes on both id and
previous_stop_event: http://explain.depesz.com/s/ctck. Any idea why those
indexes are not being used?
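
For reference, the test join is roughly of the following shape (a simplified
sketch only; the exact query and plan are at the link above):

-- Simplified sketch: pair each stop event with its predecessor.
SELECT cur.id,
       cur.stop_time - prev.stop_time AS time_since_previous_stop
FROM stop_event AS cur
JOIN stop_event AS prev ON prev.id = cur.previous_stop_event;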

Thank you again,
-Eli

On Thu, Aug 21, 2014 at 9:05 AM, Shaun Thomas <stho...@optionshouse.com> wrote:
>
>> On 08/21/2014 08:29 AM, Eli Naeher wrote:
>>
>>> With around 1.2 million rows, this takes 20 seconds to run. 1.2 million
>>> rows is only about a week's worth of data, so I'd like to figure out a
>>> way to make this faster.
>>>
>>
>> Well, you'll probably be able to reduce the run time a bit, but even with
>> really good hardware and all in-memory processing, you're not going to see
>> significant run-time improvements with that many rows. This is one of the
>> problems that reporting-specific structures, such as fact tables, were
>> designed to address.
>>
>> Repeatedly reprocessing the same week/month/year aggregate worth of several
>> million rows will just take linearly longer with each iteration as the data
>> size increases. You need to maintain up-to-date aggregates on the metrics
>> you actually want to measure, so you're only reading the few hundred rows
>> you introduce every update period. You can retrieve that kind of result in
>> a few milliseconds.
>>
>> --
>> Shaun Thomas
>> OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
>> 312-676-8870
>> stho...@optionshouse.com
>>

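P.S. To make sure I follow the fact-table suggestion above, here is the rough
shape of a rollup I might maintain instead of re-aggregating the raw rows each
time. The table and column names are invented for illustration, and the real
metrics would differ (e.g. average headway rather than a bare count):

-- Hypothetical rollup table (names invented for illustration only).
CREATE TABLE stop_event_daily_rollup (
    day         date    NOT NULL,
    stop        integer NOT NULL,
    route       integer NOT NULL,
    direction   integer NOT NULL,
    event_count bigint  NOT NULL,
    PRIMARY KEY (day, stop, route, direction)
);

-- Run once per day (e.g. from cron): fold in only yesterday's rows instead
-- of re-scanning the whole raw table. Reports then read this small rollup
-- rather than millions of stop_event rows.
INSERT INTO stop_event_daily_rollup (day, stop, route, direction, event_count)
SELECT stop_time::date, stop, route, direction, count(*)
FROM stop_event
WHERE stop_time >= current_date - 1
  AND stop_time <  current_date
GROUP BY stop_time::date, stop, route, direction;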