On Thu, Dec 12, 2013 at 9:30 AM, Sev Zaslavsky <sev...@gmail.com> wrote:
> Hello,
>
> I've got a very simple table with a very simple SELECT query, but it
> takes longer on the initial run than I'd like, so I want to see if there
> is a strategy to optimize this.
>
> Table rt_h_nbbo contains several hundred million rows. All rows for a
> given entry_date are appended to this table in an overnight process every
> night - on the order of several million rows per day.
>
> The objective is to select all of the rows for a given product_id on a
> given entry_date.
>
> There is a b-tree index on (product_id, entry_date). The index appears
> to be used correctly. I'm seeing that if the data pages are not in
> memory, nearly all of the time is spent on disk I/O. The first time, the
> query takes 21 sec. If I run this query a second time, it completes in
> approx 1-2 ms.
>
> I perceive an inefficiency here and I'd like your input on how to deal
> with it: the end result of the query is 1631 rows, on the order of a
> couple hundred KB of data. Compare that to the amount of I/O that was
> done: 1634 buffers were loaded, 16Mb per page - that's about 24 Gb of
> data!

A page is usually 8KB, not 16MB (nor 16Mb).

> Query completed in 21 sec. I'd like to be able to physically re-organize
> the data on disk so that the data for a given product_id on an entry_date
> is concentrated on a few pages instead of being scattered like I see
> here.

If you load the data in daily batches, it is probably already fairly well clustered by entry_date. If you sort the batch by product_id before bulk loading it, then it should stay pretty well clustered on (entry_date, product_id). See the sketch in the P.S. below.

Cheers,

Jeff
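
P.S. A minimal sketch of that nightly load, untested, assuming the batch
first lands in a staging table (staging_rt_h_nbbo and the single-transaction
wrapper are made-up details here, not anything from your setup):

    BEGIN;

    -- Append the day's batch sorted by product_id, so rows for the same
    -- product end up on adjacent heap pages. entry_date is constant
    -- within a batch, so sorting on product_id alone is enough.
    INSERT INTO rt_h_nbbo
    SELECT *
    FROM staging_rt_h_nbbo
    ORDER BY product_id;

    -- Empty the staging table for tomorrow's batch.
    TRUNCATE staging_rt_h_nbbo;

    COMMIT;

Once the rows for each product are contiguous, EXPLAIN (ANALYZE, BUFFERS)
on the original SELECT should show far fewer buffers read on a cold run.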