On Mon, Feb 15, 2016 at 3:45 PM, Greg Stark <st...@mit.edu> wrote:
> I was thinking about this over the past couple weeks. I'm starting to
> think the quicksort runs gives at least the beginnings of a way
> forward on this front.

As I've already pointed out several times, I wrote a tool that makes
it easy to load sortbenchmark.org data into a PostgreSQL table:

https://github.com/petergeoghegan/gensort

(You should use the Python script that invokes the "gensort" utility
-- see its "--help" display for details).

This seems useful as a standard benchmark, since it's perfectly
deterministic, allowing the user to create arbitrarily large tables to
use for sort benchmarks. Still, it doesn't produce data that is any
way organic; sort data is uniformly distributed. Also, it produces a
table that really only has one attribute to sort on, a text attribute.

I suggest looking at real world data, too. I have downloaded UK land
registry data, which is a freely available dataset about property
sales in the UK since the 1990s, of which there have apparently been
about 20 million (I started with a 20 million line CSV file). I've
used COPY to load the data into one PostgreSQL table.

I attach instructions on how to recreate this, and some suggested
CREATE INDEX statements that seemed representative to me. There are a
variety of Postgres data types in use, including UUID, numeric, and
text. The final Postgres table is just under 3GB. I will privately
make available a URL that those CC'd here can use to download a custom
format dump of the table, which comes in at 1.1GB (ask me off-list if
you'd like to get that URL, but weren't CC'd here). This URL is
provided as a convenience for reviewers, who can skip my detailed
instructions.

An expensive rollup() query on the "land_registry_price_paid_uk" table
is interesting. Example:

select date_trunc('year', transfer_date), county, district, city,
sum(price) from land_registry_price_paid_uk group by rollup (1,
county, district, city);

Performance is within ~5% of an *internal* sort with the patch series
applied, even though ~80% of time is spent copying and sorting
SortTuples overall in the internal sort case (the internal case cannot
overlap sorting and aggregate processing, since it has no final merge
step). This is a nice demonstration of how this work has significantly
blurred the line between internal and external sorts.

-- 
Peter Geoghegan
Instructions
============

CSV File
--------

The land registry file from http://data.gov.uk is 3.2GB. A CSV file that can be
loaded into PostgreSQL that has organic data. No registration required. See
https://theodi.org/blog/the-status-of-csvs-on-datagovuk for details on
downloaded the file pp-complete.csv.

SQL
---

begin;
create table land_registry_price_paid_uk(
  transaction uuid,
  price numeric,
  transfer_date date,
  postcode text,
  property_type char(1),
  newly_built boolean,
  duration char(1),
  paon text,
  saon text,
  street text,
  locality text,
  city text,
  district text,
  county text,
  ppd_category_type char(1));

copy land_registry_price_paid_uk FROM '/home/pg/Downloads/pp-complete.csv' with 
(format csv, freeze true, encoding 'win1252', header false, null '', quote '"', 
force_null (postcode, saon, paon, street, locality, city, district));
commit;

Resulting table
---------------

postgres=# \dt+
                              List of relations
 Schema │            Name             │ Type  │ Owner │  Size   │ 
Description 
────────┼─────────────────────────────┼───────┼───────┼─────────┼─────────────
 public │ land_registry_price_paid_uk │ table │ pg    │ 2779 MB │ 
(1 row)

Interesting Indexes
===================

Many attribute index (Low cardinality leading attributes):

postgres=# create index on land_registry_price_paid_uk_suffix(county, district, 
city, locality, street);

UUID pk index (UUID type, high cardinality):

postgres=# create index on land_registry_price_paid_uk (transaction);

Price index (numeric, moderate cardinality):

postgres=# create index on land_registry_price_paid_uk (price);

Preview
=======

pg@hamster:~$ head ~/Downloads/pp-complete.csv
"{0C7ADEF5-878D-4066-B785-0000003ED74A}","163000","2003-02-21 00:00","UB5 
4PJ","T","N","F","106","","READING 
ROAD","NORTHOLT","NORTHOLT","EALING","GREATER LONDON","A"
"{35F67271-ABD4-40DA-AB09-00000085B9D3}","247500","2005-07-15 00:00","TA19 
9DD","D","N","F","58","","ADAMS MEADOW","ILMINSTER","ILMINSTER","SOUTH 
SOMERSET","SOMERSET","A"
"{B20B1C74-E8E1-4137-AB3E-0000011DF342}","320000","2010-09-10 00:00","W4 
1DZ","F","N","L","58","","WHELLOCK ROAD","","LONDON","EALING","GREATER 
LONDON","A"
"{7D6B0915-C56B-4275-AF9B-00000156BCE7}","104000","1997-08-27 00:00","NE61 
2BH","D","N","F","17","","WESTGATE","MORPETH","MORPETH","CASTLE 
MORPETH","NORTHUMBERLAND","A"
"{47B60101-B64C-413D-8F60-000002F1692D}","147995","2003-05-02 00:00","PE33 
0RU","D","N","F","4","","MASON GARDENS","WEST WINCH","KING'S LYNN","KING'S LYNN 
AND WEST NORFOLK","NORFOLK","A"
"{51F797CA-7BEB-4958-821F-000003E464AE}","110000","2013-03-22 00:00","NR35 
2SF","T","N","F","5","","WILD FLOWER WAY","DITCHINGHAM","BUNGAY","SOUTH 
NORFOLK","NORFOLK","A"
"{AB8CE5AE-DFED-46CE-A682-000004533355}","180000","2003-12-23 00:00","SA8 
4DG","D","Y","F","17","","OAKWOOD DRIVE","CLYDACH","SWANSEA","NEATH PORT 
TALBOT","NEATH PORT TALBOT","A"
"{1FB4E294-102B-475C-876A-000005E78B1E}","177000","2012-05-29 00:00","BS4 
5AW","S","N","F","11","","WATER LANE","BRISLINGTON","BRISTOL","CITY OF 
BRISTOL","CITY OF BRISTOL","A"
"{3B0DA29C-C89A-4FAA-918A-0000074FA0E0}","190000","2013-10-04 00:00","SN14 
8LU","T","N","F","148","","HIGH STREET","MARSHFIELD","CHIPPENHAM","SOUTH 
GLOUCESTERSHIRE","SOUTH GLOUCESTERSHIRE","A"
"{9F669E50-AF1E-4E8A-9A66-0000075006C7}","68000","1999-05-21 00:00","CR0 
2JJ","S","N","F","83","","WHITEHORSE ROAD","","CROYDON","CROYDON","GREATER 
LONDON","A"

Notes from UK Land Registry authority:
======================================

Transaction unique identifier   A reference number which is generated 
automatically recording each published sale. The number is unique and will 
change each time a sale is recorded.
Price   Sale price stated on the transfer deed.
Date of Transfer        Date when the sale was completed, as stated on the 
transfer deed.
Postcode        This is the postcode used at the time of the original 
transaction. Note that postcodes can be reallocated and these changes are not 
reflected in the Price Paid Dataset.
Property Type   D = Detached, S = Semi-Detached, T = Terraced, F = 
Flats/Maisonettes, O = Other 
Note that: 
- we only record the above categories to describe property type, we do not 
separately identify bungalows. 
- end-of-terrace properties are included in the Terraced category above. 
- ‘Other’ is only valid where the transaction relates to a property type 
that is not covered by existing values.
Old/New Indicates the age of the property and applies to all price paid 
transactions, residential and non-residential.
Y = a newly built property, N = an established residential building
Duration        Relates to the tenure: F = Freehold, L= Leasehold etc.
Note that Land Registry does not record leases of 7 years or less in the Price 
Paid Dataset.
PAON    Primary Addressable Object Name. If there is a sub-building for example 
the building is divided into flats, see Secondary Addressable Object Name 
(SAON).
SAON    Secondary Addressable Object Name. If there is a sub-building, for 
example the building is divided into flats, there will be a SAON.
Street   
Locality         
Town/City        
District         
County   
PPD Category Type       Indicates the type of Price Paid transaction.
A = Standard Price Paid entry, includes single residential property sold for 
full market value.
B = Additional Price Paid entry including transfers under a power of 
sale/repossessions, buy-to-lets (where they can be identified by a Mortgage) 
and transfers to non-private individuals.

Note that category B does not separately identify the transaction types stated.
Land Registry has been collecting information on Category A transactions from 
January 1995. Category B transactions were identified from October 2013.
Record Status - monthly file only       Indicates additions, changes and 
deletions to the records.(see guide below).
A = Addition
C = Change
D = Delete.

Note that where a transaction changes category type due to misallocation (as 
above) it will be deleted from the original category type and added to the 
correct category with a new transaction unique identifier.
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to