Actually I think this should be an INNER JOIN -- not a LEFT JOIN.
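
A rough sketch of what I mean, untested, and assuming the id column from my
earlier suggestion exists and is indexed:

SELECT i1.lat, i1.lon
FROM integer_test AS i1
INNER JOIN (SELECT id FROM integer_test
            WHERE lon>$lon1 AND lon<$lon2) AS i2
        ON i1.id = i2.id
WHERE i1.lat>$lat1 AND i1.lat<$lat2;

With an INNER JOIN, rows that fail the lon test drop out of the result; the
LEFT JOIN version keeps every row that passes the lat test whether or not it
matches on lon, which isn't what we want for a bounding-box select.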

A

On Apr 23, Adam Wolff wrote:

> I didn't look through your code very carefully. Have you confirmed (using 
> EXPLAIN) that your select query is using the index?
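
For reference, the check I meant is just prefixing the query with EXPLAIN,
e.g. against the index1 (lat,lon) index your script builds:

EXPLAIN SELECT lat,lon FROM integer_test
WHERE lat>$lat1 AND lat<$lat2 AND lon>$lon1 AND lon<$lon2;

If the "key" column comes back NULL, or "rows" is close to the full table
size, the index isn't doing much for you.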
> 
> I don't know much about the MySQL optimizer, but it's possible that this query:
> > $query1="SELECT lat,lon from integer_test WHERE lat>$lat1 and lat<$lat2 
> > and lon>$lon1 and lon<$lon2";
> Actually runs through the table four times instead of twice, and maybe
> can't even use the index for the whole query.
> 
> Is the performance different if you use a subquery? Something like:
> SELECT integer_test.id, integer_test.lat, integer_test.lon
>  FROM integer_test
>  LEFT JOIN (SELECT id FROM integer_test
>                    WHERE lon>$lon1 and lon<$lon2) AS i2
>  ON integer_test.id = i2.id
>  WHERE integer_test.lat>$lat1 and integer_test.lat<$lat2
> 
> assuming that you have independent indices on lat and lon. I didn't try
> this so the syntax may be wrong. You could also dispense with the id idea,
> but if so you should probably declare (lat,lon) as your primary key.
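
Note that the script below only builds the composite index1 (lat,lon), so the
independent-indices variant would need two single-column indexes added first;
lat_idx and lon_idx here are just names I made up:

CREATE INDEX lat_idx ON integer_test (lat);
CREATE INDEX lon_idx ON integer_test (lon);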
> 
> A
> 
> 
> On Apr 23, Nick Hill wrote:
> 
> > Hello
> > 
> > I have been looking at planning the database strategy for openstreetmap
> > (http://www.openstreetmap.org).
> > 
> > There are several data types stored in tables with longitude and latitude
> > columns. Select statements work by selecting
> > 
> > where lat>$lat1 and lat<$lat2 and lon>$lon1 and lon<$lon2
> > 
> > I have run many empirical tests and have concluded:
> > 
> > 1) I can improve performance by a factor of 2-2.5 by changing the double
> > lat/lon columns to integers and then selecting on integers (see the sketch
> > below).
> > 
> > 2) For each 10-fold increase in the number of records, select queries take
> > twice as long. For each doubling of the number of returned records, there
> > is a sqrt(2) increase in select query time.
> > 
> > All this is assuming all relevant database information is in memory.
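
For reference: the double-to-integer mapping the script below uses appears to
be degrees scaled by 1e7 (its values run from -1800000000 to +1799999999),
which still fits MySQL's signed INT range of +/-2147483647. Something like:

SELECT CAST(ROUND(51.5074 * 10000000) AS SIGNED) AS lat_int,
       CAST(ROUND(-0.1278 * 10000000) AS SIGNED) AS lon_int;
-- gives 515074000, -1278000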
> > 
> > 
> > As the database grows, it would likely improve performance to split an
> > individual table into several thousand tables, using the file system's
> > directory b-tree algorithm to effectively pre-select the data before the
> > query is handed to the MySQL engine. This is not a neat solution. A much
> > better way would be to improve MySQL index performance on very large
> > numbers of records.
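
A single-table variant of that pre-selection idea, as a rough sketch only --
the tile column, the index name, and the deliberately crude DIV-based tiling
are made up for illustration:

ALTER TABLE integer_test ADD COLUMN tile INT NOT NULL DEFAULT 0;
UPDATE integer_test SET tile = (lat DIV 10000000) * 1000 + (lon DIV 10000000);
CREATE INDEX tile_idx ON integer_test (tile, lat, lon);

A bounding-box query would then list the few tile values it overlaps first,
so the b-tree only has to descend the nodes covering that area.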
> > 
> > Given that there is such a strong relationship between the number of records
> > returned and query time, I conclude that the whole index tree is being
> > matched, roughly root-x times for x records returned. If all the records we
> > are matching are under a single node, or under a small number of nodes, in
> > the index tree, perhaps there is some way of telling the database engine to
> > ignore the rest of the index tree.
> > 
> > Could this work, or am I misunderstanding how the index tree works? Are there
> > existing optimisations which can de-couple the relationship between the
> > number of records and query time where the records I am selecting are within
> > a small range?
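
As far as I understand b-tree indexes: with the composite index1 (lat,lon)
your script builds, only the lat range can narrow the descent; the lon
condition is then checked against every index entry inside that lat band.
For your square tiles that band holds about sqrt(N*b) rows (N points in the
plane, b rows returned), which lines up with the sqrt relationship you
measured. That's also why independent indexes, or some kind of tile-style
pre-selection, seem worth testing.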
> > 
> > 
> > 
> > Background information:
> > 
> > We can boil all this down to a mathematical relationship where
> > query1 selects s records from a dataset of r records, and
> > query2 selects b records from a dataset of c records.
> > 
> > Tquery1 is the time to execute query1 and Tquery2 is the time to execute query2.
> > 
> > Tquery2 = Tquery1 * sqrt(b/s) * 2^log10(c/r) + ((b-s)*CONST/15000) + CONST
> > where, for my processor, CONST is 0.03.
> > 
> > 
> > This can be simplified (losing some accuracy) to:
> > 
> > Tquery2 = Tquery1 * sqrt(b/s) * 2^log10(c/r)
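
Quick sanity check of the simplified form against the raw data below: going
from the 100000-point plane to the 10000000-point plane at 25600 points per
tile, it predicts 0.118 * sqrt(1) * 2^log10(100) = 0.118 * 4 = 0.47 s, versus
a measured 0.55 s or so.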
> > 
> > 
> > 
> > 
> > Raw data for selects:
> > Creating a plan with 100000 points and averaging over 25 queries
> > Points_per_tile Query_Time
> > 25600           0.118
> > 25600           0.119
> > 25600           0.119
> > 25600           0.119
> > 12800           0.069
> > 6400            0.042
> > 3200            0.026
> > 1600            0.017
> > 800             0.011
> > 400             0.008
> > 200             0.005
> > 100             0.004
> > 50              0.003
> > Creating a plan with 1000000 points and averaging over 25 queries
> > Points_per_tile Query_Time
> > 25600           0.224
> > 25600           0.223
> > 25600           0.222
> > 25600           0.223
> > 12800           0.145
> > 6400            0.093
> > 3200            0.062
> > 1600            0.043
> > 800             0.029
> > 400             0.020
> > 200             0.015
> > 100             0.011
> > 50              0.008
> > Creating a plan with 10000000 points and averaging over 25 queries
> > Points_per_tile Query_Time
> > 25600           0.558
> > 25600           0.548
> > 25600           0.551
> > 25600           0.551
> > 12800           0.376
> > 6400            0.257
> > 3200            0.181
> > 1600            0.125
> > 800             0.087
> > 400             0.062
> > 200             0.044
> > 100             0.031
> > Creating a plan with 100000000 points and averaging over 25 queries
> > Points_per_tile Query_Time
> > 25600           2.422
> > 25600           2.332
> > 25600           2.493
> > 25600           2.446
> > 12800           1.769
> > 6400            1.295
> > 3200            0.866
> > 1600            0.657
> > 800             0.456
> > 400             0.328
> > 200             0.233
> > 100             0.159
> > 50              0.118
> > 
> > Source code for the above test:
> > #!/usr/bin/perl -w
> > 
> > #Program creates random point fields equivalent to bitfieldtest.pl except the data is stored
> > #as regular signed integers. To represent the globe as closely as possible, extents between
> > #-180 and +179.999999 will be used. Therefore, adding 180 normalises for international date line 0.
> > #Prime Meridian 180. 111111**0.000001
> > 
> > 
> > use DBI;
> > use Time::HiRes qw( usleep ualarm gettimeofday tv_interval );
> > 
> > $DBHOST = "localhost";
> > $DBNAME = "nickh";
> > $DBUSER = "nickh";
> > $DBPASS = "xxxxxx";
> > 
> > #initialise database
> > $driver = "mysql";
> > $dsn = "DBI:$driver:database=$DBNAME;host=$DBHOST";
> > $dbh = DBI->connect($dsn, $DBUSER, $DBPASS);
> > 
> > @plane_densities=(100000000);   # overridden by the full list on the next line
> > @plane_densities=(100000,1000000,10000000,100000000);
> > @tile_points=(25600,25600,25600,25600,12800,6400,3200,1600,800,400,200,100,50);
> > $query_iterations=25;
> > $debug=0;
> > 
> > sub create_bitfield;
> > sub run_tests;
> > 
> > foreach $density(@plane_densities){
> >     print "Creating a plan with $density points and averaging over $query_iterations queries\nPoints_per_tile Query_Time\n";
> >     create_bitfield($density);
> >     foreach $tilepoints(@tile_points){
> >         my $testtime=run_tests($density,$tilepoints);
> >         printf '%-14d  %.3f',$tilepoints,$testtime;
> >         print "\n";
> >     }
> > }
> > 
> > 
> > 
> > $dbh->disconnect;
> > exit 0;
> > 
> > 
> > sub create_bitfield{
> > #Takes the number of points in the point field as its argument.
> > #We first create the table without an index, as the index slows down populating the table.
> > my $prep=q~
> > DROP TABLE IF EXISTS integer_test;
> > 
> > CREATE TABLE `integer_test` (
> >   `lat` integer NOT NULL default '0',
> >   `lon` integer NOT NULL default '0'
> >   ) TYPE=MyISAM
> > ~;
> > 
> > #drop/create tables without index
> > foreach $aprepare(split(/;/,$prep)) {
> > #print "preparing $preparation";
> >     $dbh->do($aprepare);
> >     }
> > 
> > 
> > #populate table
> > for ($i=0;$i<$_[0];$i+=100){
> > #create 100 element batched inserts (value,value),(value,value) etc
> > my $insert='';
> >    for ($j=0;$j<100;$j++){
> >    if ($j>0) {$insert .= ','; }
> >    $lat=int(rand 3599999999)-1800000000;
> >    $lon=int(rand 3599999999)-1800000000;
> >    $insert .= "($lat,$lon)";
> >    }
> > 
> > my $sql="INSERT INTO `integer_test` VALUES $insert;";
> > #print "SQL1 is $sql1\n\n";
> > $dbh->do($sql);
> > 
> > }
> > #After populating table, we create indexes.
> > #print "Creating index... This may take some time\n";
> > my $sql='CREATE INDEX index1 ON integer_test (lat,lon);';
> > $dbh->do($sql);
> > }
> > 
> > 
> > sub run_tests{
> > #Parameters: Points in field; Size of tile in average number of points
> > my $number_of_points=$_[0];
> > my $target_query_return=$_[1];
> > my $proportion_of_each_axis=1/(sqrt($number_of_points/$target_query_return));
> > my $max_extent=1-$proportion_of_each_axis;  #Maximum extent for a query without extending beyond the bounds
> > my $returnedrows=0;
> > $query1total=0;
> > 
> > for($i=0;$i<$query_iterations;$i++){
> > #$lat1=(0.4+(rand ($max_extent/10)));
> > #$lon1=(0.4+(rand ($max_extent/10)));
> > 
> > $lat1=int(rand ($max_extent*3599999999))-1800000000;
> > $lon1=int(rand ($max_extent*3599999999))-1800000000;
> > 
> > $lat2=int($lat1+($proportion_of_each_axis*3599999999));
> > $lon2=int($lon1+($proportion_of_each_axis*3599999999));
> > 
> > if ($debug){print "querying bounds $lat1 $lon1 $lat2 $lon2 with queries\n";}
> > 
> > $query1="SELECT lat,lon from integer_test WHERE lat>$lat1 and lat<$lat2 and lon>$lon1 and lon<$lon2";
> > 
> > 
> > #print "Query 1: $query1\n";
> > $t0 = Time::HiRes::time();
> > my $sth = $dbh->prepare($query1);
> > $sth->execute();
> > $returnedrows+=$sth->rows();
> > #$sth->finish;
> > $elapsed=Time::HiRes::time()-$t0;
> > #print "Fetched $rows points in $elapsed \n";
> > $query1total+=$elapsed;
> > 
> > 
> > }
> > #print $returnedrows;
> > if ($debug){print "Each of the $query_iterations queries returned an average of " .
> > $returnedrows/$query_iterations . " records\n";}
> > #print "unit test time is " . ($query1total/$query_iterations) . " seconds\n\n";
> > return ($query1total/$query_iterations);
> > }
> > 
> > 
> 
