The technique of adding extra columns to indexes works well for Derby if
it matches your application's needs. The docs usually refer to this as a
covering index and the optimizer is pretty good at looking for cases
where it can use a covering index and avoid going to the base table.
Hopefully this will help your application.
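For example, with an index like this (the index name is just for
illustration; table and column names are from your test case below):

    create index ix_double_pk on the_table (the_double_column, pk_column)

a query that selects only pk_column with a range predicate on
the_double_column can be answered entirely from the index pages, with no
trip to the base rows.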
From the info posted, the query with the index is definitely doing an index
to base row lookup for every qualifying row. It looks like for this
data distribution and number of qualifying rows this is a worse plan than
just doing a full table scan.
The optimizer should have picked the base table scan
rather than the index to base row given how the 2 performed. I think
this is another case showing the costs for the optimizer need to be
updated to reflect current technology. There has been a lot of work to
make scans go fast and that is not reflected in the current costing.
The optimizer estimated 203425 rows for the index qualification and got
271136 which seems not too bad (5% vs. 6.8%). This info comes from the
with-index.txt query plan. The estimate assumes an equal distribution of
values across the whole range, so maybe this range was a
little hotter than others. Since the row count estimate looks close I
lean toward the base costing as the problem.
It would be interesting to know how much of the index/baserow
performance issue is that it keeps getting cache misses vs. the cpu
overhead of just doing the base row lookup for every row. For this db
it would take a 50,000 page cache just to cache the base rows, plus
whatever it takes to cache the index.
For this kind of distribution I have seen db's gather all the row
pointers from the index and then do ordered probes into the base table.
This ensures good caching for the lookups. Derby does not have this
technique available.
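Just to illustrate the idea (Derby does not do this internally today),
here is a rough application-level sketch. The database name is a
placeholder, pk_column is assumed to be a BIGINT, and this only helps to
the extent that pk_column order roughly matches physical row order, e.g.
if the rows were loaded in key order:

    import java.sql.*;
    import java.util.*;

    public class OrderedProbeSketch {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:derby:testdb");

            // Pass 1: collect just the qualifying keys
            // (cheap if the index also covers pk_column).
            List<Long> keys = new ArrayList<Long>();
            Statement s = conn.createStatement();
            ResultSet rs = s.executeQuery(
                "select pk_column from the_table " +
                "where the_double_column > 300 and the_double_column < 320");
            while (rs.next()) {
                keys.add(rs.getLong(1));
            }
            rs.close();
            s.close();

            // Pass 2: probe the base table in key order rather than index
            // order, so rows that share a page are fetched together.
            Collections.sort(keys);
            PreparedStatement probe = conn.prepareStatement(
                "select * from the_table where pk_column = ?");
            for (Long key : keys) {
                probe.setLong(1, key.longValue());
                ResultSet row = probe.executeQuery();
                if (row.next()) {
                    // read the other columns here
                }
                row.close();
            }
            probe.close();
            conn.close();
        }
    }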
Tim Dudgeon wrote:
Thanks for the comments. Here is some more info.
I attach the DDL for the table concerned, the simple test program I use
and the execution strategy with and without an index.
Some additional points:
1. The query returning 7% of the table is certainly not an extreme case.
The exact query criteria are specified by the user in the UI, and can be
much worse than this case. I have no control over the nature of the
query that the user specifies.
2. Yes, if the query is much more selective the index can be a help.
3. The biggest data file in seg0 is 1452572672 bytes in size (i.e. about 1.4GB).
4. The index was added after the rows were added.
5. Making the index also have pk_column as the second indexed column
makes it go like lightning! The search runs in 2 secs, about 14x faster.
So in summary it seems like an index will be of no help to me in this
situation, unless I make it an index on both columns.
Many thanks
Tim
Tim Dudgeon wrote:
I found an interesting performance problem that I'd welcome some help
in understanding.
I have a table of 4,000,000 rows that has a DOUBLE column containing
floating point numbers.
I run a query like this:
select pk_column from the_table where the_double_column > 300 and
the_double_column < 320
271136 rows are returned.
I then go through the ResultSet and extract all the pk_column values.
All of this is done using standard JDBC.
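In outline the code is just something like this (a simplified sketch, not
the exact test program; the database name is a placeholder and I'm
treating pk_column as a BIGINT):

    import java.sql.*;
    import java.util.*;

    public class RangeQueryTest {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:derby:testdb");
            long start = System.currentTimeMillis();

            List<Long> ids = new ArrayList<Long>();
            Statement s = conn.createStatement();
            ResultSet rs = s.executeQuery(
                "select pk_column from the_table " +
                "where the_double_column > 300 and the_double_column < 320");
            while (rs.next()) {
                ids.add(rs.getLong(1));   // collect pk_column for each row
            }
            rs.close();
            s.close();
            conn.close();

            System.out.println(ids.size() + " rows in "
                + (System.currentTimeMillis() - start) + " ms");
        }
    }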
When I do this it takes 23 seconds, which I thought was not
unreasonable as a full table scan was involved and the table was
pretty big.
But of course I thought an index would help, so I added an index on
the_double_column and repeated the process. It took 720 seconds,
31x slower! I thought this was strange, but suspected it might be
because I was using default settings and the 4MB page cache was much
too small to hold the index.
So I increased the page cache size (derby.storage.pageCacheSize
property) to 10x the default (10,000 pages) and repeated the process. There was
only a very minor improvement in speed.
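(The property can go in derby.properties in the database system
directory, or be set as a JVM system property before the Derby engine
boots; the class name and database name below are the usual embedded
setup, something like:

    public class CacheSizeTest {
        public static void main(String[] args) throws Exception {
            // must be set before the Derby engine is first loaded;
            // alternatively put derby.storage.pageCacheSize=10000
            // in derby.properties
            System.setProperty("derby.storage.pageCacheSize", "10000");
            Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
            java.sql.Connection conn =
                java.sql.DriverManager.getConnection("jdbc:derby:testdb");
            // ... run the query as before ...
            conn.close();
        }
    }
)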
In all cases the memory usage, as reported by:
Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()
was really no different, and the used memory was much less than the maximum
available specified by the -Xmx setting.
Any ideas what to do?
Tim