Thank you Mike,
Your explanation does make sense to me.
Now what should I do? I think there are two possible JIRA candidates:
1) A documentation issue for the Tuning Guide.
I read that guide to find a solution for my performance problem
but could not find any tip that seemed applicable. Maybe
something could be added with respect to Blob's, especially
when these Blob's are really big, compared to page size.
The trick of using a dummy where clause might be mentioned.
2) A design issue for the optimizer: always use an index
if one is available.
Do you think I should file these issues? (one, both or none).
I will anyway try to find out how to produce query plans.
If not for this problem, then for future problems. I read
about them somewhere in the documentation. Do you think they are
helpfull in this particular case? Your explanation seems quite
satisfactory to me.
The suggestion by Michael Segel (thank you too for your
contribution) to redesign the storage of LOB data separate from
normal table data is way beyond my league to comment on.
However, to me it seems that he may be right in pinning down
a real problem.
There is something else that I was wondering about. You may have noticed
the SIZE and CRC column in my blob table together with the UNIQUE
constraint. Obviously, I want to prevent duplicate blob's. What happens if
I add a UNIQUE constraint on the blob data itself? Would it be accepted?
If so, how would Derby enforce this constraint? Comparing all the data from
start to finish? I hesitate to try, because I am afraid that when I add a
blob, that
some process may be started that will run virtually for ever.
Kind regards,
Piet Blok
----- Original Message -----
From: "Mike Matrigali" <[EMAIL PROTECTED]>
To: "Derby Discussion" <[email protected]>
Sent: Wednesday, July 12, 2006 7:40 PM
Subject: Re: Strange performance issue with BLOB's
I think the following is what is going on. The blobs are not
really instantiated in memory in one sense, but the entire
container is read through the cache a page at a time.
The base container of a derby table is a sequence of pages. Derby
stores blob data in the same container as the rest of the data,
in a page link chain. There is a map of pages used only for
space allocation, not page types. So in the scan case Derby
visits every page in the container and then checks whether each
page is a "main" page or not. Main pages are returned up the
stack to access to process each row. In the case of your
query no blob data is actually accessed so "access" never requests
the column data for the blob data - but the whole table is
still read through the cache. There is no support in the base
container implementation to "skip" to the next main page - that
is what indexes provide.
This is an interesting case for the optimizer. My guess is that
the optimizer never considers using an index if there is no
where clause unless the index "covers" (includes all the columns
of the query). I am not sure how hard it would be to get the
optimizer to consider using the index in this case, and I am
not sure if the existing costing handles this blob case - but
from your experiment is sort of looks like the costing is right
as the optimizer is choosing the index when you added the
dummy where - I am sure the costing of that where indicated that
it would scan the entire index.
With performance issues like this a lot is made much more obvious
if you can include the query plan of the 2 queries.
Piet Blok wrote:
Hi all,
When experimenting with BLOB's I ran into a performance issue
that I cannot completely explain, but it could be a bug.
Given the following table:
CREATE TABLE BLOB_TABLE (
BLOB_ID BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1,
INCREMENT BY 1),
BLOB_SIZE BIGINT NOT NULL,
BLOB_CRC BIGINT NOT NULL,
BLOB_NAME VARCHAR(255) NOT NULL,
BLOB_DATA BLOB(2G) NOT NULL,
UNIQUE (BLOB_CRC, BLOB_SIZE),
PRIMARY KEY (BLOB_ID)
);
which is populated with 27 rows,
where the sum of all BLOB sizes is 5,885,060,164 bytes
(about 200 MB average per BLOB, but ranging from 10 MB to 750 MB).
Some queries on this table are executed really
fast (almost instantaneous response).
However, the following query needs about 10 minutes to complete:
SELECT BLOB_ID, BLOB_NAME, BLOB_SIZE, BLOB_CRC FROM BLOB_TABLE;
I reasoned that maybe Derby is scanning the whole table
(including the blob contents) so I tried to add a dummy WHERE
clause (dummy because all BLOB_ID's are greater than 0)
to offer a clue as to what rows (all of course) are needed,
as follows
SELECT BLOB_ID, BLOB_NAME, BLOB_SIZE, BLOB_CRC FROM BLOB_TABLE WHERE
BLOB_ID > 0;
and that helped: instantaneous response.
But I really think that the original query,
without the where clause, should not be any slower.
I am using Derby 10.1.3.1 embedded, Windows XP and Sun Java 1.5.0_06.
Both queries executed with a Statement, not a PreparedStatement.
Kind regards,
Piet Blok