Hi Daniel,
I've noticed this as well and adjusted my schema patterns to
compensate, and I haven't seen issues since. For example, if I were to
create tables around a file, I'd do the following:
create table FILE(
    ID integer not null primary key,
    PATH varchar(256) not null
);
create table FILE_CONTENT(
    FILE_ID integer not null references FILE(ID),
    CONTENT blob not null
);
Basically the pattern is to isolate all binary content in a table of
its own. If you do not, and you later add a NAME column to FILE,
you will run into the same issue when you create an index on NAME.
You will also see a performance bottleneck simply iterating over a
FILE_CONTENT result set, even without reading the blobs.
If you have existing data, you will not be able to move forward without
at least one performance hit. What I'd suggest is to split the binary
data into two tables, analogous to what I've done above, and then
create the indices you need.
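For the migration itself, something like the following would do it in
one pass. This is only a sketch: OLD_FILE is a hypothetical name for
the existing combined table (renamed out of the way first), assumed to
hold the same ID and PATH columns plus the blob in a CONTENT column,
and FILE_PATH_IDX is an illustrative index name:

-- copy the metadata, then the blobs, out of the old combined table
insert into FILE(ID, PATH)
    select ID, PATH from OLD_FILE;
insert into FILE_CONTENT(FILE_ID, CONTENT)
    select ID, CONTENT from OLD_FILE;
-- indexes on the blob-free table are cheap to build
create index FILE_PATH_IDX on FILE(PATH);
drop table OLD_FILE;

After that, an index build only scans FILE's pages, which hold no blob
data, so it stays fast regardless of blob size.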
Raymond
On 11-01-10 11:11 AM, Mike Matrigali wrote:
Trejkaz wrote:
Hi all.
I have been doing some performance testing of the time it takes to
create an index on a table with BLOB data. The schema:
CREATE TABLE binary (
    id INTEGER NOT NULL,
    binary BLOB(1G) NOT NULL
)
Each time, 10,000 rows are inserted with data of a given size and then
an index is created at the end:
CREATE INDEX binary_id ON binary (id)
Times are an average of three runs after some warmup runs which are
not counted.
1 kB blobs:
Insert: 78 ms
Index: 567 ms
10 kB blobs:
Insert: 3954 ms
Index: 515 ms
100 kB blobs:
Insert: 56307 ms
Index: 20591 ms
1 MB blobs:
Insert: 521904 ms
Index: 122527 ms
Surprisingly, the larger the data which is present in the BLOB column,
the longer it takes to index. Since the indexing process shouldn't
need to read the BLOB data in order to index the row, I am surprised
that it slows down when more data is present. What is going on here
exactly? Is Derby physically copying the BLOB data to a new location
and then deleting the original copy instead of performing some kind of
cheap move operation? Is there some way to avoid this (which won't
require changing the schema)?
Daniel
This is the expected behavior. Derby base tables are very "basic",
hence the need for indexes. To build the index, the entire base table
is scanned, which means every page in the base table needs to be read
from disk. Even though we don't actually "read" the blob data when
creating the index, the system still needs to bring each page from
disk into memory to see whether it has any rows. Basically, the scan
of the base table loops from page 1 to the last page, checks whether
each page is a head page with rows, and reads those rows. The blob
data is stored in the same file as the main pages.
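As an aside, one way to see the page counts involved is Derby's
SYSCS_DIAG.SPACE_TABLE diagnostic table function (available as a
table function in recent Derby releases). A sketch, assuming the
table lives in the default APP schema:

select CONGLOMERATENAME, NUMALLOCATEDPAGES, PAGESIZE
from table (SYSCS_DIAG.SPACE_TABLE('APP', 'BINARY')) t
where ISINDEX = 0;

Comparing NUMALLOCATEDPAGES before and after moving the blobs into
their own table shows how much less the index-building scan has to
read.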