Hi Daniel,

I've noticed this as well and adjusted my schema patterns to compensate, and I haven't noticed issues since. For example, if I were to create tables around a file, I'd do the following:

create table FILE(
    ID integer not null,
    PATH varchar(256) not null
);
create table FILE_CONTENT(
    FILE_ID integer not null references FILE(ID),
    CONTENT blob not null
);

Basically, the pattern revolves around isolating all binary content in its own table. If you don't, and you later add a NAME column to FILE, you will run into the same issue as soon as you create an index on NAME.
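To make the payoff concrete, here is a small sketch of the split-schema pattern. It uses Python's sqlite3 as a stand-in for Derby (the table and index names follow my example above, but the exact types and syntax are illustrative, not Derby-specific):

```python
import sqlite3

# sqlite3 used as a stand-in for Derby; the schema mirrors the
# FILE / FILE_CONTENT split shown above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute(
    "CREATE TABLE file ("
    " id INTEGER NOT NULL PRIMARY KEY,"
    " path VARCHAR(256) NOT NULL)")
cur.execute(
    "CREATE TABLE file_content ("
    " file_id INTEGER NOT NULL REFERENCES file(id),"
    " content BLOB NOT NULL)")

# Later schema evolution: add NAME to the narrow FILE table and index
# it. The wide FILE_CONTENT table (and its blob pages) is never scanned.
cur.execute("ALTER TABLE file ADD COLUMN name VARCHAR(256)")
cur.execute("CREATE INDEX file_name ON file (name)")

cur.execute(
    "INSERT INTO file (id, path, name) VALUES (1, '/tmp/a.bin', 'a.bin')")
cur.execute(
    "INSERT INTO file_content (file_id, content) VALUES (1, ?)",
    (b"\x00" * 1024,))
conn.commit()
```

The key point is that index maintenance and index builds only ever touch FILE, which stays small no matter how much binary data accumulates in FILE_CONTENT.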

You will also see a performance bottleneck simply by iterating over a FILE_CONTENT result set, even without reading the blobs.

If you have existing data, you will not be able to move forward without at least one performance hit. What I'd suggest is to split the binary data out into two tables, analogous to what I've done above, then create the indices you need.
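A one-time migration along those lines might look like the sketch below, again using sqlite3 as a stand-in for Derby. The table names (file_old, etc.) are invented for illustration; in Derby you would use the same INSERT ... SELECT approach, though the exact DDL for removing the old table or column differs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Existing single-table layout with the blob inline (hypothetical names).
cur.execute(
    "CREATE TABLE file_old ("
    " id INTEGER NOT NULL,"
    " path VARCHAR(256) NOT NULL,"
    " content BLOB NOT NULL)")
cur.executemany(
    "INSERT INTO file_old VALUES (?, ?, ?)",
    [(i, f"/data/{i}", b"x" * 100) for i in range(100)])

# One-time migration: copy into the two-table layout, drop the original.
# This is the single unavoidable performance hit.
cur.execute(
    "CREATE TABLE file ("
    " id INTEGER NOT NULL PRIMARY KEY,"
    " path VARCHAR(256) NOT NULL)")
cur.execute(
    "CREATE TABLE file_content ("
    " file_id INTEGER NOT NULL REFERENCES file(id),"
    " content BLOB NOT NULL)")
cur.execute("INSERT INTO file (id, path) SELECT id, path FROM file_old")
cur.execute(
    "INSERT INTO file_content (file_id, content)"
    " SELECT id, content FROM file_old")
cur.execute("DROP TABLE file_old")

# From here on, indexes on FILE never touch blob pages.
cur.execute("CREATE INDEX file_path ON file (path)")
conn.commit()
```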

Raymond

On 11-01-10 11:11 AM, Mike Matrigali wrote:
Trejkaz wrote:
Hi all.

I have been doing some performance testing of the time it takes to
create an index on a table with BLOB data.  The schema:

    CREATE TABLE binary (
        id INTEGER NOT NULL,
        binary BLOB(1G) NOT NULL
    )

Each time, 10,000 rows are inserted with data of a given size and then
an index is created at the end:

    CREATE INDEX binary_id ON binary (id)

Times are an average of three runs after some warmup runs which are not counted.

1 kB blobs:
  Insert: 78 ms
  Index: 567 ms

10 kB blobs:
  Insert: 3954 ms
  Index: 515 ms

100 kB blobs:
  Insert: 56307 ms
  Index: 20591 ms

1 MB blobs:
  Insert: 521904 ms
  Index: 122527 ms

Surprisingly, the larger the data which is present in the BLOB column,
the longer it takes to index.  Since the indexing process shouldn't
need to read the BLOB data in order to index the row, I am surprised
that it slows down when more data is present.  What is going on here
exactly?  Is Derby physically copying the BLOB data to a new location
and then deleting the original copy instead of performing some kind of
cheap move operation?  Is there some way to avoid this (which won't
require changing the schema)?

Daniel

This is the expected behavior. Derby base tables are very "basic", hence the need for indexes. To build the index, the system scans the entire base table, which means every page in the base table must be read from disk. Even though we don't actually "read" the blob data when creating the index, the system still has to bring each page from disk into memory to see whether it has any rows. Basically, the scan loops from page 1 to the last page, checks whether each page is a head page with rows, and reads those rows. The blob data is stored in the same file as the main pages.
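The scan described above can be sketched as a toy model (not Derby's actual code): the cost of building the index is proportional to the total number of pages in the file, blob overflow pages included, not to the number of rows. The page layout below is invented purely for illustration:

```python
# Toy model of the base-table scan: every page is "read from disk" in
# file order; index keys are collected only from head pages with rows.
def scan_for_index(pages):
    keys = []
    pages_read = 0
    for page in pages:              # page 1 .. last page, in file order
        pages_read += 1             # each page costs one disk read
        if page["kind"] == "head":  # head page: holds row data, incl. id
            keys.extend(row["id"] for row in page["rows"])
        # "blob" pages carry only blob bytes; there is nothing to index,
        # but the read has already happened by the time we can tell.
    return keys, pages_read

# 3 rows; each large blob spills onto 256 overflow pages in the same file.
pages = []
for rid in range(3):
    pages.append({"kind": "head", "rows": [{"id": rid}]})
    pages.extend({"kind": "blob"} for _ in range(256))

keys, pages_read = scan_for_index(pages)
```

With 3 rows the model reads 771 pages; only 3 of those reads yielded an index key, which is why index-build time grows with blob size even though the blob bytes themselves are never examined.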
