Hi Daniel,
I've noticed this as well and adjusted my schema patterns to
compensate, and I haven't seen issues since. For example, if I were to
create tables around a file, I'd do the following:
create table FILE(
    ID integer not null primary key,
    PATH varchar(256) not null
);
create table FILE_CONTENT(
    FILE_ID integer not null references FILE(ID),
    CONTENT blob not null
);
Basically the pattern is to isolate all binary content in a table of
its own. If you do not, and you later add a NAME column to FILE,
you will run into the same issue when you create an index on NAME.
You will also see a performance bottleneck simply iterating over a
FILE_CONTENT result set, even without reading the blobs.
If you have existing data, you will not be able to move forward without
at least one performance hit. What I'd suggest is to split the binary
data into two tables, analogous to what I've done above, and then
create the indices you need.
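For the migration itself, something like the following would do it in
one pass. This is only a sketch: OLD_FILE is a hypothetical name for
the existing combined table (renamed out of the way first), assumed to
hold the same ID and PATH columns plus the blob in a CONTENT column,
and FILE_PATH_IDX is an illustrative index name:

-- copy the metadata, then the blobs, out of the old combined table
insert into FILE(ID, PATH)
    select ID, PATH from OLD_FILE;
insert into FILE_CONTENT(FILE_ID, CONTENT)
    select ID, CONTENT from OLD_FILE;
-- indexes on the blob-free table are cheap to build
create index FILE_PATH_IDX on FILE(PATH);
drop table OLD_FILE;

After that, an index build only scans FILE's pages, which hold no blob
data, so it stays fast regardless of blob size.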
Raymond
On 11-01-10 11:11 AM, Mike Matrigali wrote:
Trejkaz wrote:
Hi all.
I have been doing some performance testing of the time it takes to
create an index on a table with BLOB data. The schema:
CREATE TABLE binary (
    id INTEGER NOT NULL,
    binary BLOB(1G) NOT NULL
)
Each time, 10,000 rows are inserted with data of a given size and then
an index is created at the end:
CREATE INDEX binary_id ON binary (id)
Times are an average of three runs after some warmup runs which are
not counted.
1 kB blobs:
Insert: 78 ms
Index: 567 ms
10 kB blobs:
Insert: 3954 ms
Index: 515 ms
100 kB blobs:
Insert: 56307 ms
Index: 20591 ms
1 MB blobs:
Insert: 521904 ms
Index: 122527 ms
Surprisingly, the larger the data which is present in the BLOB column,
the longer it takes to index. Since the indexing process shouldn't
need to read the BLOB data in order to index the row, I am surprised
that it slows down when more data is present. What is going on here
exactly? Is Derby physically copying the BLOB data to a new location
and then deleting the original copy instead of performing some kind of
cheap move operation? Is there some way to avoid this (which won't
require changing the schema)?
Daniel
This is the expected behavior. Derby base tables are very "basic",
hence the need for indexes. To build the index, the entire base table
is scanned, which means every page in the base table needs to be read
from disk. Even though we don't actually "read" the blob data when
creating the index, the system still needs to bring each page from
disk into memory to see whether it has any rows. Basically, the scan
of the base table loops from page 1 to the last page, checks whether
each page is a head page with rows, and reads those rows. The blob
data is stored in the same file as the main pages.
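As an aside, one way to see the page counts involved is Derby's
SYSCS_DIAG.SPACE_TABLE diagnostic table function (available as a
table function in recent Derby releases). A sketch, assuming the
table lives in the default APP schema:

select CONGLOMERATENAME, NUMALLOCATEDPAGES, PAGESIZE
from table (SYSCS_DIAG.SPACE_TABLE('APP', 'BINARY')) t
where ISINDEX = 0;

Comparing NUMALLOCATEDPAGES before and after moving the blobs into
their own table shows how much less the index-building scan has to
read.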