On Tue, 30 Apr 2013 23:56:35 -0700 (PDT)
Paul Vercellotti <pverce...@yahoo.com> wrote:

> We've got some trouble with FTS4 queries taking too long in which
> we're looking for a subset of matching records in the FTS table, as
> narrowed by a non-FTS table.
> {details snipped}

I've recently had to work through a similar issue. My worst-case search
performance wasn't as bad as what you're describing, though it still wasn't
"optimal" for our needs, and I had other problems besides. One was the size
of my full text index, which could include millions of rows of email message
bodies and various textual header lines. "Pure" inserts into the index
weren't too bad, but whenever the FTS b-tree structures got "too big" and
needed to be merged, sqlite could enter a really slow state, allocating huge
amounts of memory during the merge even though only a single inserted row
had triggered the behavior. By "really slow state" I mean in excess of
twenty minutes of processing after allocating a single chunk of memory
several hundred megabytes in size. Once the full text index grew large
enough, the merge would eventually fail outright because it could not
allocate a sufficiently large block (in excess of 200 megabytes), most
likely due to a highly fragmented heap. The nature of the application and
the way it has been developed mean the heap is badly fragmented, and I can't
do much to reduce that fragmentation without rewriting significant other
pieces of code, even though plenty of memory is otherwise available. I want
to do that rewrite eventually, but it is not an option at the moment.

It occurred to me that, since I only needed to do a full text search on a
subset of the full text index (which is not natively possible because of
the way the FTS subsystem works), an "ideal" workaround would be to create
a separate full text index for each "sectionID" of my other table. Instead
of a single ftstable, I would have "ftstable_1", "ftstable_2", ...
"ftstable_x", where x was my maximum sectionID. I tried that solution and
unfortunately it does not scale. While there is no limit on the number of
tables you can have (beyond the maximum size of the database, I assume),
adding new tables to a database gets noticeably slower as the table count
grows. After several thousand tables had been added (bearing in mind that a
single virtual FTS table also creates 3 to 5 physical shadow tables), it
was visibly slower to keep adding more. I have to assume (though that is
dangerous) that preparing statements against that many tables would also be
slower, but the slow table creation alone was enough for me to abandon the
approach, so I never tested query speed against that large a set of tables.
In my test, I was working with a "schema" that needed to support
approximately 30000 sectionIDs, and that number varies with the exact
nature of the user's data.

After some more thinking on the subject, I realized that I didn't
necessarily need one full text index per sectionID. It was sufficient for my
needs to split the full text index into buckets: I take the sectionID modulo
some number and use the remainder as part of the name of an ftstable that
holds only that subset of the full text index data (a rough sketch follows
the list below). The right divisor will depend on the nature of your data;
for my purposes I used 53. A prime bucket count tends to avoid weird
clumping of data that would produce grossly imbalanced buckets, and instead
of tens of thousands of tables in the database, I only wind up with several
hundred. This approach kept the size of each individual full text index down
to a more manageable level and gave me the following advantages:

1. Each full text index bucket is only roughly 2% of the size of the
original monolithic index, so inserts that trigger merging never exhibit the
original worst-case performance. In theory this lets me support far more
messages before (probably) hitting the heap fragmentation failure, and with
the dataset sizes I currently need to work with there is no problem at all.

2. Matching search terms in the full text index is now faster. It is still
not optimal, because while I only need the matches for a single sectionID, a
bucket returns matches for every sectionID it holds. But since the average
query now touches only about 2% of the data the monolithic index held,
performance is greatly improved.
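Here is the rough sketch I promised above (Python with sqlite3). The
53-bucket modulo scheme is exactly as described; storing the sectionID as a
column in the bucket table is just one way to do the filtering (joining back
to the real sectionID table by docid works just as well), and the column and
file names are assumptions of mine:

    import sqlite3

    NUM_BUCKETS = 53   # prime bucket count, as described above

    conn = sqlite3.connect("messages.db")

    def bucket_table(section_id):
        # sectionID modulo 53 picks the full text index bucket for that section
        return "ftstable_%d" % (section_id % NUM_BUCKETS)

    # One FTS4 table per bucket: several hundred tables instead of ~30000.
    for b in range(NUM_BUCKETS):
        conn.execute(
            "CREATE VIRTUAL TABLE IF NOT EXISTS ftstable_%d "
            "USING fts4(section_id, body)" % b
        )
    conn.commit()

    def insert_message(docid, section_id, body):
        with conn:
            conn.execute(
                "INSERT INTO %s(docid, section_id, body) VALUES (?, ?, ?)"
                % bucket_table(section_id),
                (docid, section_id, body),
            )

    def search(section_id, terms):
        # MATCH uses the bucket's full text index; the section_id predicate
        # then discards hits that belong to other sections in the same bucket.
        return conn.execute(
            "SELECT docid FROM %s WHERE body MATCH ? AND section_id = ?"
            % bucket_table(section_id),
            (terms, section_id),
        ).fetchall()

The point is that any given query only ever touches the one bucket its
sectionID maps to, which is what keeps the per-query work at roughly 2% of
the old monolithic index.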

One final note: for our purposes, we wound up moving the full text indexes
out into a separate database file. That is not strictly necessary, but it
helped with other performance problems related to database contention: one
thread writes to the original database containing the sectionID tables while
another thread builds the full text index buckets. By minimizing write
contention between the two threads, we were able to insert more data per
second and cut the time needed to build the SQLite databases. That may or
may not be useful to you.
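For what it's worth, a minimal sketch of that split (Python with sqlite3;
the file names, schema, and single example bucket are hypothetical): each
thread owns its own connection to its own database file, so the two writers
never compete for the same write lock.

    import sqlite3
    import threading

    def write_main(rows):
        # One thread owns the original database with the sectionID tables.
        db = sqlite3.connect("main.db")
        db.execute("CREATE TABLE IF NOT EXISTS messages("
                   "id INTEGER PRIMARY KEY, section_id INTEGER, subject TEXT)")
        with db:
            db.executemany(
                "INSERT INTO messages(id, section_id, subject) VALUES (?, ?, ?)",
                rows)
        db.close()

    def write_fts(rows):
        # A second thread owns the separate full text database.
        db = sqlite3.connect("fts.db")
        db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS ftstable_0 "
                   "USING fts4(body)")
        with db:
            db.executemany(
                "INSERT INTO ftstable_0(docid, body) VALUES (?, ?)", rows)
        db.close()

    t1 = threading.Thread(target=write_main, args=([(1, 7, "hello")],))
    t2 = threading.Thread(target=write_fts, args=([(1, "hello world body")],))
    t1.start(); t2.start()
    t1.join(); t2.join()

At read time the full text file can be ATTACHed to the main connection, so
queries can still join bucket results against the sectionID tables.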

SDR