On Tue, 30 Apr 2013 23:56:35 -0700 (PDT) Paul Vercellotti <pverce...@yahoo.com> wrote:
> We've got some trouble with FTS4 queries taking too long in which > we're looking for a subset of matching records in the FTS table, as > narrowed by a non-FTS table. > {details snipped} I've recently had to work through a similar issue. My worst case searching performance wasn't as bad as you're describing but was not "optimal" for our needs, but I had other problems. One was the size of my full text index, which could include millions of rows of email message bodies and various textual header lines. "Pure" inserts into the index weren't too bad, but anytime the FTS b-tree structures got "too big" and needed to be merged, sqlite could enter a really slow state allocating tons of memory during the merge when a single inserted row triggered the behavior. By "really slow state" I mean in excess of twenty minutes processing after allocating a single chunk of memory hundreds of megabytes large. When the full text index grew large enough, it would eventually fail to merge when it could not allocate a sufficiently large block of data (in excess of 200 megabytes, likely due to a highly fragmented heap; the nature of the application and the team that has developed it means that the heap is fragmented, and I can't do much to reduce that fragmentation, even though otherwise plenty of memory is available, without re-writing significant other pieces of code; I want to do that eventually, but it is not an option at the moment). It occurred to me that because I needed to only do a full text search on a subset of the full text index (which is not natively possible because of the nature of the FTS subsystem) that an "ideal" workaround would be to create a separate full text index for each "sectionID" of my other table. So instead of just having a single ftstable, I would have "ftstable_1", "ftstable_2", ... "ftstable_x" where x was my maximum sectionID. I tried that solution and it unfortunately does not scale. While there is not a limit on the number of tables you can have (beyond the maximum size of the database, I assume), adding new tables to a database really gets slow as the count of tables increases. After several thousand tables have been added (realizing that a single virtual FTS table results in 3 to 5 physical shadow tables being created as well), it is visibly slower to continue adding more tables. I have to assume (though that is dangerous) that preparing statements to interact with those tables is also going to be slower, but just the lack of speed creating tables was enough for me to avoid that solution, so I never tested how fast it was to interact with that large a set of tables. In my test, I was working with a "schema" that needed so support approximately 30000 sectionIDs, and that number varies with the exact nature of the users data. After some more thinking on the subject, I realized that I didn't necessarily need to have one full text index per sectionID. It would be sufficient for my needs to split my full text index into individual buckets. I took my sectionID modulus some number, and used that remainder as part of the name of an ftstable that only includes a subset of all full text index data. The right number for you to use is going to depend on the nature of your data, but for my purposes I used 53. It is a prime number that tends to avoid any weird clumping of data that would cause grossly imbalanced full text index buckets, and instead of tens of thousands of tables in the database, I only wind up with several hundred. This approach kept the size of the individual full text indexes down to a more manageable level and gave me the following advantages: 1. Each single full text index is only roughly 2% of the size of the original monolithic full text index, so inserts that trigger merging behavior never exhibit the original worst case performance. This means that I could support far more messages (theoretically) before suffering failure due to heap fragmentation (probably), but with the current sizes of datasets that I need to work with there is no problem. 2. Matching search terms in the full text index is now faster. It is still not optimal because while I only need the matches for a single sectionID, I will get matches for multiple sections due to the bucket approach. Since the average query will return about 2% of the result set size as before, performance is greatly enhanced. One final note: for our purposes, we wound up moving our full text indexes out into a separate database. That is not strictly necessary, but it helped improve other performance issues related to database contention by allowing one thread to write to the original database that included the sectionID table portions, and another thread to build the full text index buckets. By minimizing write contention between the two threads, we were able to insert more data per second and decrease the time to build the SQLite databases. That may or may not be useful to you. SDR _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users