Re: [sqlite] After deleting data from a FTS table and doing VACUUM, unwanted data remains

2020-02-25 Thread Graham Holden
Tuesday, February 25, 2020, 3:00:09 PM, Luuk wrote:

[tests snipped]

> So, the index does not grow indefinitely.

> On 25-2-2020 14:00, Graham Holden wrote:
>> It is an interesting problem. And the above is just guesswork... It would
>> be good to verify experimentally that the index really does grow
>> indefinitely

Just to avoid (more) confusion, that speculation was from Dan's email
from 2014 (I pasted his response, which quoted the original email from
andrewmo, who raised the issue; perhaps I should have added another
layer of quoting...)

IIRC (and I probably don't), I think it was found that there wasn't
any "grow indefinitely" involved. I also suspect the cyclic nature of
the post-vacuum numbers (27..52..2..27..52) is indicative of what (I
think) Dan was describing, namely that the "clean-up" isn't always as
"aggressive" as it potentially could be (a toy sketch of this follows
the quote):

>  If you then add even more data so that
> the 16th level-N b-tree is created, everything gets merged together and
> we're back in the optimal state - everything in a single b-tree. However -
> this b-tree is deemed to be a level-N+1 b-tree. Meaning that this time,
> much more data will have to be added before everything is merged together
> again.
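
As a rough illustration of why such numbers cycle rather than climb,
here is a toy model of the merge scheme Dan describes (my own sketch
in Python, with made-up parameters - not the real FTS code):

    # Toy model: each flush of the in-memory buffer writes one level-0
    # b-tree; 16 b-trees at level N merge into one b-tree at level N+1.
    MERGE_WIDTH = 16

    def flush(levels, n):
        """Simulate n flushes and any cascading merges they trigger."""
        for _ in range(n):
            levels[0] = levels.get(0, 0) + 1
            lvl = 0
            while levels.get(lvl, 0) == MERGE_WIDTH:
                levels[lvl] = 0                               # merge this level...
                levels[lvl + 1] = levels.get(lvl + 1, 0) + 1  # ...into the next
                lvl += 1

    levels = {}
    for step in range(1, 7):
        flush(levels, 100)
        print(f"after {step * 100} flushes: {sum(levels.values())} b-trees")

The live b-tree count rises, collapses when a level fills, and rises
again - the same saw-tooth shape as the 27..52..2..27..52 figures above.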

Regards,
Graham Holden



Re: [sqlite] After deleting data from a FTS table and doing VACUUM, unwanted data remains

2020-02-25 Thread Luuk

script:

#!/bin/bash

# Create the FTS5 table on the first run only.
if [ ! -f test.db ]; then
    sqlite3 test.db "CREATE VIRTUAL TABLE tab USING fts5(x)"
fi

# Import the file "wikipedia", then count the rows in the tab_data
# shadow table (which stores the FTS index segments) after each step.
sqlite3 test.db ".import wikipedia tab"
a=$(sqlite3 test.db "SELECT count(*) FROM tab_data")
echo "# records after import: $a"

sqlite3 test.db "DELETE FROM tab"
a=$(sqlite3 test.db "SELECT count(*) FROM tab_data")
echo "# records after DELETE: $a"

sqlite3 test.db "VACUUM"
a=$(sqlite3 test.db "SELECT count(*) FROM tab_data")
echo "# records after vacuum: $a"

output:

$> ./test.sh
# records after import: 15
# records after DELETE: 27
# records after vacuum: 27
$> ./test.sh
# records after import: 40
# records after DELETE: 52
# records after vacuum: 52
$> ./test.sh
# records after import: 65
# records after DELETE: 2
# records after vacuum: 2
$> ./test.sh
# records after import: 15
# records after DELETE: 27
# records after vacuum: 27
$> ./test.sh
# records after import: 40
# records after DELETE: 52
# records after vacuum: 52

So, the index does not grow indefinitely.
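
In case it is useful, FTS5 also has a documented 'optimize' command
that merges the whole index into a single b-tree. A quick way to see
its effect on the same database (a sketch in Python; it assumes the
sqlite3 module is built with FTS5 enabled):

    # Run FTS5's 'optimize' command, then re-count the rows in the
    # tab_data shadow table (where the index segments are stored).
    import sqlite3

    db = sqlite3.connect("test.db")
    db.execute("INSERT INTO tab(tab) VALUES('optimize')")
    db.commit()
    (n,) = db.execute("SELECT count(*) FROM tab_data").fetchone()
    print(f"# records after optimize: {n}")
    db.close()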

On 25-2-2020 14:00, Graham Holden wrote:

> It is an interesting problem. And the above is just guesswork... It would
> be good to verify experimentally that the index really does grow
> indefinitely


Re: [sqlite] After deleting data from a FTS table and doing VACUUM, unwanted data remains

2020-02-25 Thread Graham Holden
This might be to do with how an FTS index works under the hood,
involving various levels of "b-tree" that grow as entries are added,
but aren't always shrunk when entries are deleted.

There were a bunch of emails on the list around the 4th to the 13th of
May 2014: a sample below from Dan Kennedy (one of the SQLite devs),
which includes the original problem report and his summary of "why".

I've a feeling SOME changes MIGHT have been made to how the various
levels of b-tree were merged when entries were deleted, but I'm not
seeing anything linked to the original thread... perhaps this email
will jog a memory...

Graham


START - Mail from Dan Kennedy, 4 May 2014 - START

On 05/01/2014 03:30 PM, andrewmo wrote:
> We are using the FTS3 extension to sqlite to store large numbers of short
> (~300 byte) documents. This is working very well and providing us with very
> fast text search, but the behaviour around deletion of documents has me
> confused.
>
> Our system must control the size of the database and will delete the
> oldest documents when the database size breaches a certain limit. I now
> understand from comments on this mailing list and elsewhere that this is not
> an optimal pattern for the FTS extension as doclists for the oldest
> documents are the least likely to be 'merged'.
>
> My question is, does this actually work at all? If I delete a row from my
> FTS4 table (resulting in a new empty doclist being added to the index), then
> I subsequently add many (1000s) new documents and call the 'merge' function
> several times (automerge is also enabled), is there any guarantee that the
> empty doclist and the populated doclist that it superseded will ever be
> removed? My testing suggests this isn't the case.
>
> I have a 1GB database with 6 million documents. If I keep adding new
> documents at around 1 per second and deleting documents when the size of the
> data goes beyond 1GB, the size of the index seems to grow and the number of
> documents I can store in the 1GB file seems to decrease in a linear manner.
>
> Calling the 'optimize' function seems to solve this issue (removing all the
> dead doclists), but that isn't practical for our software, as it implies
> some downtime for our high availability service due to the long execution
> time of the optimize function (Could be minutes for a 1GB file).
>
> I have seen this
> (http://sqlite.1065341.n5.nabble.com/fts3-database-grows-td42069.html) post
> from 2008. However, it predates the 'automerge' and manual merge features,
> and from the documentation I assumed these new features would delete all the
> data related to deleted documents. Am I incorrect in my assumption?
>
> Thanks for any clarification you can offer.

Normally, when you write to an FTS index (either to add new doclists or to
add delete markers) the new entries are accumulated in-memory for a while
and then flushed to a new "level-0" b-tree. A level-0 b-tree is often
roughly 1MB in size. Once there are 16 level-0 b-trees, they are merged
and written to a single level-1 b-tree. Once there are 16 level-1 b-trees...
And so on.

So when an entry is deleted from the FTS index, a delete marker is added.
But the original doclists are not actually deleted until the delete marker
and the doclists are merged into the same b-tree. Delete markers are
discarded when they are merged into the oldest b-tree in the index.

At first glance it seems (to me) that this means the index might grow to
anything up to 16 times its "optimized" size. But I think it's actually
worse than that.

Say your entire database fits into a single level-N b-tree. You keep adding
data (and delete markers) until there are 15 level-N b-trees and almost
enough data to create the 16th in lower levels. So at this point the FTS
index is 16 times its optimal size. If you then add even more data so that
the 16th level-N b-tree is created, everything gets merged together and
we're back in the optimal state - everything in a single b-tree. However -
this b-tree is deemed to be a level-N+1 b-tree. Meaning that this time,
much more data will have to be added before everything is merged together
again.

So I'm thinking a solution might be:

   * Fix FTS so that it picks up this case - when a merge includes so many
     delete markers that the output is small enough to be deemed a level-N
     b-tree, label it level-N rather than level-N+1 - and

   * Instead of using the default 16-way merges, the app could arrange
     to periodically invoke the "merge=X,Y" command with a smaller Y value
 (say 2) to limit the maximum size of the index to Y times its optimal
 size (instead of 16 times).

It is an interesting problem. And the above is just guesswork... It would
be good to verify experimentally that the index really does grow 
indefinitely
with this kind of input before trying to "fix" anything.

Dan.

ENDS - Mail from Dan Kennedy, 4 May 2014 - ENDS
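
For reference, Dan's second suggestion uses the documented FTS3/FTS4
"special INSERT" convention. Something along these lines, run
periodically, should do it (a sketch in Python; the database and table
names here are hypothetical):

    # Nudge FTS4's incremental merge: X (here 300) is roughly how much
    # work to do in one step, Y (here 2) the minimum number of segments
    # to merge. "docs" is a hypothetical FTS4 table in docs.db.
    import sqlite3

    db = sqlite3.connect("docs.db")
    db.execute("INSERT INTO docs(docs) VALUES('merge=300,2')")
    db.commit()
    db.close()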

Tuesday, February 25, 2020, 11:52:59 AM, Matt Kloss wrote:

[original email snipped]