Re: Searching for lots of similar strings (filenames) in sqlite3 database

MRAB Tue, 01 Jul 2014 05:16:38 -0700

On 2014-07-01 12:26, Adam Funk wrote:

I have some code that reads files in a leafnode2 news spool & needs to
check for new files periodically.  The full paths are all like
'/var/spool/news/message.id/345/<123...@example.com>' with a 3-digit
subdirectory & a Message-ID for the filename itself.  I'm using Python
3 & sqlite3 in the standard library.


I have a table of filenames created with the following command:

    cursor.execute('CREATE TABLE files (filename TEXT PRIMARY KEY, used 
INTEGER)')

To check for new files in one of the subdirectories, I run A then
either B or C below (I've tried both).

A.
     listing1 = os.listdir(directory)
     listing [os.path.join(directory, x) for x in listing1]

B.
     cursor = db_conn.cursor()
     for filename in listing:
         cursor.execute('SELECT filename FROM files WHERE filename IS ?', 
(filename,))
         row = cursor.fetchone()
         if not row:
             cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0) )
             files_new += 1
     db_conn.commit()

C.
     cursor = db_conn.cursor()
     subdir_like = directory + '/%'
     cursor.execute('SELECT filename FROM files WHERE filename LIKE ?', 
(subdir_like,))
     rows = cursor.fetchall()
     known_files =  [row[0] for row in rows]
     for filename in listing:
         if filename not in known_files:
             cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0) )
             files_new += 1
     db_conn.commit()

A+B was the first method I came up with, because it looks like the
"keep it simple & let the database do its job" approach, but it was
very time-consuming, so I tested A+C out.  A is quick (a second); B
can drag on for over an hour to check 2000 filenames (for example) in
a subdirectory; C always takes less than a minute.  So C is much
better than B, but it looks (to me) like one of those attempts to
bypass & ignore the database's built-in optimizations.

Comments?

In C, 'known_files' is a list, so it performs a linear search for each
of the filenames. If you make 'known_files' a set, it'll probably be
even faster!

Anyway, I'm sure there's something in SQL for "insert or update" or "on
duplicate", but that's an SQL question, not a Python question.

--
https://mail.python.org/mailman/listinfo/python-list

Re: Searching for lots of similar strings (filenames) in sqlite3 database

Reply via email to