I have some code that reads files in a leafnode2 news spool & needs to check for new files periodically. The full paths are all like '/var/spool/news/message.id/345/<123...@example.com>' with a 3-digit subdirectory & a Message-ID for the filename itself. I'm using Python 3 & sqlite3 in the standard library.
I have a table of filenames created with the following command:

    cursor.execute('CREATE TABLE files (filename TEXT PRIMARY KEY, used INTEGER)')

To check for new files in one of the subdirectories, I run A and then either B or C below (I've tried both).

A.

    listing1 = os.listdir(directory)
    listing = [os.path.join(directory, x) for x in listing1]

B.

    cursor = db_conn.cursor()
    for filename in listing:
        cursor.execute('SELECT filename FROM files WHERE filename IS ?',
                       (filename,))
        row = cursor.fetchone()
        if not row:
            cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0))
            files_new += 1
    db_conn.commit()

C.

    cursor = db_conn.cursor()
    subdir_like = directory + '/%'
    cursor.execute('SELECT filename FROM files WHERE filename LIKE ?',
                   (subdir_like,))
    rows = cursor.fetchall()
    known_files = [row[0] for row in rows]
    for filename in listing:
        if filename not in known_files:
            cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0))
            files_new += 1
    db_conn.commit()

A+B was the first method I came up with, because it looks like the "keep it simple & let the database do its job" approach, but it was very time-consuming, so I tried A+C as well. A is quick (about a second); B can drag on for over an hour to check 2000 filenames (for example) in a subdirectory; C always takes less than a minute. So C is much better than B, but it looks (to me) like one of those attempts to bypass & ignore the database's built-in optimizations. Comments?

-- 
No sport is less organized than Calvinball!
-- 
https://mail.python.org/mailman/listinfo/python-list
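P.S. For anyone who wants to experiment, here is a stripped-down, self-contained sketch of both approaches against an in-memory database. The directory path and the made-up Message-ID filenames are just placeholders (not my real spool), and this version uses `=` rather than `IS` in the per-row query:

```python
import os
import sqlite3

db_conn = sqlite3.connect(':memory:')
cursor = db_conn.cursor()
cursor.execute('CREATE TABLE files (filename TEXT PRIMARY KEY, used INTEGER)')

# Pretend these files are already known to the database.
directory = '/var/spool/news/message.id/345'
known = [os.path.join(directory, '<old%d@example.com>' % i) for i in range(3)]
cursor.executemany('INSERT INTO files VALUES (?, ?)', [(f, 0) for f in known])
db_conn.commit()

# A fresh directory listing: the known files plus two new ones.
listing = known + [os.path.join(directory, '<new%d@example.com>' % i)
                   for i in range(2)]

def new_files_b(listing):
    # Approach B: one SELECT per filename, let the database do the lookup.
    found = []
    for filename in listing:
        cursor.execute('SELECT filename FROM files WHERE filename = ?',
                       (filename,))
        if cursor.fetchone() is None:
            found.append(filename)
    return found

def new_files_c(listing, directory):
    # Approach C: fetch all known filenames for the subdirectory in one
    # LIKE query, then do the membership tests in Python.
    cursor.execute('SELECT filename FROM files WHERE filename LIKE ?',
                   (directory + '/%',))
    known_files = {row[0] for row in cursor.fetchall()}
    return [f for f in listing if f not in known_files]

print(new_files_b(listing))
print(new_files_c(listing, directory))
```

Both functions report the same two new files on this toy data; only the timing behaviour differs on a real spool.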