On 2014-07-01, Chris Angelico wrote: > On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk <a24...@ducksburg.com> wrote: >> cursor.execute('SELECT filename FROM files WHERE filename IS ?', >> (filename,)) > > Shouldn't this be an equality check rather than IS, which normally I'd > expect to be "IS NULL" or "IS NOT NULL"?
Oh, it probably should be in "heavy" SQL. In SQLite, '==', '=', & 'IS' are interchangeable. http://www.tutorialspoint.com/sqlite/sqlite_operators.htm Looking at that page again, I see that 'GLOB' is a case-sensitive version of 'LIKE'. I can't help but wonder if that makes it faster. ;-) > As to your actual question: Your two database lookups are doing > distinctly different things, so there's no surprise that they perform > very differently. B asks the database "Do you have this? Do you have > this?" for every file you have, and C asks the database "What do you > have?", and then comparing that against the list of files. By the way > - the A+C technique could be done quite tidily as a set difference: > > # assume you have listing1 and cursor set up > # as per your above code > listing = {os.path.join(directory, x) for x in listing1} > cursor.execute(...) # as per above > known_files = {row[0] for row in cursor} # cursors are iterable > needed_files = listing - known_files > cursor.executemany('INSERT INTO files VALUES (?, ?)', ((filename, > 0) for filename in needed_files)) Oh, even better: add_files = listing - known_files delete_files = known_files - listing and then I can remove files that have disappeared off the spool from the table. Thanks very much! > Anyway. The significant thing is the performance of the database on > two different workloads: either "give me everything that matches this > pattern" (where the pattern ends with a percent sign), or "do you have > this? do you have this? do you have this?". Generally, database > indexing is fairly efficient at handling prefix searches, so the first > query will basically amount to an index search, which is a lot faster > than the repeated separate searching; it takes advantage of the fact > that all the strings you're looking at will have the same prefix. > > There is one critical consideration, though. What happens if the > directory name contains an underscore or percent sign? Or can you > absolutely guarantee that they won't? You may need to escape them, and > I'm not sure how SQLite handles that. (Possibly \_ will match literal > _, and \\ will match literal \, or something like that.) I can guarantee that the directory names are all '/var/spool/news/message.id/' then 3 digits. (The filenames are pretty wild, since they are MIDs.) AIUI, using the '?' substitution in the sqlite3 library is supposed to be safe. > This is not bypassing the database's optimization; in fact, it's > working tidily within it. That's reassuring! ... > But doing the set difference in Python is just as good a way of doing the job. I like it. Thanks very much. -- Specifications are for the weak & timid! --- Klingon Programmer's Guide -- https://mail.python.org/mailman/listinfo/python-list