Re: [sqlite] Improve query performance using a join
On 19 Aug 2014, at 8:25am, Paul Dillon wrote:

> I was using count(first_field) instead of selecting all the fields, can't
> imagine that could be the problem.

There's an optimization in SQLite which means you can do COUNT(*) and it will fetch no data at all. It's faster than doing any COUNT(specific_field).

Simon.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
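The COUNT(*) optimization is easy to see for yourself; below is a minimal sketch using Python's sqlite3 module (table and column names are made up for illustration). Note the two forms are also not interchangeable: COUNT(first_field) skips NULLs, while COUNT(*) counts every row.

```python
import sqlite3

# In-memory database with an illustrative table; names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (first_field INTEGER, other TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, "x") for i in range(1000)])
conn.execute("INSERT INTO t VALUES (NULL, 'y')")  # one NULL row

# COUNT(*) can be answered without fetching any column data at all,
# whereas COUNT(first_field) must examine each row to skip NULLs.
n_star = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
n_col = conn.execute("SELECT COUNT(first_field) FROM t").fetchone()[0]
print(n_star, n_col)  # 1001 1000
```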
Re: [sqlite] Improve query performance using a join
Jim Callahan jim.callahan.orlando at gmail.com wrote:

> 1. My condolences; with those dimensions you are heading for "big
> data"/hadoop land.

Heheh, thanks, I am so determined not to go there! SQLite has been such a nice simple database to use, I will do anything to avoid all the complexity of those map-reduce solutions.

> 2. Worry about the number of rows; that's what feeds into the big-oh: O(n).
> Assuming your 150 columns translate into a 1.5k to 2k record length, that
> means your 300 gigabyte file must have 150 (300 gig/2k) to 200 (300
> gig/1.5k) million records. That's a lot of n for O(n).

I've got about 200 million rows of 1,100 bytes each. At least it's linear. I would love to partition the rows into 4 databases and query them in parallel. But that's just a fantasy; I can live with it for now.

> So, I would recommend double-checking the specs of your SSD and not
> necessarily making the reasonable, but not yet true, assumption that
> solid-state has to be faster than mechanical.

I use an SSD for other reasons. However, I've tested sustained sequential read and I am getting 550MB/s. I'm doing full table scans, so the access pattern is sequential. This is on a laptop, and I am fairly sure there are no mechanical 2.5" HDDs that can transfer 200GB at 550MB/s.

> One strategy that might work is to have an entirely separate (not joined) 8
> column table; develop queries (targets) on that database and then write out
> the primary keys of the rows you are interested in to a separate table
> (CREATE TABLE AS SELECT primarykey FROM 8columntable WHERE yourquery;) and
> then JOIN the row-reduced table to the main table. If your row-reduced table
> has millions of fewer records (primary keys), that's millions of rows (in
> the main 150 column table) where the precompiled SQL query doesn't have to
> be executed.

Cheers for that, I'll do some testing and see how I go!

Thanks,
Paul
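For what it's worth, the partitioning idea above is not far-fetched: one query can be fanned out over several shard files with a thread pool. A rough sketch with Python's sqlite3 (the shard layout, table, and column names are all made up for illustration):

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Build four shard databases, each holding a slice of the rows.
tmp = tempfile.mkdtemp()
shards = [os.path.join(tmp, "shard%d.db" % i) for i in range(4)]
for i, path in enumerate(shards):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, a INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     [(pk, pk % 10) for pk in range(i * 25, (i + 1) * 25)])
    conn.commit()
    conn.close()

def count_matching(path):
    # One connection per worker: sqlite3 connections are not meant to
    # be shared across threads by default.
    conn = sqlite3.connect(path)
    (n,) = conn.execute("SELECT COUNT(*) FROM t WHERE a = 3").fetchone()
    conn.close()
    return n

# Fan the same query out over all shards and combine the partial results.
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    total = sum(pool.map(count_matching, shards))
print(total)
```

Whether this actually beats a single sequential scan depends on how many parallel reads the drive can sustain, so it would need measuring, not assuming.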
Re: [sqlite] Improve query performance using a join
Thanks to all who replied to my post, and sorry for the delayed response. I had trouble joining the list and had not realised that my post went through until I tried to join the list again.

Simon Slavin slavins at bigfraud.org wrote:

> What you might find increases your speed is to make sure that those 8 columns
> are the first 8 columns listed in your table definition (after the primary
> key, if any). Each time you have SQLite read the columns for a row it has to
> read the whole row up to the last column it needs. So if all the unwanted
> [...]

Thanks, intriguing idea. Unfortunately it made no difference in my situation. I made sure the table was too large to be cached, and I was using count(first_field) instead of selecting all the fields, so I can't imagine that could be the problem.

> Lastly, do not fall into the trap of premature optimization. You should not
> be looking for your program to run "as fast as possible". You should be
> looking for it to run "acceptably fast". You can spend 20 hours of
> programming to improve runtime by 1 minute -- a minute that your users
> wouldn't care about because they always go make a cup of coffee during a
> run anyway.

Yeah, I am wary of that. The queries take about 20 minutes, which really disrupts my workflow. If I spend a few hours on this and it works, it'll save me hundreds of hours in the long run. I like to collect anecdotal performance advice and implement it when the cost of doing so is not much more than the cost of not doing it. I have some evidence that the approach has worked for me in the past.

Thanks,
Paul
Re: [sqlite] Improve query performance using a join
On Thu, Aug 7, 2014 at 5:51 AM, Paul Dillon wrote:

> 1. Will moving these 8 query fields to a smaller table improve query
> performance when joined to the larger table? My logic is that this small
> table would only be about 5% the size of the full table, so the full table
> scan to service the query might be faster.

An alternative that could let you cheaply evaluate whether having the smaller 8-column table would help is to index those 8 columns. See 1.7 Covering Indices from http://www.sqlite.org/queryplanner.html.

Yes, you'd duplicate the data, and your DB would grow by 5% by your own admission, but then SQLite would be able to "route" queries selecting from that 8-column subset to the index and not read the table at all, and read the full table for other queries, transparently for you. (Run ANALYZE to get good stats, which will take a long time, and then verify using EXPLAIN QUERY PLAN, http://www.sqlite.org/eqp.html, whether queries access the index, the table, or both.)

If you adjust your cache size to match or exceed the index size, you might be able to avoid disk IO altogether later, if all of your queries stayed within that subset, although it sounded like you load everything upfront, which implies you don't query much later.

The advantage of the above is that it requires very little effort from you, just a CREATE INDEX basically, to test that idea and see if it helps. --DD
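The covering-index suggestion can be tried in a few lines. A sketch using Python's sqlite3 (the table here is a tiny stand-in for the 150-column one, and all names are illustrative); the exact plan text varies between SQLite versions, but it should mention the covering index when the index alone can answer the query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Tiny stand-in for the wide table: a couple of queried columns (a, b)
# plus a payload that most queries never touch.
conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, a, b, payload)")
conn.executemany("INSERT INTO big VALUES (?, ?, ?, ?)",
                 [(i, i % 5, i % 7, "blob") for i in range(100)])

# A covering index over the queried columns: SQLite can answer queries
# touching only (a, b) from the index without reading the table at all.
conn.execute("CREATE INDEX idx_cover ON big(a, b)")
conn.execute("ANALYZE")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM big WHERE a = 1"
).fetchall()
for row in plan:
    print(row[-1])  # e.g. "SEARCH big USING COVERING INDEX idx_cover (a=?)"
```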
Re: [sqlite] Improve query performance using a join
A few observations:

1. My condolences; with those dimensions you are heading for "big data"/hadoop land.

2. Worry about the number of rows; that's what feeds into the big-oh: O(n). Assuming your 150 columns translate into a 1.5k to 2k record length, that means your 300 gigabyte file must have 150 (300 gig/2k) to 200 (300 gig/1.5k) million records. That's a lot of n for O(n).

3. SSDs are faster than spinning disks because they have potentially zero track-to-track seek times. On the other hand, once the read-write head is above the correct disk sector, disks have very highly evolved I/O for consecutive sectors (not fragmented). By contrast, the controllers that provide the interface to SSDs are not nearly as evolved, so SSDs may still have lower bandwidth (for example, a "Class 10" SD card designed for HD video still, unless otherwise marked, only has 10 Mbyte per second speed). http://en.wikipedia.org/wiki/Secure_Digital However, a device speed of 10 Mbyte per second would put a SCSI device near the bottom of the speed hierarchy. http://en.wikipedia.org/wiki/SCSI Serial ATA (SATA) is an even faster interface. http://en.wikipedia.org/wiki/Serial_ATA

So, I would recommend double-checking the specs of your SSD and not necessarily making the reasonable, but not yet true, assumption that solid-state has to be faster than mechanical.

One strategy that might work is to have an entirely separate (not joined) 8-column table; develop queries (targets) on that database and then write out the primary keys of the rows you are interested in to a separate table (CREATE TABLE AS SELECT primarykey FROM 8columntable WHERE yourquery;) and then JOIN the row-reduced table to the main table. If your row-reduced table has millions of fewer records (primary keys), that's millions of rows (in the main 150-column table) where the precompiled SQL query doesn't have to be executed.
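That row-reduction strategy, sketched end to end with Python's sqlite3 (tiny stand-in tables; all names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative stand-ins: "main" plays the wide 150-column table,
# "narrow" the separate 8-column query table.
conn.execute("CREATE TABLE main (pk INTEGER PRIMARY KEY, wide_payload TEXT)")
conn.execute("CREATE TABLE narrow (pk INTEGER PRIMARY KEY, a INTEGER)")
conn.executemany("INSERT INTO main VALUES (?, ?)",
                 [(i, "payload-%d" % i) for i in range(100)])
conn.executemany("INSERT INTO narrow VALUES (?, ?)",
                 [(i, i % 10) for i in range(100)])

# Step 1: run the selective query against the narrow table only,
# materialising just the matching primary keys.
conn.execute("CREATE TABLE hits AS SELECT pk FROM narrow WHERE a = 3")

# Step 2: join the reduced key set back to the wide table, so the
# expensive wide-row decoding happens only for matching rows.
result = conn.execute(
    "SELECT main.wide_payload FROM hits JOIN main ON main.pk = hits.pk"
).fetchall()
print(len(result))
```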
HTH,
Jim Callahan
Orlando, FL

On Wed, Aug 6, 2014 at 11:51 PM, Paul Dillon wrote:

> Hello,
>
> I would like to know if splitting a big table into two smaller ones, and
> then using a join in my queries would speed up performance.
>
> My table is 100-300GB in size and has about 150 columns. There are 8
> fields that I frequently use in my queries, which require full table scans
> to complete. I usually query each field once per data load, so the time to
> index them is not worth it. (FYI I am data mining.)
>
> So my questions are:
>
> 1. Will moving these 8 query fields to a smaller table improve query
> performance when joined to the larger table? My logic is that this small
> table would only be about 5% the size of the full table, so the full table
> scan to service the query might be faster.
>
> 2. Would it be worthwhile having that smaller table in a separate .db file,
> so that the data is closer together on my SSD drive?
>
> 3. Would the data loading performance be heavily impacted if I had to
> insert into two tables instead of one? I use "INSERT OR REPLACE" for my
> loading, with a single index.
>
> Many Thanks,
>
> Paul
Re: [sqlite] Improve query performance using a join
On 7 Aug 2014, at 4:51am, Paul Dillon wrote:

> 1. Will moving these 8 query fields to a smaller table improve query
> performance when joined to the larger table? My logic is that this small
> table would only be about 5% the size of the full table, so the full table
> scan to service the query might be faster.

This depends on how big the smaller table would be compared to how much cache your system uses for your database file. It's impossible for us to answer because it depends a great deal on your specific hardware, OS, storage drivers, and background processes. Not only is it difficult to test under realistic conditions, but you may upgrade your computer later this year and find the same test gets the other result because the relative sizes of rows, pages and caches have changed.

> 2. Would it be worthwhile having that smaller table in a separate .db file,
> so that the data is closer together on my SSD drive?

Still difficult to say, for the same reasons as listed above.

What you might find increases your speed is to make sure that those 8 columns are the first 8 columns listed in your table definition (after the primary key, if any). Each time you have SQLite read the columns for a row it has to read the whole row up to the last column it needs. So if all the unwanted columns are after these, it will only need to read these 8 columns from the file. This change involves no extra programming and no extra storage, so it's very "cheap".

> 3. Would the data loading performance be heavily impacted if I had to
> insert into two tables instead of one? I use "INSERT OR REPLACE" for my
> loading, with a single index.

This would be the drawback which might stop me from making the change. The other is that your programming would be more complicated, which takes time and would make debugging harder.

Lastly, do not fall into the trap of premature optimization. You should not be looking for your program to run "as fast as possible". You should be looking for it to run "acceptably fast". You can spend 20 hours of programming to improve runtime by 1 minute -- a minute that your users wouldn't care about because they always go make a cup of coffee during a run anyway.

Simon.
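The column-ordering advice amounts to a table definition like the following sketch (made-up names; the point is only that the frequently queried columns precede the wide, rarely used ones in the CREATE TABLE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Frequently queried columns first: SQLite decodes a row's record
# left-to-right and can stop once it has the last column a query needs,
# so queries on q1..q3 never decode the wide trailing payload.
conn.execute("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        q1, q2, q3,        -- the hot query columns come first
        payload BLOB       -- wide, rarely read columns come last
    )
""")
conn.execute("INSERT INTO events VALUES (1, 'a', 'b', 'c', zeroblob(4096))")
(q1,) = conn.execute("SELECT q1 FROM events WHERE id = 1").fetchone()
print(q1)
```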
[sqlite] Improve query performance using a join
Hello,

I would like to know if splitting a big table into two smaller ones, and then using a join in my queries, would speed up performance.

My table is 100-300GB in size and has about 150 columns. There are 8 fields that I frequently use in my queries, which require full table scans to complete. I usually query each field once per data load, so the time to index them is not worth it. (FYI I am data mining.)

So my questions are:

1. Will moving these 8 query fields to a smaller table improve query performance when joined to the larger table? My logic is that this small table would only be about 5% the size of the full table, so the full table scan to service the query might be faster.

2. Would it be worthwhile having that smaller table in a separate .db file, so that the data is closer together on my SSD drive?

3. Would the data loading performance be heavily impacted if I had to insert into two tables instead of one? I use "INSERT OR REPLACE" for my loading, with a single index.

Many Thanks,

Paul
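Question 2 could be prototyped with ATTACH, which lets a single connection join tables living in different .db files. A sketch with Python's sqlite3 (file, table, and column names are made up for illustration):

```python
import os
import sqlite3
import tempfile

# Two separate database files: the wide table in one, the narrow
# query table in another, joined via ATTACH.
tmp = tempfile.mkdtemp()
main_db = os.path.join(tmp, "main.db")
narrow_db = os.path.join(tmp, "narrow.db")

conn = sqlite3.connect(main_db)
conn.execute("CREATE TABLE main (pk INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO main VALUES (1, 'hello'), (2, 'world')")

conn.execute("ATTACH DATABASE ? AS narrow", (narrow_db,))
conn.execute("CREATE TABLE narrow.small (pk INTEGER PRIMARY KEY, a INTEGER)")
conn.execute("INSERT INTO narrow.small VALUES (1, 10), (2, 20)")

# Once attached, cross-database joins work transparently.
rows = conn.execute(
    "SELECT main.payload FROM narrow.small JOIN main ON main.pk = small.pk "
    "WHERE small.a = 20"
).fetchall()
print(rows)  # [('world',)]
```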