Re: [sqlite] Improve query performance using a join

2014-08-19 Thread Simon Slavin

On 19 Aug 2014, at 8:25am, Paul Dillon  wrote:

> I was
> using count(first_field) instead of selecting all the fields, so I can't
> imagine that could be the problem.

There's an optimization in SQLite which means you can do COUNT(*) and it will 
fetch no data at all.  It's faster than doing any COUNT(specific_field).
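
For illustration, with a made-up table name (first_field is from the quote
above):

    -- Answered from the table's b-tree without decoding any row data.
    SELECT COUNT(*) FROM big_table;

    -- Must read first_field in every row to skip NULLs, so it cannot
    -- take the shortcut.
    SELECT COUNT(first_field) FROM big_table;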

Simon.


Re: [sqlite] Improve query performance using a join

2014-08-19 Thread Paul Dillon
Jim Callahan jim.callahan.orlando at gmail.com wrote:

> 1. My condolences; with those dimensions you are heading for "big
> data"/hadoop land.

Heheh, thanks, I am so determined not to go there!  SQLite has been
such a nice, simple database to use; I will do anything to avoid all
the complexity of those map-reduce solutions.

> 2. Worry about the number of rows; that's what feeds into the big-oh: O(n).
> Assuming your 150 columns translate into a 1.5k to 2k record length that
> means your 300 gigabyte file must have 150 (300 gig/2k) to 200 (300
> gig/1.5k) million records. That's a lot of n for O(n).

I've got about 200 million rows of 1,100 bytes each.  At least it's
linear.  I would love to partition the rows into 4 databases and query
them in parallel.  But that's just a fantasy; I can live with it for
now.

> So, I would recommend double checking the specs of your SSD and not
> necessarily making the reasonable, but not yet true, assumption that
> solid-state has to be faster than mechanical.

I use an SSD for other reasons.  However, I've tested sustained
sequential read and I am getting 550MB/s.  I'm doing full table scans
so it's sequential.  This is on a laptop, and I am fairly sure there
are no mechanical 2.5" HDDs that can transfer 200GB at 550MB/s.

> One strategy that might work is to have an entirely separate (not joined) 8
> column table; develop queries (targets) on that database and then write out
> the primary key of the rows you are interested in to a separate table
> (CREATE TABLE rowreduced AS SELECT primarykey FROM "8columntable" WHERE
> yourquery;) and then JOIN the row-reduced table to the main table. If your
> row-reduced table has millions fewer records (primary keys), that's millions
> of rows (in the main 150 column table) where the precompiled SQL query
> doesn't have to be executed.

Cheers for that, I'll do some testing and see how I go!

Thanks,

Paul


Re: [sqlite] Improve query performance using a join

2014-08-19 Thread Paul Dillon
Thanks to all who replied to my post, and sorry for the delayed response.
I had trouble joining the list and had not realised that my post went
through until I tried to join the list again.

Simon Slavin slavins at bigfraud.org wrote:

> What you might find increases your speed is to make sure that those 8 columns
> are the first 8 columns listed in your table definition (after the primary
> key, if any).  Each time you have SQLite read the columns for a row it has to
> read the whole row up to the last column it needs.  So if all the unwanted
> columns are after these it will only need to read these 8 columns from the
> file.

Thanks, intriguing idea.  Unfortunately it made no difference in my
situation.  I made sure the table was too large to be cached, and I was
using count(first_field) instead of selecting all the fields, so I can't
imagine that could be the problem.

> Lastly, do not fall into the trap of premature optimization.  You should not
> be looking for your program to run "as fast as possible".  You should be
> looking for it to run "acceptably fast".  You can spend 20 hours of
> programming to improve runtime by 1 minute -- a minute that your users
> wouldn't care about because they always go make a cup of coffee during a run
> anyway.

Yeah, I am wary of that.  The queries take about 20 minutes, which really
disrupts my workflow.  If I spend a few hours on this and it works, it'll
save me hundreds of hours in the long run.  I like to collect anecdotal
performance advice and implement it when the cost of doing so is low.  I
have some evidence that the approach has worked for me in the past.

Thanks,

Paul


Re: [sqlite] Improve query performance using a join

2014-08-08 Thread Dominique Devienne
On Thu, Aug 7, 2014 at 5:51 AM, Paul Dillon  wrote:

> 1. Will moving these 8 query fields to a smaller table improve query
> performance when joined to the larger table?  My logic is that this small
> table would only be about 5% the size of the full table, so the full table
> scan to service the query might be faster.
>

An alternative that would let you cheaply evaluate whether the smaller
8-column table helps would be to index those 8 columns.

See 1.7 Covering Indices from http://www.sqlite.org/queryplanner.html.

Yes, you'd duplicate the data, and your DB would grow by 5% by your own
admission, but then SQLite would be able to "route" queries selecting from
that 8-column subset to the index and not read the table at all, and read
the full table for other queries, transparently for you.  (Run ANALYZE to
get good stats, which will take a long time, and then verify using EXPLAIN
QUERY PLAN, http://www.sqlite.org/eqp.html, whether queries access the
index, the table, or both.)  If you adjust your cache size to match or
exceed the index size, you might be able to avoid disk I/O altogether
later, if all of your queries stayed within that subset, although it
sounded like you load everything upfront, which implies you don't query
much later.
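
As a sketch of the idea (table, column, and index names invented here; the
real table has ~150 columns):

    -- One index covering the 8 frequently-queried columns.
    CREATE INDEX idx_hot8 ON big_table (c1, c2, c3, c4, c5, c6, c7, c8);

    -- Gather statistics so the planner can choose well.
    ANALYZE;

    -- Look for "USING COVERING INDEX" in the plan output.
    EXPLAIN QUERY PLAN
    SELECT c1, COUNT(*) FROM big_table WHERE c2 > 100 GROUP BY c1;

    -- Optionally size the page cache (a negative value is in KiB) so the
    -- index can stay resident; adjust to your RAM.
    PRAGMA cache_size = -4000000;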

The advantage of the above is that it requires very little effort from you,
basically just a CREATE INDEX, to test that idea and see if it helps. --DD


Re: [sqlite] Improve query performance using a join

2014-08-07 Thread Jim Callahan
A few observations:
1. My condolences; with those dimensions you are heading for "big
data"/hadoop land.

2. Worry about the number of rows; that's what feeds into the big-oh: O(n).
Assuming your 150 columns translate into a 1.5k to 2k record length that
means your 300 gigabyte file must have 150 (300 gig/2k) to 200 (300
gig/1.5k) million records. That's a lot of n for O(n).

3. SSDs are faster than spinning disks because they have potentially zero
track-to-track seek times. On the other hand, once the read-write head is
above the correct sector, disks have very highly evolved I/O for
consecutive (unfragmented) sectors. By contrast, the controllers that
provide the interface to SSDs are not nearly as evolved, so SSDs may still
have lower bandwidth (for example, a "Class 10" SD card designed for HD
video, unless otherwise marked, only guarantees 10 MByte per second).
http://en.wikipedia.org/wiki/Secure_Digital

However, a device speed of 10 Mbyte per second would put a SCSI device near
the bottom of the speed hierarchy.
http://en.wikipedia.org/wiki/SCSI

Serial ATA (SATA) is an even faster interface.
http://en.wikipedia.org/wiki/Serial_ATA

So, I would recommend double checking the specs of your SSD and not
necessarily making the reasonable, but not yet true, assumption that
solid-state has to be faster than mechanical.

One strategy that might work is to have an entirely separate (not joined) 8
column table; develop queries (targets) on that database and then write out
the primary key of the rows you are interested in to a separate table
(CREATE TABLE rowreduced AS SELECT primarykey FROM "8columntable" WHERE
yourquery;) and then JOIN the row-reduced table to the main table. If your
row-reduced table has millions fewer records (primary keys), that's millions
of rows (in the main 150 column table) where the precompiled SQL query
doesn't have to be executed.
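
A sketch of that strategy with invented names (narrow8 holds the 8 query
columns plus the key; main150 is the wide table):

    -- Step 1: scan only the narrow table and keep the matching keys.
    CREATE TABLE targets AS
    SELECT pk FROM narrow8 WHERE c3 > 100 AND c7 = 'X';

    -- Step 2: join the reduced key set back to the wide table, so the
    -- expensive wide-row reads happen only for the surviving keys.
    SELECT m.*
    FROM targets t
    JOIN main150 m ON m.pk = t.pk;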

HTH,

Jim Callahan
Orlando, FL

On Wed, Aug 6, 2014 at 11:51 PM, Paul Dillon  wrote:

> Hello,
>
> I would like to know if splitting a big table into two smaller ones, and
> then using a join in my queries would speed up performance.
>
> My table is 100-300GB in size and has about 150 columns.  There are 8
> fields that I frequently use in my queries, which require full table scans
> to complete.  I usually query each field once per data load, so the time to
> index them is not worth it.  (FYI I am data mining).
>
> So my questions are:
>
> 1. Will moving these 8 query fields to a smaller table improve query
> performance when joined to the larger table?  My logic is that this small
> table would only be about 5% the size of the full table, so the full table
> scan to service the query might be faster.
>
> 2. Would it be worthwhile having that smaller table in a separate .db file,
> so that the data is closer together on my SSD drive?
>
> 3. Would the data loading performance be heavily impacted if I had to
> insert into two tables instead of one?  I use "INSERT OR REPLACE" for my
> loading, with a single index.
>
> Many Thanks,
>
> Paul


Re: [sqlite] Improve query performance using a join

2014-08-07 Thread Simon Slavin

On 7 Aug 2014, at 4:51am, Paul Dillon  wrote:

> 1. Will moving these 8 query fields to a smaller table improve query
> performance when joined to the larger table?  My logic is that this small
> table would only be about 5% the size of the full table, so the full table
> scan to service the query might be faster.

This depends on how big the smaller table would be compared to how much cache 
your system uses for your database file.  It's impossible for us to answer 
because it depends a great deal on your specific hardware, OS, storage drivers, 
and background processes.  Not only is it difficult to test under realistic 
conditions, but you may upgrade your computer later this year and find the same 
test gets the other result because relative sizes of rows, pages and caches 
have changed.

> 2. Would it be worthwhile having that smaller table in a separate .db file,
> so that the data is closer together on my SSD drive?

Still difficult to say, for the same reasons as listed above.

What you might find increases your speed is to make sure that those 8 columns 
are the first 8 columns listed in your table definition (after the primary key, 
if any).  Each time you have SQLite read the columns for a row it has to read 
the whole row up to the last column it needs.  So if all the unwanted columns 
are after these it will only need to read these 8 columns from the file.  This 
change involves no extra programming and no extra storage, so it's very "cheap".
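
For example, a hypothetical layout (real column names will differ):

    CREATE TABLE big_table (
        id    INTEGER PRIMARY KEY,
        hot1  INTEGER,  -- the 8 frequently-queried columns first...
        hot2  INTEGER,
        hot8  TEXT,     -- (hot3 through hot7 omitted for brevity)
        cold1 TEXT,     -- ...the ~142 rarely-read columns after them
        cold2 TEXT
        -- and so on
    );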

> 3. Would the data loading performance be heavily impacted if I had to
> insert into two tables instead of one?  I use "INSERT OR REPLACE" for my
> loading, with a single index.

This would be the drawback which might stop me from making the change.  The 
other would be that your programming would be more complicated, which takes 
time and would make debugging harder.

Lastly, do not fall into the trap of premature optimization.  You should not be 
looking for your program to run "as fast as possible".  You should be looking 
for it to run "acceptably fast".  You can spend 20 hours of programming to 
improve runtime by 1 minute -- a minute that your users wouldn't care about 
because they always go make a cup of coffee during a run anyway.

Simon.


[sqlite] Improve query performance using a join

2014-08-07 Thread Paul Dillon
Hello,

I would like to know if splitting a big table into two smaller ones, and
then using a join in my queries would speed up performance.

My table is 100-300GB in size and has about 150 columns.  There are 8
fields that I frequently use in my queries, which require full table scans
to complete.  I usually query each field once per data load, so the time to
index them is not worth it.  (FYI I am data mining).

So my questions are:

1. Will moving these 8 query fields to a smaller table improve query
performance when joined to the larger table?  My logic is that this small
table would only be about 5% the size of the full table, so the full table
scan to service the query might be faster.

2. Would it be worthwhile having that smaller table in a separate .db file,
so that the data is closer together on my SSD drive?

3. Would the data loading performance be heavily impacted if I had to
insert into two tables instead of one?  I use "INSERT OR REPLACE" for my
loading, with a single index.
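
For concreteness, my load would become something like this (simplified,
with made-up table and column names; both inserts share the key so the two
halves of each row stay aligned):

    BEGIN;
    INSERT OR REPLACE INTO narrow8 (id, c1, c2) VALUES (1, 10, 20);
    INSERT OR REPLACE INTO wide142 (id, c9)     VALUES (1, 'payload');
    COMMIT;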

Many Thanks,

Paul