[sqlite] Multi-table index ersatz?

2015-03-04 Thread Hick Gunter
Properly implemented virtual tables do support indexing, but you have to write 
the code to support that yourself.

I have personally implemented an index based on the fastbit package which is 
ideally suited to retrieving large data sets via equality and range constraints.

See https://sdm.lbl.gov/fastbit/

-Urspr?ngliche Nachricht-
Von: Eric Grange [mailto:zarglu at gmail.com]
Gesendet: Mittwoch, 04. M?rz 2015 08:24
An: General Discussion of SQLite Database
Betreff: Re: [sqlite] Multi-table index ersatz?

> Rowids will be faster than primary keys.

My primary keys are ROWIDs ("INTEGER PRIMARY KEY" actually)

None of the index was exploited for the order by, and the matched records in 
table A are scattered in pages all over the database, so ordering them in 
memory has a tendency to "replace" the whole SQLite cache: first time a query 
is run, it's slow, second time, it's fast, but if you change the condition 
value (?1) then it's slow again as the page cache is invalidated (it is very 
visible in the resource monitor, with a disk access spike)

> You might be able to make the new table a WITHOUT ROWID table and set
> its
PRIMARY KEY up with the same (or a superset of the) fields of your "fairly 
large index" in order to save a bit of space.

I have been experimenting that way, and actually since A1 and B1 should fit 
32bits integers for the foreseeable future, combining them into a 64bit integer 
is possible, and I use (A1 << 32) | B1  as "INTEGER PRIMARY KEY"
(ROWID). This makes a separate composite index unnecessary as the primary key 
becomes the composite index: the equality condition becomes a range condition 
on the rowid, with an order by on the rowid, both being fast and cache-friendly.

It reduces disk usage significantly over the previous full-blown C table + 
composite index, it is still a sort of manually-managed hacky index, which 
involves extra queries to maintain it. But at the moment it seems to be the 
"better" solution.

> It might be possible to write a virtual table module that does the
> same as your index on C, but with C being a view.

I had a look that way, but AFAICT virtual tables do not support indexing, so I 
would have to index manually.


On Tue, Mar 3, 2015 at 4:57 PM, Dan Kennedy  wrote:

> On 03/03/2015 06:10 PM, Eric Grange wrote:
>
>> Hi,
>>
>> I have problem where I need a "multi-table index" ersatz, or maybe a
>> better data structure :-)
>>
>> The problem is as follow:
>>
>> - Table A : some fields plus fields A1 & A2
>> - Table B : some fields plus fields B1 & B2
>>
>> Both tables have several dozen millions of rows, and both are
>> accessed independently of each others by some queries, their current
>> structure has no performance issues for those queries.
>>
>> However I have a new query which is like
>>
>> select ...some fields of A & B...
>> from A join B on A.A2 = B.B2
>> where A.A1 = ?1
>> order by B.B1
>> limit 100
>>
>>
>> Without the limit, there can be tens of thousandths resulting rows,
>> without the A1 condition, there can be millions of resulting rows.
>>
>> With indexes on A & B, the performance of the above is not very good,
>> as indexing A1 is not enough, and indexing B1 is not enough either,
>> so no query plan is satisfying.
>>
>> I can make the query instantaneous by duplicating the A1 & B1 fields
>> in a dedicated C table (along with the primary keys of A & B), index
>> that table, and then join back the A & B table to get the other
>> fields.
>>
>> However this results in a fairly large table of duplicated data,
>> whose sole purpose is to allow the creation of a fairly large index,
>> which gets me the performance.
>>
>
> You might be able to make the new table a WITHOUT ROWID table and set
> its PRIMARY KEY up with the same (or a superset of the) fields of your
> "fairly large index" in order to save a bit of space.
>
>
>
>
>
>> Note that if the fields A1 & B1 are removed from their tables and
>> kept only in C, this has massive performance implication on other
>> queries running only against A & B, as those fields are leveraged in
>> other composite indexes.
>>
>> Is there a better way that would not involve duplicating the data?
>>
>> Eric
>> ___
>> sqlite-users mailing list
>> sqlite-users at mailinglists.sqlite.org
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>
>
> ___
> sqlite-users mailing list
> sqlite-users at ma

[sqlite] Multi-table index ersatz?

2015-03-04 Thread Eric Grange
> Rowids will be faster than primary keys.

My primary keys are ROWIDs ("INTEGER PRIMARY KEY" actually)

None of the index was exploited for the order by, and the matched records
in table A are scattered in pages all over the database, so ordering them
in memory has a tendency to "replace" the whole SQLite cache: first time a
query is run, it's slow, second time, it's fast, but if you change the
condition value (?1) then it's slow again as the page cache is invalidated
(it is very visible in the resource monitor, with a disk access spike)

> You might be able to make the new table a WITHOUT ROWID table and set its
PRIMARY KEY up with the same (or a superset of the) fields of your "fairly
large index" in order to save a bit of space.

I have been experimenting that way, and actually since A1 and B1 should fit
32bits integers for the foreseeable future, combining them into a 64bit
integer is possible, and I use (A1 << 32) | B1  as "INTEGER PRIMARY KEY"
(ROWID). This makes a separate composite index unnecessary as the primary
key becomes the composite index: the equality condition becomes a range
condition on the rowid, with an order by on the rowid, both being fast and
cache-friendly.

It reduces disk usage significantly over the previous full-blown C table +
composite index, it is still a sort of manually-managed hacky index, which
involves extra queries to maintain it. But at the moment it seems to be the
"better" solution.

> It might be possible to write a virtual table module that does the same
> as your index on C, but with C being a view.

I had a look that way, but AFAICT virtual tables do not support indexing,
so I would have to index manually.


On Tue, Mar 3, 2015 at 4:57 PM, Dan Kennedy  wrote:

> On 03/03/2015 06:10 PM, Eric Grange wrote:
>
>> Hi,
>>
>> I have problem where I need a "multi-table index" ersatz, or maybe a
>> better
>> data structure :-)
>>
>> The problem is as follow:
>>
>> - Table A : some fields plus fields A1 & A2
>> - Table B : some fields plus fields B1 & B2
>>
>> Both tables have several dozen millions of rows, and both are accessed
>> independently of each others by some queries, their current structure has
>> no performance issues for those queries.
>>
>> However I have a new query which is like
>>
>> select ...some fields of A & B...
>> from A join B on A.A2 = B.B2
>> where A.A1 = ?1
>> order by B.B1
>> limit 100
>>
>>
>> Without the limit, there can be tens of thousandths resulting rows,
>> without
>> the A1 condition, there can be millions of resulting rows.
>>
>> With indexes on A & B, the performance of the above is not very good, as
>> indexing A1 is not enough, and indexing B1 is not enough either, so no
>> query plan is satisfying.
>>
>> I can make the query instantaneous by duplicating the A1 & B1 fields in a
>> dedicated C table (along with the primary keys of A & B), index that
>> table,
>> and then join back the A & B table to get the other fields.
>>
>> However this results in a fairly large table of duplicated data, whose
>> sole
>> purpose is to allow the creation of a fairly large index, which gets me
>> the
>> performance.
>>
>
> You might be able to make the new table a WITHOUT ROWID table and set its
> PRIMARY KEY up with the same (or a superset of the) fields of your "fairly
> large index" in order to save a bit of space.
>
>
>
>
>
>> Note that if the fields A1 & B1 are removed from their tables and kept
>> only
>> in C, this has massive performance implication on other queries running
>> only against A & B, as those fields are leveraged in other composite
>> indexes.
>>
>> Is there a better way that would not involve duplicating the data?
>>
>> Eric
>> ___
>> sqlite-users mailing list
>> sqlite-users at mailinglists.sqlite.org
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>
>
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Multi-table index ersatz?

2015-03-03 Thread Dan Kennedy
On 03/03/2015 06:10 PM, Eric Grange wrote:
> Hi,
>
> I have problem where I need a "multi-table index" ersatz, or maybe a better
> data structure :-)
>
> The problem is as follow:
>
> - Table A : some fields plus fields A1 & A2
> - Table B : some fields plus fields B1 & B2
>
> Both tables have several dozen millions of rows, and both are accessed
> independently of each others by some queries, their current structure has
> no performance issues for those queries.
>
> However I have a new query which is like
>
> select ...some fields of A & B...
> from A join B on A.A2 = B.B2
> where A.A1 = ?1
> order by B.B1
> limit 100
>
>
> Without the limit, there can be tens of thousandths resulting rows, without
> the A1 condition, there can be millions of resulting rows.
>
> With indexes on A & B, the performance of the above is not very good, as
> indexing A1 is not enough, and indexing B1 is not enough either, so no
> query plan is satisfying.
>
> I can make the query instantaneous by duplicating the A1 & B1 fields in a
> dedicated C table (along with the primary keys of A & B), index that table,
> and then join back the A & B table to get the other fields.
>
> However this results in a fairly large table of duplicated data, whose sole
> purpose is to allow the creation of a fairly large index, which gets me the
> performance.

You might be able to make the new table a WITHOUT ROWID table and set 
its PRIMARY KEY up with the same (or a superset of the) fields of your 
"fairly large index" in order to save a bit of space.




>
> Note that if the fields A1 & B1 are removed from their tables and kept only
> in C, this has massive performance implication on other queries running
> only against A & B, as those fields are leveraged in other composite
> indexes.
>
> Is there a better way that would not involve duplicating the data?
>
> Eric
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



[sqlite] Multi-table index ersatz?

2015-03-03 Thread Hick Gunter
Let's construct this example from the reverse.

The object is to avoid the sort at the end (sorting "millions" to return 100 is 
a bad tradeoff), so the B table needs to be visited in B1 order.

-> outer loop = B
-> inner loop = A
-> index B on (B1,...)

The join is on A2 = B2

->index A on (A2,...)

The WHERE clause specifies to quickly determine the subset of rows that have a 
specific value of A1

->index A on (A1,...)

To minimize the number of rows visited in A, the more specific field should 
come first

-> index A on (A1,A2,...) if card(A1) > card(A2)

Which results in

CREATE INDEX B_B1 on B (B1);
CREATE INDEX A_A1_A2 on A (A1, A2);

SELECT ... FROM B CROSS JOIN A ON (A.A1 = ? AND A.A2 = B.B2) ORDER BY B.B1 
LIMIT 100;

For reversed cardinality, only the index on A needs to switch field order.


Rowids will be faster than primary keys.

CREATE TABE C AS SELECT ( A.A1, B.B1, A.rowid AS IDA, B.rowid AS IDB FROM A 
JOIN B ON A.A2 = B.B2);
CREATE INDEX ON C (A1,B1);

SELECT  FROM C JOIN A ON A.ROWID = C.IDA JOIN B ON B.ROWID = C.IDB WHERE 
C.A1 = ? ORDER BY C.B1 LIMIT 100;


-Urspr?ngliche Nachricht-
Von: Eric Grange [mailto:zarglu at gmail.com]
Gesendet: Dienstag, 03. M?rz 2015 12:10
An: General Discussion of SQLite Database
Betreff: [sqlite] Multi-table index ersatz?

Hi,

I have problem where I need a "multi-table index" ersatz, or maybe a better 
data structure :-)

The problem is as follow:

   - Table A : some fields plus fields A1 & A2
   - Table B : some fields plus fields B1 & B2

Both tables have several dozen millions of rows, and both are accessed 
independently of each others by some queries, their current structure has no 
performance issues for those queries.

However I have a new query which is like

select ...some fields of A & B...
from A join B on A.A2 = B.B2
where A.A1 = ?1
order by B.B1
limit 100


Without the limit, there can be tens of thousandths resulting rows, without the 
A1 condition, there can be millions of resulting rows.

With indexes on A & B, the performance of the above is not very good, as 
indexing A1 is not enough, and indexing B1 is not enough either, so no query 
plan is satisfying.

I can make the query instantaneous by duplicating the A1 & B1 fields in a 
dedicated C table (along with the primary keys of A & B), index that table, and 
then join back the A & B table to get the other fields.

However this results in a fairly large table of duplicated data, whose sole 
purpose is to allow the creation of a fairly large index, which gets me the 
performance.

Note that if the fields A1 & B1 are removed from their tables and kept only in 
C, this has massive performance implication on other queries running only 
against A & B, as those fields are leveraged in other composite indexes.

Is there a better way that would not involve duplicating the data?

Eric
___
sqlite-users mailing list
sqlite-users at mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___
 Gunter Hick
Software Engineer
Scientific Games International GmbH
FN 157284 a, HG Wien
Klitschgasse 2-4, A-1130 Vienna, Austria
Tel: +43 1 80100 0
E-Mail: hick at scigames.at

This communication (including any attachments) is intended for the use of the 
intended recipient(s) only and may contain information that is confidential, 
privileged or legally protected. Any unauthorized use or dissemination of this 
communication is strictly prohibited. If you have received this communication 
in error, please immediately notify the sender by return e-mail message and 
delete all copies of the original communication. Thank you for your cooperation.




[sqlite] Multi-table index ersatz?

2015-03-03 Thread Clemens Ladisch
Eric Grange wrote:
> select ...some fields of A & B...
> from A join B on A.A2 = B.B2
> where A.A1 = ?1
> order by B.B1
> limit 100
>
> Without the limit, there can be tens of thousandths resulting rows,

Even with the limit, all the tens of thousands rows must be sorted.

> without the A1 condition, there can be millions of resulting rows.
>
> With indexes on A & B, the performance of the above is not very good, as
> indexing A1 is not enough, and indexing B1 is not enough either, so no
> query plan is satisfying.

According to your numbers, the index on A1 is more important, which
implies that the sorting must be done without the help of an index.

> I can make the query instantaneous by duplicating the A1 & B1 fields in a
> dedicated C table (along with the primary keys of A & B), index that table,
> and then join back the A & B table to get the other fields.
> [...]
> Is there a better way that would not involve duplicating the data?

An index _is_ somthing like a table containing duplicated data; the
difference is that it is maintained automatically.  You can get the
same effect with triggers to maintain the C table.

SQLite does not have multi-table indexes.

It might be possible to write a virtual table module that does the same
as your index on C, but with C being a view.


Regards,
Clemens


[sqlite] Multi-table index ersatz?

2015-03-03 Thread Eric Grange
Yes A2 & B2 are already indexed (individually and in composite indexes)
The problem is that this indexing is not selective enough when taken in
isolation.
Le 3 mars 2015 12:36, "Simon Davies"  a ?crit
:

> On 3 March 2015 at 11:10, Eric Grange  wrote:
> >
> > Hi,
> >
> > I have problem where I need a "multi-table index" ersatz, or maybe a
> better
> > data structure :-)
> >
> > The problem is as follow:
> >
> >- Table A : some fields plus fields A1 & A2
> >- Table B : some fields plus fields B1 & B2
> >
> > Both tables have several dozen millions of rows, and both are accessed
> > independently of each others by some queries, their current structure has
> > no performance issues for those queries.
> >
> > However I have a new query which is like
> >
> > select ...some fields of A & B...
> > from A join B on A.A2 = B.B2
> > where A.A1 = ?1
> > order by B.B1
> > limit 100
> >
> > Without the limit, there can be tens of thousandths resulting rows,
> without
> > the A1 condition, there can be millions of resulting rows.
> >
> > With indexes on A & B, the performance of the above is not very good, as
> > indexing A1 is not enough, and indexing B1 is not enough either, so no
> > query plan is satisfying.
>
> Have you tried indexing on A2?
>
> .
> .
> .
> > Is there a better way that would not involve duplicating the data?
> >
> > Eric
>
> Regards,
> Simon
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Multi-table index ersatz?

2015-03-03 Thread Simon Slavin

On 3 Mar 2015, at 11:10am, Eric Grange  wrote:

> With indexes on A & B, the performance of the above is not very good, as
> indexing A1 is not enough, and indexing B1 is not enough either, so no
> query plan is satisfying.

The B1 index isn't going to be used.  Here is your query:

select ...some fields of A & B...
from A join B on A.A2 = B.B2
where A.A1 = ?1
order by B.B1
limit 100

An index on A.A1 is good, since it satisfies the "WHERE" clause.  So do that as 
you did.

Once SQLite has found relevant rows of A it's processing the "JOIN".  This 
means going through table B.B2 looking for rows which match existing values of 
A.A2.  You need an index on B.B2.  But having done that the "ORDER BY" clause 
means it needs to order the resulting rows by B.B1.  So one index which allowed 
both might provide the best advantage.  Try

CREATE INDEX B_2_1 ON B (B2, B1)

then do an ANALYZE, and see if that helps matters.  Another approach is to look 
at your "ORDER BY" clause and see that what you're doing is based on ordering 
by B.B1, sp another way to see your query is

select ...some fields of A & B...
from B join A on A.A2 = B.B2
where A.A1 = ?1
order by B.B1
limit 100

This would be best helped with an index on B.B1 and another on A.A1 and A.A2.  
I suggest you do

CREATE INDEX B_1_2 ON B (B1, B2)
CREATE INDEX B_2_1 ON B (B2, B1)
CREATE INDEX A_1_2 ON A (A1, A2)
CREATE INDEX A_2_1 ON A (A2, A1)

then do an ANALYZE, then execute the query as rephrased above.  If this turns 
out to be faster you can use EXPLAIN QUERY PLAN to find out which indexes 
SQLite is doing and delete the unused ones.

Simon.


[sqlite] Multi-table index ersatz?

2015-03-03 Thread Eric Grange
Hi,

I have problem where I need a "multi-table index" ersatz, or maybe a better
data structure :-)

The problem is as follow:

   - Table A : some fields plus fields A1 & A2
   - Table B : some fields plus fields B1 & B2

Both tables have several dozen millions of rows, and both are accessed
independently of each others by some queries, their current structure has
no performance issues for those queries.

However I have a new query which is like

select ...some fields of A & B...
from A join B on A.A2 = B.B2
where A.A1 = ?1
order by B.B1
limit 100


Without the limit, there can be tens of thousandths resulting rows, without
the A1 condition, there can be millions of resulting rows.

With indexes on A & B, the performance of the above is not very good, as
indexing A1 is not enough, and indexing B1 is not enough either, so no
query plan is satisfying.

I can make the query instantaneous by duplicating the A1 & B1 fields in a
dedicated C table (along with the primary keys of A & B), index that table,
and then join back the A & B table to get the other fields.

However this results in a fairly large table of duplicated data, whose sole
purpose is to allow the creation of a fairly large index, which gets me the
performance.

Note that if the fields A1 & B1 are removed from their tables and kept only
in C, this has massive performance implication on other queries running
only against A & B, as those fields are leveraged in other composite
indexes.

Is there a better way that would not involve duplicating the data?

Eric


[sqlite] Multi-table index ersatz?

2015-03-03 Thread Simon Davies
On 3 March 2015 at 11:10, Eric Grange  wrote:
>
> Hi,
>
> I have problem where I need a "multi-table index" ersatz, or maybe a better
> data structure :-)
>
> The problem is as follow:
>
>- Table A : some fields plus fields A1 & A2
>- Table B : some fields plus fields B1 & B2
>
> Both tables have several dozen millions of rows, and both are accessed
> independently of each others by some queries, their current structure has
> no performance issues for those queries.
>
> However I have a new query which is like
>
> select ...some fields of A & B...
> from A join B on A.A2 = B.B2
> where A.A1 = ?1
> order by B.B1
> limit 100
>
> Without the limit, there can be tens of thousandths resulting rows, without
> the A1 condition, there can be millions of resulting rows.
>
> With indexes on A & B, the performance of the above is not very good, as
> indexing A1 is not enough, and indexing B1 is not enough either, so no
> query plan is satisfying.

Have you tried indexing on A2?

. 
. 
. 
> Is there a better way that would not involve duplicating the data?
>
> Eric

Regards,
Simon