Re: Prohibitive amount of iterations in MetaTable.generateRows() when querying the INFORMATION_SCHEMA in a large database (many tables)

Lukas Eder Tue, 19 Feb 2013 08:54:39 -0800

> A 40s execution of my previous query against an INFORMATION_SCHEMA with 
10k tables, 10k primary keys and 10k foreign keys speeds up to a mere 
5-6ms!!


Obviously, I've made a mistake, I accidentally dropped my "large" schema. I 
didn't believe those 5-6ms could be true, myself ;-)
The effective improvement goes from 40s to 35s, which is still something, 
but maybe not worth the pain...

Am Dienstag, 19. Februar 2013 17:40:07 UTC+1 schrieb Lukas Eder:
>
> >> Patches are welcome of course, but I would like to keep it simple if 
> possible.
> >
> > I'll have a second look. There's always 1-2 things to be found when 
> profiling these cases. I'll report them here, in case I should find 
> something.
>
> OK, so this is the place where all the CPU power gets wasted: 
> MetaTable.checkIndex():
>
>             Value v = ValueString.get(value);
>             if (indexFrom != null && db.compare(v, indexFrom) < 0) {
>                 return false;
>             }
>             if (indexTo != null && db.compare(v, indexTo) > 0) {
>                 return false;
>             }
>
> In fact, ValueString.get() is a very expensive operation if called 100M 
> times, given the time it takes to always compare strings and cache them, 
> evict objects from the cache, etc. After playing around with a lot of 
> little tweaks, I was finally thinking that this checkIndex() method always 
> gets an "MetaTable.identifier(Table.getName())" argument for "String 
> value". So instead of re-wrapping this value into a ValueString several 
> millions of times, it would be much better to simply hold a constant 
> reference to a ValueString (and a ValueStringIgnoreCase) inside of 
> DbObjectBase. I think the H2 database can live with the slight overhead of 
> holding this reference.
>
> The effect is massive:
> A 40s execution of my previous query against an INFORMATION_SCHEMA with 
> 10k tables, 10k primary keys and 10k foreign keys speeds up to a mere 
> 5-6ms!!
>
> Please find attached a patch including the relevant changes. Note:
>
> - You may find it more appropriate to push down my new 
> DbObjectBase.getIdentifier() method to Table, or to cache these values on a 
> per-query basis. I don't know H2 well enough to assess, whether adding 
> cache values to DbObjectBase is a good idea.
> - MetaTable.identifier() has now gotten somewhat obsolete, if all 
> DbObjectBase objects can provide a cached identifier. To keep the change 
> slim, I avoided a refactoring to remove MetaTable.identifier(). The code 
> duplication is acceptable, I think
>
> Let me know if you need anything else
>
> Cheers
> Lukas
>
> Am Dienstag, 19. Februar 2013 11:56:01 UTC+1 schrieb Lukas Eder:
>>
>> Hi Thomas,
>>
>> > I thought about changing it, by using a 'before select' trigger, so 
>> that the metadata tables are materialized. 
>>
>> Yes, I had thought about that, too. Materialising views might be the path 
>> of least resistance here, specifically because all sorts of regular indexes 
>> can then be applied, and execution plans would be more sophisticated. 
>> Probably, Oracle-style materialised views would be quite a feature anyway 
>> for H2, in the long run.
>>
>> > Patches are welcome of course, but I would like to keep it simple if 
>> possible.
>>
>> I'll have a second look. There's always 1-2 things to be found when 
>> profiling these cases. I'll report them here, in case I should find 
>> something.
>>
>> > I'm not convinced that 15000 tables is a good use of a database :-)
>>
>> I wouldn't be, either...
>> I hardly ever have good and non-academic uses of a database in my set of 
>> use cases for jOOQ ;-)
>>
>> So I'll see you in 2 weeks! :-)
>>
>> Cheers
>> Lukas
>>
>> Am Montag, 18. Februar 2013 21:18:56 UTC+1 schrieb Thomas Mueller:
>>>
>>> Hi,
>>>
>>> Yes, the current metadata implementation does not scale very well. It is 
>>> made for about 100 tables, not 15000. For example, the complete schema 
>>> is always fully in memory at the moment.
>>>
>>> > Or is this a "bug" in H2
>>>
>>> I would say it's a limitation. I will document it.
>>>
>>> I thought about changing it, by using a 'before select' trigger, so that 
>>> the metadata tables are materialized. Before selecting from the 
>>> metadata tables, the 'before select' trigger re-builds them. That way the 
>>> current (slow) algorithms are still used, but then the result is cached. 
>>> That would improve performance a lot, unless if there are many metadata 
>>> changes. I never had the time to implement it, and so far only one person 
>>> complained (that was many years ago, using millions of tables as far as I 
>>> remember).
>>>
>>> Patches are welcome of course, but I would like to keep it simple if 
>>> possible. I'm not convinced that 15000 tables is a good use of a database 
>>> :-)
>>>
>>> Regards,
>>> Thomas
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Feb 18, 2013 at 7:21 PM, Lukas Eder <[email protected]> wrote:
>>>
>>>> Here's a bit of a corner-case, showing a prohibitive amount of calls to 
>>>> MetaTable.checkIndex() and other methods within MetaTable.generateRows()
>>>>
>>>> 1. I've created a database schema with 15k tables (and 15k primary 
>>>> keys, 15k foreign keys) of this form:
>>>>
>>>> DROP SCHEMA IF EXISTS large;
>>>> CREATE SCHEMA large;
>>>> CREATE TABLE large.t00000 (id INT,              CONSTRAINT pk_00000 
>>>> PRIMARY KEY (id));
>>>> CREATE TABLE large.t00001 (id INT, prev_id INT, CONSTRAINT pk_00001 
>>>> PRIMARY KEY (id), CONSTRAINT fk_00001 FOREIGN KEY (prev_id) REFERENCES 
>>>> t00000(id));
>>>> CREATE TABLE large.t00002 (id INT, prev_id INT, CONSTRAINT pk_00002 
>>>> PRIMARY KEY (id), CONSTRAINT fk_00002 FOREIGN KEY (prev_id) REFERENCES 
>>>> t00001(id));
>>>> CREATE TABLE large.t00003 (id INT, prev_id INT, CONSTRAINT pk_00003 
>>>> PRIMARY KEY (id), CONSTRAINT fk_00003 FOREIGN KEY (prev_id) REFERENCES 
>>>> t00002(id));
>>>> -- etc, you get the idea
>>>>
>>>>
>>>> ... Yes. A corner case ;-) I can provide you with the full script, if 
>>>> you want.
>>>> 2. I've run the following query here. This query runs forever:
>>>>
>>>> SELECT "INFORMATION_SCHEMA"."CROSS_REFERENCES"."FK_NAME",
>>>>   "INFORMATION_SCHEMA"."CROSS_REFERENCES"."FKTABLE_NAME",
>>>>   "INFORMATION_SCHEMA"."CROSS_REFERENCES"."FKTABLE_SCHEMA",
>>>>   "INFORMATION_SCHEMA"."CROSS_REFERENCES"."FKCOLUMN_NAME",
>>>>   "INFORMATION_SCHEMA"."CONSTRAINTS"."CONSTRAINT_NAME",
>>>>   "INFORMATION_SCHEMA"."CONSTRAINTS"."CONSTRAINT_SCHEMA"
>>>> FROM "INFORMATION_SCHEMA"."CROSS_REFERENCES"
>>>> JOIN "INFORMATION_SCHEMA"."CONSTRAINTS"
>>>> ON ("INFORMATION_SCHEMA"."CROSS_REFERENCES"."PK_NAME"           = 
>>>> "INFORMATION_SCHEMA"."CONSTRAINTS"."UNIQUE_INDEX_NAME"
>>>> AND "INFORMATION_SCHEMA"."CROSS_REFERENCES"."PKTABLE_NAME"      = 
>>>> "INFORMATION_SCHEMA"."CONSTRAINTS"."TABLE_NAME"
>>>> AND "INFORMATION_SCHEMA"."CROSS_REFERENCES"."PKTABLE_SCHEMA"    = 
>>>> "INFORMATION_SCHEMA"."CONSTRAINTS"."TABLE_SCHEMA")
>>>> WHERE "INFORMATION_SCHEMA"."CROSS_REFERENCES"."FKTABLE_SCHEMA" IN 
>>>> ('LARGE')
>>>> ORDER BY "INFORMATION_SCHEMA"."CROSS_REFERENCES"."FKTABLE_SCHEMA" ASC,
>>>>   "INFORMATION_SCHEMA"."CROSS_REFERENCES"."FK_NAME" ASC,
>>>>   "INFORMATION_SCHEMA"."CROSS_REFERENCES"."ORDINAL_POSITION" ASC
>>>>
>>>>
>>>> 3. Profiling the above query seems to indicate, that the query results 
>>>> in a cartesian product, even if join criteria might be useful for 
>>>> optimisation:
>>>>
>>>> [image: Inline-Bild 1]
>>>>
>>>>
>>>> http://i.imgur.com/8gh6ld0.png
>>>>
>>>> The time column is not accurate, as it includes profiling side-effects. 
>>>> Also, the above data does not include the complete execution time.
>>>>
>>>> But as you can see, the invocation count goes up sky-high within 
>>>> generateRows(), which leads to a lot of overhead, mostly from within 
>>>> checkIndex(). Is there any way that such queries can be optimised  by 
>>>> adding more selective criteria to avoid cartesian products? Or is this a 
>>>> "bug" in H2's handling of INFORMATION_SCHEMA data?
>>>>
>>>> Also, did I understand this correctly: The INFORMATION_SCHEMA is 
>>>> implemented entirely in Java code, producing ad-hoc in-memory tables. 
>>>> There 
>>>> are no physical tables with actual indexes that can be queried in a much 
>>>> faster fashion...?
>>>>
>>>> Cheers
>>>> Lukas
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "H2 Database" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/h2-database?hl=en.
>>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>>  
>>>>  
>>>>
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups "H2 
Database" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/h2-database?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.

Re: Prohibitive amount of iterations in MetaTable.generateRows() when querying the INFORMATION_SCHEMA in a large database (many tables)

Reply via email to