The suggestion of populating the tables via CREATE TABLE ... AS SELECT 
DISTINCT is a good one so far.  It's nearly a 30% performance improvement 
over the previous method, which cuts about seven days off the expected 
time to finish populating the database.
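In case it's useful to anyone else following the thread, the pattern as I've implemented it boils down to the following (sketched with Python's sqlite3 so it runs standalone; the real schema is larger, and the table and column names here are invented):

```python
import sqlite3

# Sketch of the staging-table approach, using SQLite as a stand-in for H2.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1) Bulk-load everything into an unconstrained, unindexed staging table.
cur.execute("CREATE TABLE tempdoc (doc_id INTEGER, title TEXT)")
cur.executemany(
    "INSERT INTO tempdoc VALUES (?, ?)",
    [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")],
)

# 2) Deduplicate in a single pass instead of checking every row on insert.
cur.execute("CREATE TABLE document AS SELECT DISTINCT doc_id, title FROM tempdoc")

# 3) Only now add constraints/indexes (SQLite can't add a PRIMARY KEY after
#    the fact, so a unique index stands in; H2 has ALTER TABLE ... ADD CONSTRAINT).
cur.execute("CREATE UNIQUE INDEX pk_document ON document(doc_id)")

print(cur.execute("SELECT COUNT(*) FROM document").fetchone()[0])  # -> 3
```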

I'm currently testing whether it will still hit the out-of-memory 
exception, since I'm still performing one WHERE ... NOT IN query.  The 
test is going to take all night, so I'll check back in the morning.  If 
it does fail, I have some ideas for working around the problem.
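For the record, the workaround I'd try first is rewriting the NOT IN as an anti-join, which as I understand it lets the engine stream the comparison instead of materializing the whole subquery result in memory.  Sketched again with sqlite3, with invented names:

```python
import sqlite3

# Anti-join rewrite of a WHERE ... NOT IN query.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tempdoc (doc_id INTEGER)")
cur.execute("CREATE TABLE document (doc_id INTEGER)")
cur.executemany("INSERT INTO tempdoc VALUES (?)", [(1,), (2,), (3,), (4,)])
cur.executemany("INSERT INTO document VALUES (?)", [(2,), (4,)])

# The form that can blow up: the subquery result may be materialized in RAM.
not_in = cur.execute(
    "SELECT doc_id FROM tempdoc "
    "WHERE doc_id NOT IN (SELECT doc_id FROM document)"
).fetchall()

# The anti-join form: keep left-hand rows with no match on the right.
# Caveat: the two forms disagree if document.doc_id can contain NULLs.
anti_join = cur.execute(
    "SELECT t.doc_id FROM tempdoc t "
    "LEFT JOIN document d ON d.doc_id = t.doc_id "
    "WHERE d.doc_id IS NULL"
).fetchall()

assert sorted(not_in) == sorted(anti_join) == [(1,), (3,)]
```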

I also wanted to ask about performance differences across database modes.  
Right now I'm building the database in embedded (file) mode and storing it 
on the same machine that runs the application.  I can't set up a proper 
server to offload database maintenance onto a second machine, but I might 
be able to run an actual MySQL instance on this machine and connect to it.  
Would I reasonably expect a performance improvement from doing that?
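For concreteness, the embedded URL I'm using now is shaped like the first line below.  If it's relevant to the comparison, H2 also has its own server mode, which (as I read the H2 docs) only changes the connection URL; the paths here are placeholders:

```
# Embedded (file) mode: the database runs inside the application's JVM.
jdbc:h2:/path/to/database

# Server (TCP) mode: the application connects to a separately started H2 server.
jdbc:h2:tcp://localhost//path/to/database
```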


On Monday, October 7, 2019 at 8:47:53 AM UTC-4, Tim Fielder wrote:
>
> O(n log n) sounds about right.  Based on the most recent results I get 
> this 
> <https://www.wolframalpha.com/input/?i=quadratic+fit+%7B23%2C908.85%7D%2C%7B69%2C+3584.09%7D%2C%7B115%2C7153.94%7D>
>  
> formula for the expected time to process X files of 10,000 articles each.  
> With an x^2 term of 0.21 that's just above linear, but once you're up to 
> over 4,600 files it's still way too damn long.
>
> I will try the CREATE TABLE ... AS SELECT DISTINCT thing and get back to 
> you.  Thanks.
>
> On Monday, October 7, 2019 at 7:40:01 AM UTC-4, Noel Grandin wrote:
>>
>>
>>
>> On 2019/10/07 1:32 PM, Tim Fielder wrote: 
>>  > The problem with this approach is that since the tables are indexed, 
>>  > the insert time grows quadratically with the size of the table.  As a 
>>  > result I can handle 230,000 articles in about 2 hours, but the full 
>>  > 46.7 million will take at least 300 days. 
>>
>> That should not be the case; insert time should be something like 
>> O(n log n), so I'm not sure why it is so slow for you. 
>>
>>
>>  > 
>>  > In order to defer the application of constraints until after I fully 
>>  > complete parsing, the schema becomes simply: 
>>
>> If you are going to do something like this, then rather do: 
>>
>> (*) insert all rows into tempdoc 
>>
>> (*) CREATE TABLE document AS SELECT DISTINCT .... FROM tempdoc 
>>
>> (*) add constraints to document 
>>
>> and similarly for other tables. 
>>
>>

-- 
You received this message because you are subscribed to the Google Groups "H2 
Database" group.