Re: Performance Problems Continue

Karl Wright Wed, 17 Nov 2010 01:02:03 -0800

I updated the "how to update the site" page to be correct once again.
Here's the link:


https://cwiki.apache.org/confluence/display/CONNECTORS/Updating+the+Website

Thanks again for all your help - it's been very very valuable!

Karl

On Wed, Nov 17, 2010 at 3:20 AM, Karl Wright <daddy...@gmail.com> wrote:
> Your contributions are more than welcome.  But in order to submit them
> for inclusion in anything other than the wiki, you have to grant the
> ASF rights to them.  The best way to do that is to do the
> following:
>
> (a) Open a ticket in JIRA (https://issues.apache.org/jira) relating to
> crawl performance
> (b) Do your writeup and attach it to your ticket either as a patch for
> the http site (sources in svn under trunk/site now), or just as a text
> file
> (c) When you attach it, make sure you select the radio button or
> checkbox that grants rights to Apache Software Foundation
>
> If you want to actually modify the site, I'd suggest doing a whole new
> page on performance.  The site is built with Apache Forrest, and you
> will need to build the trunk version of it for it to work properly.
> If you use the shipping 0.8 version it is still good enough to check
> most things, but it fails to do images properly.  Documents in the
> site are under trunk/site/src/documentation/content/xdocs, and are
> Forrest xml documents.  There are instructions in the wiki for
> updating the site, but they include a step of copying the built site
> to a publish area and checking it in.  You will, of course, not be
> able to do that, and it's no longer correct anyhow.  You might as well
> attach your whole new xml page as the "patch", and I can take it from
> there.
>
> If you simply (say) edit the wiki faq page
> then yes, your comments need to stay in the wiki forever and cannot be
> moved to any shipping publication.  But if you add these to the site
> they will ship with the product as well, which would be extremely
> helpful to people!
>
> Thanks,
> Karl
>
> On Wed, Nov 17, 2010 at 12:19 AM, Farzad Valad <ho...@farzad.net> wrote:
>> Well... It worked!  Version 8.4.5 is it.  I ripped through 306944 items at
>> 57 docs/sec (1:29:26).  There are two other systems in my test environment,
>> one with solid state drives.  I'll upgrade those and let you know the final
>> results.  Perhaps we can publish these results including yours on the
>> manifoldcf site.  One question: what should I increase to improve throughput
>> further?  There was headroom in both CPU and disk activity.
>>
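As a quick sanity check on the figures above, the reported rate can be recomputed from the item count and the elapsed time. This is only an illustrative sketch; the helper names are made up here and are not part of ManifoldCF.

```python
def parse_hms(text: str) -> int:
    """Convert an 'H:MM:SS' duration into whole seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def docs_per_second(items: int, elapsed: str) -> float:
    """Average throughput over the whole crawl."""
    return items / parse_hms(elapsed)


# Figures quoted in the message: 306944 items in 1:29:26.
rate = docs_per_second(306944, "1:29:26")
print(f"{rate:.1f} docs/sec")  # prints "57.2 docs/sec"
```

The result agrees with the 57 docs/sec quoted in the message.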
>> I also noticed, in another email chain, talk about the user comments in the
>> old wiki.  I believe I contributed a lot to the user comments mentioned.
>> There are some good faq items; I can compile a list for review and
>> inclusion, then you can discard the rest of the comments.  Would that help
>> solve that problem?
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:daddy...@gmail.com]
>> Sent: Tuesday, November 16, 2010 8:43 PM
>> To: connectors-user@incubator.apache.org
>> Subject: Re: Performance Problems Continue
>>
>> I've done a number of runs, added documents willy-nilly, and brought
>> the document count up to about 50,000.  I am still seeing reasonable
>> plans for this query:
>>
>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>> ('1C33AEE63094C42C9537A9D94E8CB2B903764190') AND
>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>> t1.jobid=1289847478234 AND t1.isnew='B');
>>                               QUERY PLAN
>> ---------------------------------------------------------------------------
>>  Nested Loop Semi Join  (cost=0.00..16.63 rows=1 width=127)
>>   Join Filter: ((t0.dochash)::text = (t1.childidhash)::text)
>>   ->  Index Scan using i1289954442565 on jobqueue t0  (cost=0.00..8.35 rows=1 width=127)
>>         Index Cond: (jobid = 1289847478234::bigint)
>>   ->  Index Scan using i1289954442546 on carrydown t1  (cost=0.00..8.27 rows=1 width=98)
>>         Index Cond: ((t1.jobid = 1289847478234::bigint) AND ((t1.parentidhash)::text = '1C33AEE63094C42C9537A9D94E8CB2B903764190'::text))
>>         Filter: (t1.isnew = 'B'::bpchar)
>> (7 rows)
>>
>>
>> So I suggest you upgrade to 8.4.4, and let me know what happens then.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Nov 16, 2010 at 8:15 PM, Farzad Valad <ho...@farzad.net> wrote:
>>> Sounds perfect!
>>>
>>> -----Original Message-----
>>> From: Karl Wright [mailto:daddy...@gmail.com]
>>> Sent: Tuesday, November 16, 2010 5:12 PM
>>> To: connectors-user@incubator.apache.org
>>> Subject: Re: Performance Problems Continue
>>>
>>> Turns out my memory is incorrect - the postgresql version that I've
>>> been using is 8.4.4, not 8.3.9.
>>> So there are a lot of possibilities here.
>>>
>>> What I'm going to try is to install the latest 8.4.5 on a different
>>> system, and do a large crawl.  Then I'll see if the plan is OK.  If it
>>> is OK, then I'll ask you to upgrade to that version as well, and see
>>> how you do.
>>>
>>> Does this sound reasonable?
>>> Karl
>>>
>>>
>>> On Tue, Nov 16, 2010 at 4:23 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>> I'm concerned that if I install postgresql 8.3.12, and do not
>>>> reproduce your problem, we'll have learned nothing.
>>>> I'll look around to see if I have the 8.3.9 installer still around...
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Nov 16, 2010 at 2:37 PM,  <ho...@farzad.net> wrote:
>>>>> Can't find the windows binaries on the web.  I found this link, but it
>>>>> keeps saying file not found regardless of the mirror server I use.
>>>>>
>>>>>
>>>>> http://wwwmaster.postgresql.org/download/mirrors-ftp/binary/v8.3.9/win32/postgresql-8.3.9-1.zip
>>>>>
>>>>> Seems like the fastest path would be for you to go up to .12, especially
>>>>> since that is the version anyone can get their hands on.  Do you have the
>>>>> windows binaries or installer?
>>>>>
>>>>> Thanks, Farzad.
>>>>>
>>>>>
>>>>> Hi Farzad,
>>>>> The index it should be using but isn't is:
>>>>>
>>>>> public     | jobqueue  | i1289847375558 |            | CREATE UNIQUE
>>>>> INDEX i1289847375558 ON jobqueue USING btree (dochash, jobid)
>>>>>
>>>>> Both the dochash value and the jobid value are specified for the
>>>>> jobqueue table in that query, so there's no reason it should not use
>>>>> this index.  It's either failed to consider it at all, or it has
>>>>> concluded that it's more efficient to scan the entire jobqueue table
>>>>> rather than using the index for the lookup.
>>>>>
>>>>> FWIW, I did run some explains on similar queries on my test system
>>>>> this morning before I left for work, and the plans for those involved
>>>>> something called an "index semi-join" which apparently means it scans
>>>>> both the jobqueue index and the carrydown index at the same time.  If
>>>>> I recall correctly, my version of PostgreSQL was 8.3.9.  Since yours
>>>>> is 8.3.12 it could well be that something was broken in PostgreSQL
>>>>> between releases.  If you have time, you might try to see if an
>>>>> earlier version of postgresql exhibits the same behavior for you.  If
>>>>> that works properly, the next step would be to create a ticket against
>>>>> PostgreSQL itself - or maybe there already is one.
>>>>>
>>>>> I expect that any of the many PostgreSQL books out there would have
>>>>> enough information in it to bring you up to speed.  The skill set for
>>>>> optimizing queries is also portable to other databases such as Oracle,
>>>>> if you care.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Nov 16, 2010 at 12:03 PM,  <ho...@farzad.net> wrote:
>>>>>> Found a query statement to pull the index data.  Is this what you are
>>>>>> looking for?  On a separate note, I'm a bit over my head in terms of
>>>>>> understanding; any recommendation of a site or book to read to catch
>>>>>> up on the needed db concepts?
>>>>>>
>>>>>> dbname=> select * from pg_indexes where tablename='jobqueue';
>>>>>>  schemaname | tablename |   indexname    | tablespace | indexdef
>>>>>> ------------+-----------+----------------+------------+----------
>>>>>>  public     | jobqueue  | jobqueue_pkey  |            | CREATE UNIQUE INDEX jobqueue_pkey ON jobqueue USING btree (id)
>>>>>>  public     | jobqueue  | i1289847375565 |            | CREATE INDEX i1289847375565 ON jobqueue USING btree (jobid, status)
>>>>>>  public     | jobqueue  | i1289847375564 |            | CREATE INDEX i1289847375564 ON jobqueue USING btree (jobid, isseed)
>>>>>>  public     | jobqueue  | i1289847375563 |            | CREATE INDEX i1289847375563 ON jobqueue USING btree (jobid, dochash, status)
>>>>>>  public     | jobqueue  | i1289847375562 |            | CREATE INDEX i1289847375562 ON jobqueue USING btree (status)
>>>>>>  public     | jobqueue  | i1289847375561 |            | CREATE INDEX i1289847375561 ON jobqueue USING btree (checkaction, checktime, status)
>>>>>>  public     | jobqueue  | i1289847375560 |            | CREATE INDEX i1289847375560 ON jobqueue USING btree (priorityset, status)
>>>>>>  public     | jobqueue  | i1289847375559 |            | CREATE INDEX i1289847375559 ON jobqueue USING btree (docpriority)
>>>>>>  public     | jobqueue  | i1289847375558 |            | CREATE UNIQUE INDEX i1289847375558 ON jobqueue USING btree (dochash, jobid)
>>>>>> (9 rows)
>>>>>>
>>>>>> dbname=> select * from pg_indexes where tablename='carrydown';
>>>>>>  schemaname | tablename |   indexname    | tablespace | indexdef
>>>>>> ------------+-----------+----------------+------------+----------
>>>>>>  public     | carrydown | i1289847375547 |            | CREATE INDEX i1289847375547 ON carrydown USING btree (jobid, parentidhash)
>>>>>>  public     | carrydown | i1289847375546 |            | CREATE INDEX i1289847375546 ON carrydown USING btree (jobid, childidhash, dataname)
>>>>>>  public     | carrydown | i1289847375545 |            | CREATE INDEX i1289847375545 ON carrydown USING btree (jobid, childidhash, isnew)
>>>>>>  public     | carrydown | i1289847375544 |            | CREATE UNIQUE INDEX i1289847375544 ON carrydown USING btree (jobid, parentidhash, childidhash, dataname, datavaluehash)
>>>>>> (4 rows)
>>>>>>
>>>>>> dbname=>
>>>>>>
>>>>>> Farzad.
>>>>>>
>>>>>>
>>>>>> psql usually displays the indexes for a table when you \dt the table.
>>>>>> These are not appearing in your postgresql version.  But it's clear
>>>>>> they are there from the plan.
>>>>>>
>>>>>> The jobqueue table is being sequentially scanned *despite* the
>>>>>> probable fact that there's a reasonable, usable index on it that it
>>>>>> should be using.  In short, this looks like a PostgreSQL bug to me.
>>>>>>
>>>>>> I'm away from my test machine right now, but I'd love to see the
>>>>>> indexes on the jobqueue table that you have.  There may be a new
>>>>>> command in psql that my version of PostgreSQL doesn't have that prints
>>>>>> the indexes for a table, if they took it out of \dt.
>>>>>>
>>>>>> If the proper index is there, then all we can do is try various forms
>>>>>> of the query to see if we can dodge the PostgreSQL bug.
>>>>>>
>>>>>> Karl
>>>>>>
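One way to explore "various forms of the query" is to confirm first that a rewritten form returns the same rows as the original. The sketch below does that with Python's built-in sqlite3 module and a tiny made-up data set. The JOIN rewrite, the index names, and the sample rows are assumptions for illustration only, and SQLite says nothing about how PostgreSQL would plan either form.

```python
import sqlite3

# Original shape of the query from the thread (EXISTS with a correlated subquery).
EXISTS_FORM = """
SELECT t0.id, t0.dochash, t0.docid FROM jobqueue t0 WHERE EXISTS(
  SELECT 'x' FROM carrydown t1 WHERE t1.parentidhash IN (?)
  AND t1.childidhash = t0.dochash AND t0.jobid = ? AND t1.jobid = ?
  AND t1.isnew = ?)
"""

# Hypothetical rewrite as a join; DISTINCT removes duplicates that arise
# when several carrydown rows point at the same jobqueue row.
JOIN_FORM = """
SELECT DISTINCT t0.id, t0.dochash, t0.docid
FROM carrydown t1 JOIN jobqueue t0
  ON t0.dochash = t1.childidhash AND t0.jobid = t1.jobid
WHERE t1.parentidhash IN (?) AND t1.jobid = ? AND t1.isnew = ?
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with simplified, made-up tables and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
      CREATE TABLE jobqueue (id INTEGER PRIMARY KEY, jobid INTEGER,
                             dochash TEXT, docid TEXT);
      CREATE UNIQUE INDEX jq_dochash_jobid ON jobqueue (dochash, jobid);
      CREATE TABLE carrydown (jobid INTEGER, parentidhash TEXT,
                              childidhash TEXT, isnew TEXT);
      CREATE INDEX cd_job_parent ON carrydown (jobid, parentidhash);
    """)
    conn.executemany("INSERT INTO jobqueue VALUES (?, ?, ?, ?)", [
        (1, 10, "C1", "doc-1"), (2, 10, "C2", "doc-2"), (3, 11, "C1", "doc-3")])
    conn.executemany("INSERT INTO carrydown VALUES (?, ?, ?, ?)", [
        (10, "P1", "C1", "B"), (10, "P1", "C1", "B"),   # two rows, same child
        (10, "P1", "C2", "N"), (11, "P1", "C1", "B")])  # wrong isnew / other job
    return conn


def run_both(conn, parent, jobid, isnew):
    """Run both forms and return their sorted result rows."""
    exists_rows = sorted(conn.execute(EXISTS_FORM, (parent, jobid, jobid, isnew)))
    join_rows = sorted(conn.execute(JOIN_FORM, (parent, jobid, isnew)))
    return exists_rows, join_rows


exists_rows, join_rows = run_both(build_demo_db(), "P1", 10, "B")
print(exists_rows == join_rows, exists_rows)
```

If the two forms agree on representative data, the rewritten form can then be run through explain on the real PostgreSQL database to see whether the planner picks the index.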
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 16, 2010 at 9:31 AM, Farzad Valad <ho...@farzad.net> wrote:
>>>>>>> Here is the output:
>>>>>>>
>>>>>>>
>>>>>>> C:\Program Files (x86)\PostgreSQL\8.3\bin>psql -U manifoldcf dbname
>>>>>>> Password for user manifoldcf:
>>>>>>> Welcome to psql 8.3.12, the PostgreSQL interactive terminal.
>>>>>>>
>>>>>>> Type:  \copyright for distribution terms
>>>>>>>       \h for help with SQL commands
>>>>>>>       \? for help with psql commands
>>>>>>>       \g or terminate with semicolon to execute query
>>>>>>>       \q to quit
>>>>>>>
>>>>>>> Warning: Console code page (437) differs from Windows code page (1252)
>>>>>>>         8-bit characters might not work correctly. See psql reference
>>>>>>>         page "Notes for Windows users" for details.
>>>>>>>
>>>>>>> dbname=> \dt jobqueue;
>>>>>>>           List of relations
>>>>>>>  Schema |   Name   | Type  |   Owner
>>>>>>> --------+----------+-------+------------
>>>>>>>  public | jobqueue | table | manifoldcf
>>>>>>> (1 row)
>>>>>>>
>>>>>>> dbname=> \dt carrydown;
>>>>>>>            List of relations
>>>>>>>  Schema |   Name    | Type  |   Owner
>>>>>>> --------+-----------+-------+------------
>>>>>>>  public | carrydown | table | manifoldcf
>>>>>>> (1 row)
>>>>>>>
>>>>>>> dbname=> select count(*) from carrydown;
>>>>>>>  count
>>>>>>> -------
>>>>>>>     0
>>>>>>> (1 row)
>>>>>>>
>>>>>>> dbname=> select distinct jobid from carrydown;
>>>>>>>  jobid
>>>>>>> -------
>>>>>>> (0 rows)
>>>>>>>
>>>>>>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0
>>>>>>> WHERE
>>>>>>> dbname-> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>>>> dbname(> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>>>> dbname(> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>>>>>> dbname(> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>>>
>>>>>>>                               QUERY PLAN
>>>>>>> ---------------------------------------------------------------------------
>>>>>>>  Seq Scan on jobqueue t0  (cost=0.00..2000133.07 rows=118994 width=158)
>>>>>>>   Filter: (subplan)
>>>>>>>   SubPlan
>>>>>>>     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>>>>>           One-Time Filter: ($1 = 1289847478234::bigint)
>>>>>>>           ->  Index Scan using i1289847375544 on carrydown t1  (cost=0.00..8.28 rows=1 width=0)
>>>>>>>                 Index Cond: ((jobid = 1289847478234::bigint) AND ((parentidhash)::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND ((childidhash)::text = ($0)::text))
>>>>>>>                 Filter: (isnew = 'B'::bpchar)
>>>>>>> (8 rows)
>>>>>>>
>>>>>>> dbname=> analyze carrydown;
>>>>>>> ANALYZE
>>>>>>> dbname=> analyze jobqueue;
>>>>>>> ANALYZE
>>>>>>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0
>>>>>>> WHERE
>>>>>>> dbname-> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>>>> dbname(> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>>>> dbname(> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>>>>>> dbname(> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>>>
>>>>>>>                               QUERY PLAN
>>>>>>> ---------------------------------------------------------------------------
>>>>>>>  Seq Scan on jobqueue t0  (cost=0.00..2091946.86 rows=124532 width=158)
>>>>>>>   Filter: (subplan)
>>>>>>>   SubPlan
>>>>>>>     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>>>>>           One-Time Filter: ($1 = 1289847478234::bigint)
>>>>>>>           ->  Index Scan using i1289847375544 on carrydown t1  (cost=0.00..8.28 rows=1 width=0)
>>>>>>>                 Index Cond: ((jobid = 1289847478234::bigint) AND ((parentidhash)::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND ((childidhash)::text = ($0)::text))
>>>>>>>                 Filter: (isnew = 'B'::bpchar)
>>>>>>> (8 rows)
>>>>>>>
>>>>>>> dbname=>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Karl Wright [mailto:daddy...@gmail.com]
>>>>>>> Sent: Tuesday, November 16, 2010 4:21 AM
>>>>>>> To: connectors-user@incubator.apache.org
>>>>>>> Subject: Re: Performance Problems Continue
>>>>>>>
>>>>>>> As I suspected, the plan it is generating is crap:
>>>>>>>
>>>>>>> Found a query that took more than a minute (62494 ms): [SELECT
>>>>>>> t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x'
>>>>>>> FROM carrydown t1 WHERE parentidhash IN (?) AND
>>>>>>> t1.childidhash=t0.dochash AND t0.jobid=? AND t1.jobid=? AND
>>>>>>> t1.isnew=?)]
>>>>>>>  Parameter 0: 'B0C2E57717919821A7C32D9F92F9F297155B727F'
>>>>>>>  Parameter 1: '1289847478234'
>>>>>>>  Parameter 2: '1289847478234'
>>>>>>>  Parameter 3: 'B'
>>>>>>>  Plan: Seq Scan on jobqueue t0  (cost=0.00..1332448.02 rows=79465 width=152)
>>>>>>>  Plan:   Filter: (subplan)
>>>>>>>  Plan:   SubPlan
>>>>>>>  Plan:     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>>>>>  Plan:           One-Time Filter: ($1 = 1289847478234::bigint)
>>>>>>>  Plan:           ->  Index Scan using i1289847375544 on carrydown t1  (cost=0.00..8.28 rows=1 width=0)
>>>>>>>  Plan:                 Index Cond: ((jobid = 1289847478234::bigint) AND ((parentidhash)::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND ((childidhash)::text = ($0)::text))
>>>>>>>  Plan:                 Filter: (isnew = 'B'::bpchar)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ... which means that it is doing a sequential scan on the jobqueue
>>>>>>> table (large) and an index scan for each row on the carrydown table
>>>>>>> (which should be empty), rather than the other way round.  There is a
>>>>>>> carrydown index that it should be using instead, which it is
>>>>>>> completely ignoring, which is all the more weird.
>>>>>>>
>>>>>>> Can you do the following:
>>>>>>>
>>>>>>> (1) Under the postgresql bin directory, run psql -U manifoldcf dbname
>>>>>>> (2) If the defaults are in place, password should be "local_pg_passwd"
>>>>>>> (3) Run the following commands, and cut/paste the answers for me:
>>>>>>>
>>>>>>> \dt jobqueue;
>>>>>>> \dt carrydown;
>>>>>>> select count(*) from carrydown;
>>>>>>> select distinct jobid from carrydown;
>>>>>>> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
>>>>>>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>>>> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>>>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>>>>>> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>>>
>>>>>>> Once you have done that, please do the following:
>>>>>>>
>>>>>>> analyze carrydown;
>>>>>>> analyze jobqueue;
>>>>>>>
>>>>>>> Then try the explain again, and send me that output too:
>>>>>>>
>>>>>>> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
>>>>>>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>>>> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>>>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>>>>>> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>>>
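The explain statements above are the logged query with its four parameters written in by hand. A minimal sketch of that substitution follows, assuming the parameters are already valid SQL literals and that no '?' appears inside a string literal. inline_parameters is a hypothetical helper for diagnostics only, not something ManifoldCF provides, and it should never be used on untrusted input.

```python
from typing import Sequence

# Query text as it appears in the slow-query log, placeholders intact.
LOGGED_QUERY = (
    "SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS("
    "SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND "
    "t1.childidhash=t0.dochash AND t0.jobid=? AND t1.jobid=? AND t1.isnew=?)"
)


def inline_parameters(template: str, literals: Sequence[str]) -> str:
    """Replace each '?' in order with the matching, already-quoted literal."""
    pieces = template.split("?")
    if len(pieces) - 1 != len(literals):
        raise ValueError("placeholder count does not match parameter count")
    result = [pieces[0]]
    for literal, tail in zip(literals, pieces[1:]):
        result.append(literal)
        result.append(tail)
    return "".join(result)


explain_sql = "explain " + inline_parameters(
    LOGGED_QUERY,
    ["'B0C2E57717919821A7C32D9F92F9F297155B727F'",
     "1289847478234", "1289847478234", "'B'"],
) + ";"
print(explain_sql)
```

The printed statement can be pasted into psql as-is. The job ids are written as bare numbers because jobid is compared against a bigint in the plans shown in this thread.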
>>>>>>>
>>>>>>> I've tried the same commands on my system, but they show reasonable
>>>>>>> plans.  PostgreSQL, like all databases, develops its plans based on
>>>>>>> statistics it keeps for the tables in question.  The analyze command
>>>>>>> tells it to update those statistics, which in theory should make
>>>>>>> PostgreSQL immediately thereafter use a correct plan.  However, there
>>>>>>> may be great sensitivity to incorrect statistics for some sorts of
>>>>>>> query, depending on the conditions, so a small number of accumulated
>>>>>>> changes can cause it to start to make huge errors.  I am trying to
>>>>>>> determine if that's what is happening, and what those conditions are
>>>>>>> in your case.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 15, 2010 at 9:44 PM, Farzad Valad <ho...@farzad.net> wrote:
>>>>>>>> 1. Using 8.3.12, gave up on 9.x
>>>>>>>> 2. Yes, I do.  Attached the log file for this crawl
>>>>>>>> 3. Using Null Output and File System input, not using RSS
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>
