I updated the "how to update the site" page to be correct once again. Here's the link:
https://cwiki.apache.org/confluence/display/CONNECTORS/Updating+the+Website Thanks again for all your help - it's been very very valuable! Karl On Wed, Nov 17, 2010 at 3:20 AM, Karl Wright <daddy...@gmail.com> wrote: > Your contributions are more than welcome. But in order to submit them > for inclusion in anything other than the wiki, you have to grant > rights to ASF to them. The best way to do that is to do the > following: > > (a) Open a ticket in JIRA (https://issues.apache.org/jira) relating to > crawl performance > (b) Do your writeup and attach it to your ticket either as a patch for > the http site (sources in svn under trunk/site now), or just as a text > file > (c) When you attach it, make sure you select the radio button or > checkbox that grants rights to Apache Software Foundation > > If you want to actually modify the site, I'd suggest doing a whole new > page on performance. The site is built with Apache Forrest, and you > will need to built the trunk version of it for it to work properly. > If you use the shipping 0.8 version it is still good enough to check > most things, but it fails to do images properly. Documents in the > site are under trunk/site/src/documentation/content/xdocs, and are > Forrest xml documents. There are instructions in the wiki for > updating the site, but they include a step of copying the built site > to a publish area and checking it in. You will, of course, not be > able to do that, and it's no longer correct anyhow. You might as well > attach your whole new xml page as the "patch", and I can take it from > there. > > I'll take it from there. If you simply (say) edit the wiki faq page > then yes, your comments need to stay in the wiki forever and cannot be > moved to any shipping publication. But if you add these to the site > they will ship with the product as well, which would be extremely > helpful to people! > > Thanks, > Karl > > On Wed, Nov 17, 2010 at 12:19 AM, Farzad Valad <ho...@farzad.net> wrote: >> Well... It worked! Version 8.4.5 is it. I ripped through 306944 items at >> 57 docs/sec (1:29:26). There are two other systems in my test environment, >> one with solid state drives. I'll upgrade those and let you know the final >> results. Perhaps we can publish these results including yours on the >> manifoldcf site. A question I have what should I increase to improve more, >> there was room in both the CPU and disk activities. >> >> Also noticed from another email chain, talks about the user comments in the >> old wiki. I believe I contributed a lot to the user comments mentioned. >> There are some good faq items, I can compile a list for review and >> inclusion, then you can discard the rest of the comments. Would that help >> solve that problem? >> >> -----Original Message----- >> From: Karl Wright [mailto:daddy...@gmail.com] >> Sent: Tuesday, November 16, 2010 8:43 PM >> To: connectors-user@incubator.apache.org >> Subject: Re: Performance Problems Continue >> >> I've done a number of runs, added documents willy-nilly, and brought >> the document count up to about 50,000. I am still seeing reasonable >> plans for this query: >> >> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE >> EXISTS( >> SELECT 'x' FROM carrydown t1 WHERE parentidhash IN >> ('1C33AEE63094C42C9537A9D94E8 >> CB2B903764190') AND t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND >> t1. >> jobid=1289847478234 AND t1.isnew='B'); >> QUERY PLAN >> >> ---------------------------------------------------------------------------- >> --------------------------------------------------------------- >> Nested Loop Semi Join (cost=0.00..16.63 rows=1 width=127) >> Join Filter: ((t0.dochash)::text = (t1.childidhash)::text) >> -> Index Scan using i1289954442565 on jobqueue t0 >> (cost=0.00..8.35 rows=1 width=127) >> Index Cond: (jobid = 1289847478234::bigint) >> -> Index Scan using i1289954442546 on carrydown t1 >> (cost=0.00..8.27 rows=1 width=98) >> Index Cond: ((t1.jobid = 1289847478234::bigint) AND >> ((t1.parentidhash)::text = >> '1C33AEE63094C42C9537A9D94E8CB2B903764190'::text)) >> Filter: (t1.isnew = 'B'::bpchar) >> (7 rows) >> >> >> So I suggest you upgrade to 8.4.4, and let me know what happens then. >> >> Thanks, >> Karl >> >> >> On Tue, Nov 16, 2010 at 8:15 PM, Farzad Valad <ho...@farzad.net> wrote: >>> Sounds perfect! >>> >>> -----Original Message----- >>> From: Karl Wright [mailto:daddy...@gmail.com] >>> Sent: Tuesday, November 16, 2010 5:12 PM >>> To: connectors-user@incubator.apache.org >>> Subject: Re: Performance Problems Continue >>> >>> Turns out my memory is incorrect - the postgresql version that I've >>> been using is 8.4.4, not 8.3.9. >>> So there are a lot of possibilities here. >>> >>> What I'm going to try is to install the latest 8.4.5 on a different >>> system, and do a large crawl. Then I'll see if the plan is OK. If it >>> is OK, then I'll ask you to upgrade to that version as well, and see >>> how you do. >>> >>> Does this sound reasonable? >>> Karl >>> >>> >>> On Tue, Nov 16, 2010 at 4:23 PM, Karl Wright <daddy...@gmail.com> wrote: >>>> I'm concerned that if I install postgresql 8.3.12, and do not >>>> reproduce your problem, we'll have learned nothing. >>>> I'll look around to see if I have the 8.3.9 installer still around... >>>> >>>> Karl >>>> >>>> >>>> On Tue, Nov 16, 2010 at 2:37 PM, <ho...@farzad.net> wrote: >>>>> Can't find the windows binaries on the web. I found this link, but it >>>>> keeps saying file not found regardless of the mirror server I use. >>>>> >>>>> >>> >> http://wwwmaster.postgresql.org/download/mirrors-ftp/binary/v8.3.9/win32/pos >>> tgresql-8.3.9-1.zip >>>>> >>>>> Seems like the fastest path would be for you to go up to .12, especially >>>>> that is the version anyone can get their hands on. Do you have the >>>>> windows binaries or installer? >>>>> >>>>> Thanks,Farzad. >>>>> >>>>> >>>>> Hi Farzad, >>>>> The index it should be using but isn't is: >>>>> >>>>> public | jobqueue | i1289847375558 | | CREATE UNIQUE >>>>> INDEX i1289847375558 ON jobqueue USING btree (dochash, jobid) >>>>> >>>>> Both the dochash value and the jobid value are specified for the >>>>> jobqueue table in that query, so there's no reason it should not use >>>>> this index. It's either failed to consider it at all, or it has >>>>> concluded that it's more efficient to scan the entire jobqueue table >>>>> rather than using the index for the lookup. >>>>> >>>>> FWIW, I did run some explains on similar queries on my test system >>>>> this morning before I left for work, and the plans for those involved >>>>> something called an "index semi-join" which apparently means it scans >>>>> both the jobqueue index and the carrydown index at the same time. If >>>>> I recall correctly, my version of PostgreSQL was 8.3.9. Since yours >>>>> is 8.3.12 it could well be that something was broken in PostgreSQL >>>>> between releases. If you have time, you might try to see if an >>>>> earlier version of postgresql exhibits the same behavior for you. If >>>>> that works properly, the next step would be to create a ticket against >>>>> PostgreSQL itself - or maybe there already is one. >>>>> >>>>> I expect that any of the many PostgreSQL books out there would have >>>>> enough information in it to bring you up to speed. The skill set for >>>>> optimizing queries is also portable to other databases such as Oracle, >>>>> if you care. >>>>> >>>>> Karl >>>>> >>>>> >>>>> On Tue, Nov 16, 2010 at 12:03 PM, <ho...@farzad.net> wrote: >>>>>> Found a query statement to pull the index data. Is this what you are >>>>>> looking for? On a separate note, I'm a bit over my head in terms of >>>>>> understanding, any recommendation of a site/sites or book to read to >>>>>> catch >>>>>> up with needed db concepts? >>>>>> >>>>>> dbname=> select * from pg_indexes where tablename='jobqueue'; >>>>>> schemaname | tablename | indexname | tablespace | >>>>>> indexdef >>>>>> >>> >> ------------+-----------+----------------+------------+--------------------- >>> ---- >>>>>> ------------------------------------------------------------- >>>>>> public | jobqueue | jobqueue_pkey | | CREATE UNIQUE >>>>>> INDEX jobq >>>>>> ueue_pkey ON jobqueue USING btree (id) >>>>>> public | jobqueue | i1289847375565 | | CREATE INDEX >>>>>> i1289847375 >>>>>> 565 ON jobqueue USING btree (jobid, status) >>>>>> public | jobqueue | i1289847375564 | | CREATE INDEX >>>>>> i1289847375 >>>>>> 564 ON jobqueue USING btree (jobid, isseed) >>>>>> public | jobqueue | i1289847375563 | | CREATE INDEX >>>>>> i1289847375 >>>>>> 563 ON jobqueue USING btree (jobid, dochash, status) >>>>>> public | jobqueue | i1289847375562 | | CREATE INDEX >>>>>> i1289847375 >>>>>> 562 ON jobqueue USING btree (status) >>>>>> public | jobqueue | i1289847375561 | | CREATE INDEX >>>>>> i1289847375 >>>>>> 561 ON jobqueue USING btree (checkaction, checktime, status) >>>>>> public | jobqueue | i1289847375560 | | CREATE INDEX >>>>>> i1289847375 >>>>>> 560 ON jobqueue USING btree (priorityset, status) >>>>>> public | jobqueue | i1289847375559 | | CREATE INDEX >>>>>> i1289847375 >>>>>> 559 ON jobqueue USING btree (docpriority) >>>>>> public | jobqueue | i1289847375558 | | CREATE UNIQUE >>>>>> INDEX i128 >>>>>> 9847375558 ON jobqueue USING btree (dochash, jobid) >>>>>> (9 rows) >>>>>> >>>>>> dbname=> select * from pg_indexes where tablename='carrydown'; >>>>>> schemaname | tablename | indexname | tablespace | >>>>>> indexdef >>>>>> >>>>>> >>> >> ------------+-----------+----------------+------------+--------------------- >>> ---- >>>>>> >>> >> ---------------------------------------------------------------------------- >>> ---- >>>>>> ---------------- >>>>>> public | carrydown | i1289847375547 | | CREATE INDEX >>>>>> i1289847375 >>>>>> 547 ON carrydown USING btree (jobid, parentidhash) >>>>>> public | carrydown | i1289847375546 | | CREATE INDEX >>>>>> i1289847375 >>>>>> 546 ON carrydown USING btree (jobid, childidhash, dataname) >>>>>> public | carrydown | i1289847375545 | | CREATE INDEX >>>>>> i1289847375 >>>>>> 545 ON carrydown USING btree (jobid, childidhash, isnew) >>>>>> public | carrydown | i1289847375544 | | CREATE UNIQUE >>>>>> INDEX i128 >>>>>> 9847375544 ON carrydown USING btree (jobid, parentidhash, childidhash, >>>>>> dataname, >>>>>> datavaluehash) >>>>>> (4 rows) >>>>>> >>>>>> dbname=> >>>>>> >>>>>> Farzad. >>>>>> >>>>>> >>>>>> psql usually displays the indexes for a table when you dt the table. >>>>>> These are not appearing in your postgresql version. But it's clear >>>>>> they are there from the plan. >>>>>> >>>>>> The jobqueue table is being sequentially scanned *despite* the >>>>>> probable fact that there's a reasonable, usable index on it that it >>>>>> should be using. In short, this looks like a PostgreSQL bug to me. >>>>>> >>>>>> I'm away from my test machine right now, but I'd love to see the >>>>>> indexes on the jobqueue table that you have. There may be a new >>>>>> command in psql that my version of PostgreSQL doesn't have that prints >>>>>> the indexes for a table, if they took it out of dt. >>>>>> >>>>>> If the proper index is there, then all we can do is try various forms >>>>>> of the query to see if we can dodge the PostgreSQL bug. >>>>>> >>>>>> Karl >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Nov 16, 2010 at 9:31 AM, Farzad Valad <ho...@farzad.net> wrote: >>>>>>> Here is the output: >>>>>>> >>>>>>> >>>>>>> C:\Program Files (x86)\PostgreSQL\8.3\bin>psql -U manifoldcf dbname >>>>>>> Password for user manifoldcf: >>>>>>> Welcome to psql 8.3.12, the PostgreSQL interactive terminal. >>>>>>> >>>>>>> Type: \copyright for distribution terms >>>>>>> \h for help with SQL commands >>>>>>> \? for help with psql commands >>>>>>> \g or terminate with semicolon to execute query >>>>>>> \q to quit >>>>>>> >>>>>>> Warning: Console code page (437) differs from Windows code page (1252) >>>>>>> 8-bit characters might not work correctly. See psql reference >>>>>>> page "Notes for Windows users" for details. >>>>>>> >>>>>>> dbname=> \dt jobqueue; >>>>>>> List of relations >>>>>>> Schema | Name | Type | Owner >>>>>>> --------+----------+-------+------------ >>>>>>> public | jobqueue | table | manifoldcf >>>>>>> (1 row) >>>>>>> >>>>>>> dbname=> \dt carrydown; >>>>>>> List of relations >>>>>>> Schema | Name | Type | Owner >>>>>>> --------+-----------+-------+------------ >>>>>>> public | carrydown | table | manifoldcf >>>>>>> (1 row) >>>>>>> >>>>>>> dbname=> select count(*) from carrydown; >>>>>>> count >>>>>>> ------- >>>>>>> 0 >>>>>>> (1 row) >>>>>>> >>>>>>> dbname=> select distinct jobid from carrydown; >>>>>>> jobid >>>>>>> ------- >>>>>>> (0 rows) >>>>>>> >>>>>>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 >>>>>>> WHERE >>>>>>> EXISTS( >>>>>>> SELECT 'x' FROM carrydown t1 WHERE parentidhash IN >>>>>>> dbname(> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND >>>>>>> t1.childidhash=t0.doch >>>>>>> ash AND t0.jobid=1289847478234 AND >>>>>>> dbname(> t1.jobid=1289847478234 AND t1.isnew='B'); >>>>>>> >>>>>>> QUERY PLAN >>>>>>> >>>>>>> >>> >> ---------------------------------------------------------------------------- >>>>>>> ---- >>>>>>> >>> >> ---------------------------------------------------------------------------- >>>>>>> ---- >>>>>>> -------------------- >>>>>>> Seq Scan on jobqueue t0 (cost=0.00..2000133.07 rows=118994 >> width=158) >>>>>>> Filter: (subplan) >>>>>>> SubPlan >>>>>>> -> Result (cost=0.00..8.28 rows=1 width=0) >>>>>>> One-Time Filter: ($1 = 1289847478234::bigint) >>>>>>> -> Index Scan using i1289847375544 on carrydown t1 >>>>>>> (cost=0.00..8.28 >>>>>>> rows=1 width=0) >>>>>>> Index Cond: ((jobid = 1289847478234::bigint) AND >>>>>>> ((parentidhash >>>>>>> )::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND >>>>>>> ((childidhash):: >>>>>>> text = ($0)::text)) >>>>>>> Filter: (isnew = 'B'::bpchar) >>>>>>> (8 rows) >>>>>>> >>>>>>> dbname=> analyze carrydown; >>>>>>> ANALYZE >>>>>>> dbname=> analyze jobqueue; >>>>>>> ANALYZE >>>>>>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 >>>>>>> WHERE >>>>>>> dbname-> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN >>>>>>> dbname(> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND >>>>>>> dbname(> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND >>>>>>> dbname(> t1.jobid=1289847478234 AND t1.isnew='B'); >>>>>>> >>>>>>> QUERY PLAN >>>>>>> >>>>>>> >>> >> ---------------------------------------------------------------------------- >>>>>>> ---- >>>>>>> >>> >> ---------------------------------------------------------------------------- >>>>>>> ---- >>>>>>> -------------------- >>>>>>> Seq Scan on jobqueue t0 (cost=0.00..2091946.86 rows=124532 >> width=158) >>>>>>> Filter: (subplan) >>>>>>> SubPlan >>>>>>> -> Result (cost=0.00..8.28 rows=1 width=0) >>>>>>> One-Time Filter: ($1 = 1289847478234::bigint) >>>>>>> -> Index Scan using i1289847375544 on carrydown t1 >>>>>>> (cost=0.00..8.28 >>>>>>> rows=1 width=0) >>>>>>> Index Cond: ((jobid = 1289847478234::bigint) AND >>>>>>> ((parentidhash >>>>>>> )::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND >>>>>>> ((childidhash):: >>>>>>> text = ($0)::text)) >>>>>>> Filter: (isnew = 'B'::bpchar) >>>>>>> (8 rows) >>>>>>> >>>>>>> dbname=> >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Karl Wright [mailto:daddy...@gmail.com] >>>>>>> Sent: Tuesday, November 16, 2010 4:21 AM >>>>>>> To: connectors-user@incubator.apache.org >>>>>>> Subject: Re: Performance Problems Continue >>>>>>> >>>>>>> As I suspected, the plan it is generating is crap: >>>>>>> >>>>>>> Found a query that took more than a minute (62494 ms): [SELECT >>>>>>> t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' >>>>>>> FROM carrydown t1 WHERE parentidhash IN (?) AND >>>>>>> t1.childidhash=t0.dochash AND t0.jobid=? AND t1.jobid=? AND >>>>>>> t1.isnew=?)] >>>>>>> Parameter 0: 'B0C2E57717919821A7C32D9F92F9F297155B727F' >>>>>>> Parameter 1: '1289847478234' >>>>>>> Parameter 2: '1289847478234' >>>>>>> Parameter 3: 'B' >>>>>>> Plan: Seq Scan on jobqueue t0 (cost=0.00..1332448.02 rows=79465 >>>>>>> width=152) >>>>>>> Plan: Filter: (subplan) >>>>>>> Plan: SubPlan >>>>>>> Plan: -> Result (cost=0.00..8.28 rows=1 width=0) >>>>>>> Plan: One-Time Filter: ($1 = 1289847478234::bigint) >>>>>>> Plan: -> Index Scan using i1289847375544 on carrydown t1 >>>>>>> (cost=0.00..8.28 rows=1 width=0) >>>>>>> Plan: Index Cond: ((jobid = 1289847478234::bigint) >>>>>>> AND ((parentidhash)::text = >>>>>>> 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND >>>>>>> ((childidhash)::text = ($0)::text)) >>>>>>> Plan: Filter: (isnew = 'B'::bpchar) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ... which means that it is doing a sequential scan on the jobqueue >>>>>>> table (large) and an index scan for each row on the carrydown table >>>>>>> (which should be empty), rather than the other way round. There is a >>>>>>> carrydown index that is should be using instead, which it is >>>>>>> completely ignoring, which is all the more weird. >>>>>>> >>>>>>> Can you do the following: >>>>>>> >>>>>>> (1) Under the postgresql bin directory, run psql -U manifoldcf dbname >>>>>>> (2) If the defaults are in place, password should be "local_pg_passwd" >>>>>>> (3) Run the following commands, and cut/paste the answers for me: >>>>>>> >>>>>>> \dt jobqueue; >>>>>>> \dt carrydown; >>>>>>> select count(*) from carrydown; >>>>>>> select distinct jobid from carrydown; >>>>>>> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE >>>>>>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN >>>>>>> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND >>>>>>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND >>>>>>> t1.jobid=1289847478234 AND t1.isnew='B'); >>>>>>> >>>>>>> Once you have done that, please do the following: >>>>>>> >>>>>>> analyze carrydown; >>>>>>> analyze jobqueue; >>>>>>> >>>>>>> Then, try the explain again. and send me that output too: >>>>>>> >>>>>>> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE >>>>>>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN >>>>>>> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND >>>>>>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND >>>>>>> t1.jobid=1289847478234 AND t1.isnew='B'); >>>>>>> >>>>>>> >>>>>>> I've tried the same commands on my system, but they show reasonable >>>>>>> plans. PostgreSQL, like all databases, develops its plans based on >>>>>>> statistics it keeps for the tables in question. The analyze command >>>>>>> tells it to update those statistics, which in theory should make >>>>>>> PostgreSQL immediately thereafter use a correct plan. However, there >>>>>>> may be great sensitivity to incorrect statistics for some sorts of >>>>>>> query, depending on the conditions, so a small number of accumulated >>>>>>> changes can cause it to start to make huge errors. I am trying to >>>>>>> determine if that's what is happening, and what those conditions are >>>>>>> in your case. >>>>>>> >>>>>>> Thanks, >>>>>>> Karl >>>>>>> >>>>>>> >>>>>>> On Mon, Nov 15, 2010 at 9:44 PM, Farzad Valad <ho...@farzad.net> >> wrote: >>>>>>>> 1. Using 8.3.12, gave up on 9.x >>>>>>>> 2. Yes, I do. Attached the log file for this crawl >>>>>>>> 3. Using Null Output and File System input, not using RSS >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >> >> >