Incubator PMC/Board report for Dec 2011 ([ppmc])

2011-12-01 Thread Marvin


Dear podling,

This email was sent by an automated system on behalf of the Apache Incubator
PMC. It is an initial reminder to give you plenty of time to prepare your
quarterly board report.

The board meeting is scheduled for Wed, 21 December 2011, 10:00:00 PST. The
report for your podling will form a part of the Incubator PMC report. The
Incubator PMC requires your report to be submitted 2 weeks before the board
meeting, to allow sufficient time for review and submission (Wed, Dec 7th).

Please submit your report with sufficient time to allow the incubator PMC, and
subsequently board members to review and digest. Again, the very latest you
should submit your report is 2 weeks prior to the board meeting.

Thanks,

The Apache Incubator PMC

Submitting your Report
----------------------

Your report should contain the following:

 * Your project name
 * A brief description of your project, which assumes no knowledge of the
   project or necessarily of its field
 * A list of the three most important issues to address in the move towards
   graduation
 * Any issues that the Incubator PMC or ASF Board might wish/need to be aware of
 * How has the community developed since the last report
 * How has the project developed since the last report
 
This should be appended to the Incubator Wiki page at:

  http://wiki.apache.org/incubator/December2011

Note: This is manually populated. You may need to wait a little before this
  page is created from a template.

Mentors
-------
Mentors should review reports for their project(s) and sign them off on the 
Incubator wiki page. Signing off reports shows that you are following the 
project - projects that are not signed may raise alarms for the Incubator PMC.

Incubator PMC



Re: Incubator PMC/Board report for Dec 2011 ([ppmc])

2011-12-01 Thread Karl Wright
Proposed text for this report:

ManifoldCF

--Description--

ManifoldCF is an incremental crawler framework and set of connectors
designed to pull documents from various kinds of repositories into
search engine indexes or other targets. The current bevy of repository
connectors includes Documentum (EMC), FileNet (IBM), LiveLink
(OpenText), Meridio (Autonomy), SharePoint (Microsoft), JDBC, CIFS
file systems, CMIS repositories, RSS feeds, and web content. Output
support includes Solr, MetaCarta GTS, and OpenSearchServer.
ManifoldCF also provides components for individual document security
within a target search engine, so that repository security access
conventions can be enforced in the search results.

ManifoldCF has been in incubation since January, 2010. It was
originally a planned subproject of Lucene but is now a likely
top-level project.

--A list of the three most important issues to address in the move
towards graduation--

1. We need at least one additional active committer, as well as
additional users and repeat contributors
2. We need each current committer to broaden their commits
to new areas of the project
3. We'd like to see long-term contributions for project testing,
especially infrastructure access

--Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to
be aware of?--

All issues have been addressed to our satisfaction at this time.

--How has the community developed since the last report?--

Talks on ManifoldCF were given both at Apache Eurocon 2011 in
Barcelona and at ApacheCon NA in Vancouver.  Both were well-received.
We have not signed up any new committers this quarter, however,
although people have expressed interest in contributing,
especially in Vancouver.

Based on the feelings of our mentors, we have postponed further talk
of graduation, because commits are still not well distributed
among our committer base.  This is viewed as a blocker by one mentor
but not the other.

--How has the project developed since the last report?--

A 0.1 release was made on January 31, 2011, and a 0.2 release
occurred on May 17, 2011.  Another release occurred on September
20, 2011.  A fourth release is planned for December 15, 2011, and will
contain significant new features, including full support for HSQLDB
and a new Alfresco connector.

Signed off by mentor:


Any comments, additions, improvements?

Karl




[jira] [Created] (CONNECTORS-299) Crawling using Postgresql 9.1 hangs occasionally for a while

2011-12-01 Thread Karl Wright (Created) (JIRA)
Crawling using Postgresql 9.1 hangs occasionally for a while


 Key: CONNECTORS-299
 URL: https://issues.apache.org/jira/browse/CONNECTORS-299
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework core
Affects Versions: ManifoldCF 0.4
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 0.4


The hang takes place because of this postgresql error:

 message: ERROR: could not serialize access due to read/write dependencies 
among transactions
   Detail: Reason code: Canceled on identification as a pivot, during conflict 
in checking.
   Hint: The transaction might succeed if retried., - state: 40001

The SQLSTATE code 40001 (serialization failure) should be treated like a 
deadlock so that the transaction is retried.
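
A minimal sketch of that interpretation, assuming the error reaches the caller as a java.sql.SQLException. The class and method names are hypothetical and are not taken from ManifoldCF. 40001 is the standard SQLSTATE for a serialization failure, and 40P01 is the code PostgreSQL uses for a detected deadlock.

```java
import java.sql.SQLException;

/** Hypothetical helper for illustration; not part of ManifoldCF. */
final class RetryableStates {
    private RetryableStates() {}

    /** SQLSTATE 40001: serialization failure. */
    static final String SERIALIZATION_FAILURE = "40001";
    /** SQLSTATE 40P01: deadlock detected (PostgreSQL-specific). */
    static final String DEADLOCK_DETECTED = "40P01";

    /** Returns true when the failed transaction is worth retrying. */
    static boolean isRetryable(SQLException e) {
        String state = e.getSQLState();
        return SERIALIZATION_FAILURE.equals(state)
            || DEADLOCK_DETECTED.equals(state);
    }
}
```

Any other SQLSTATE, such as a syntax error, would still be reported as a real failure.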


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CONNECTORS-299) Crawling using Postgresql 9.1 hangs occasionally for a while

2011-12-01 Thread Karl Wright (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-299:
---

Description: 
The hang takes place because of this postgresql error:

ERROR 2011-11-28 14:46:23,959 (Worker thread '25') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: ERROR: could not serialize access due to read/write dependencies among transactions
 Detail: Reason code: Canceled on identification as a pivot, during commit attempt.
 Hint: The transaction might succeed if retried.
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: could not serialize access due to read/write dependencies among transactions
 Detail: Reason code: Canceled on identification as a pivot, during commit attempt.
 Hint: The transaction might succeed if retried.
   at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:608)
   at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.commitCurrentTransaction(DBInterfacePostgreSQL.java:1141)
   at org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:321)
   at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1112)
   at org.apache.manifoldcf.crawler.jobs.JobManager.finishDocuments(JobManager.java:4072)
   at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:567)
Caused by: org.postgresql.util.PSQLException: ERROR: could not serialize access due to read/write dependencies among transactions

This is happening during the commit phase of the transaction, at a point where 
none of the transaction blocks in ManifoldCF know to look for it. It is 
therefore likely that all transaction blocks throughout ManifoldCF that deal 
with deadlock will need to be changed.
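
One way to picture the change, as a sketch under stated assumptions rather than the actual fix: the commit has to sit inside the same retry scope as the statements, so that a serialization failure raised at commit time restarts the whole block. The Transaction interface and every name below are hypothetical stand-ins, not ManifoldCF's database API.

```java
import java.sql.SQLException;

/** Hypothetical retry wrapper; names are illustrative only. */
final class CommitRetry {
    private CommitRetry() {}

    /** Stand-in for one transaction block (an assumption of this sketch). */
    interface Transaction {
        void execute() throws SQLException;   // the statements in the block
        void commit() throws SQLException;    // may itself fail with 40001
        void rollback() throws SQLException;
    }

    /**
     * Runs the block and commits it, retrying from the top when either step
     * fails with SQLSTATE 40001. Returns the number of attempts used.
     */
    static int runWithRetry(Transaction tx, int maxAttempts) throws SQLException {
        for (int attempt = 1; ; attempt++) {
            try {
                tx.execute();
                tx.commit();   // inside the try, so commit-time failures are caught too
                return attempt;
            } catch (SQLException e) {
                tx.rollback();
                if (!"40001".equals(e.getSQLState()) || attempt >= maxAttempts) {
                    throw e;   // not retryable, or out of attempts
                }
                // retryable: loop around and run the whole block again
            }
        }
    }
}
```

A real implementation would probably also pause briefly between attempts; that is left out here to keep the sketch short.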



[jira] [Commented] (CONNECTORS-299) Crawling using Postgresql 9.1 hangs occasionally for a while

2011-12-01 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161079#comment-13161079
 ] 

Karl Wright commented on CONNECTORS-299:


r1209209 adds the infrastructure and deploys it at the place we see in the 
exception listed above.  Obviously many more places need to be worked on 
before the ticket is in any sense complete.




[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

2011-12-01 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161139#comment-13161139
 ] 

Karl Wright commented on CONNECTORS-275:


bq. I'm assuming Method A is closer to what you described:
bq.
bq. Method A:
bq.
bq.* (while on mysite.com/session-timeout-message.html which has a link to login.cgi)
bq.  3: (same as above, matching timeout-msg.html) Tell MCF to Fetch: http://mysite.com/login.cgi
bq.  4: (new, matching login.cgi) Tell MCF that the form name is ^$, and that the parameters are username=me and password=hello.

Yes, exactly.

bq. The only issue here is that, since there is no form on login.cgi, there's 
no method=GET to inherit from.

So the link from the timeout page sends you to login.cgi without any parameters 
at all, and yet login.cgi requires parameters to perform the login?  Or (I've 
seen this done before) when you go to http://mysite.com/login.cgi, do you get 
the form at that time, which when submitted goes right back to login.cgi, but 
this time with the GET form data?  If the former, we'd need a new type of login 
page.  If the latter, we could make it work with the current software.

bq. If more code needs to be written, I wasn't necessarily bugging you to write 
it (though you'd be faster at it!)

Let's think it through first and then see.  Usually in cases like this I create 
a branch so that we can do multiple commits and not have to put everything in a 
single (massive) patch.  This also means we can both work on the code.

Adding a new login page type is not that challenging technically, just a bit of 
work in the UI mostly.  But how would that new login page actually work?  
Should it match the URL regexp only, or should there be some other identifying 
characteristic on the page itself?  And, since there's no form to submit, and 
there are three different ways to submit a form in HTML, it seems to me that 
we'd want to basically specify a virtual form, consisting of everything you 
might find on a normal form: the form type, the action URL, and a complete set 
of name/value pairs to be transmitted to the action target.  Does this sound 
right?
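
As a rough sketch of what such a virtual form might hold (every name here is hypothetical, and the three form types listed are an assumed reading of "three different ways to submit a form", not something taken from the web connector code):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical "virtual form": form type, action URL, name/value pairs. */
final class VirtualForm {
    /** Assumed set of submission styles. */
    enum FormType { GET, POST_URLENCODED, POST_MULTIPART }

    final FormType type;
    final String actionUrl;
    final Map<String, String> parameters = new LinkedHashMap<>();

    VirtualForm(FormType type, String actionUrl) {
        this.type = type;
        this.actionUrl = actionUrl;
    }

    /** Adds one name/value pair; insertion order is kept. */
    VirtualForm with(String name, String value) {
        parameters.put(name, value);
        return this;
    }

    /** The pairs as name=value, URL-encoded and joined with '&'. */
    String encodedParameters() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : parameters.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    /** For GET the pairs go on the URL; for POST types they belong in the body. */
    String requestUrl() {
        if (type != FormType.GET || parameters.isEmpty()) return actionUrl;
        String separator = actionUrl.indexOf('?') >= 0 ? "&" : "?";
        return actionUrl + separator + encodedParameters();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

With the values from Method A, a GET form for http://mysite.com/login.cgi carrying username=me and password=hello would be fetched as http://mysite.com/login.cgi?username=me&password=hello.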






 Clarify documentation as to how to set up session login for web connector
 -

 Key: CONNECTORS-275
 URL: https://issues.apache.org/jira/browse/CONNECTORS-275
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Documentation, Web connector
Affects Versions: ManifoldCF 0.4
Reporter: Karl Wright

 A book reader has this comment, which basically implies that we need to 
 improve the documentation for the web connector:
 I was excited to get the full version of the online book, but then 
 disappointed when it referred back to the online doc for setting up logins 
 for Web spidering. The online doc is very vague and only gives one example. 
 I've used Ultraseek's and Google's spider, but I still find the Session login 
 sequences non-obvious.
 I've got a subscription request into the user mailing list, but here are the 
 parts that are not clear.
 I generally understand about using regexes to define sites and sorting out 
 content pages from login pages.
 But it's not clear why there are TWO regexes per entry. There's a Login URL 
 regex, and also a Form name/link target regex.
 It's also not clear about the page type radio button choices.
 For redirection, am I saying "look for a redirect event", or am I saying 
 "then DO a redirect to this page"?
 And for form name, what if my login page doesn't have a named form? In the 
 case of the site I'm trying to spider, when your session expires, you 
 manually go back to an https page and supply your username and password as 
 CGI parameters. I know this sounds odd, but it's apparently how a number of 
 the sites we're trying to spider work, some proprietary software.
 Karl, I really think the book or Wiki or doc needs 3 or 4 different examples 
 of login scenarios.
 Here's the scenario I'm trying, if you'd like to use it:
 Try to fetch: http://site.com/product?id=1234
 If you get a redirect to: http://site.com/Main.asp
 Note that there's no login form nor link on this page.
 Then invoke this login URL: 
 https://site.com/validate?username=me&password=that&otherArg=something
 Note that you can't just visit this page and fill in a form, that gives an 
 error, it has to be passed in (I think as a GET)
 Then record the session cookie and try for /product?id=1234 again.
 I realize this is odd, I didn't design it. 


[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

2011-12-01 Thread Mark Bennett (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161170#comment-13161170
 ] 

Mark Bennett commented on CONNECTORS-275:
-

 So the link from the timeout page sends you to login.cgi without any 
 parameters at all, and yet login.cgi requires parameters to perform the login?

I believe so, need to verify.  8 different sites = 8 slightly different 
behaviors.

 Or (I've seen this done before) when you go to http://mysite.com/login.cgi, 
 do you get the form at that time...

I wish!  At least on some sites, no.

And worse(!) on some sites, if you just go to login.cgi with no parms, you get 
a nasty error, like maybe a 500.

So that'd be another problem - to tell MCF to ignore even severe errors (so 
that we can have the 2 step rule)

 But how would that new login page actually work? Should it match the URL 
 regexp only, or should there be some other identifying characteristic on the 
 page itself?

Not sure I'm directly answering this, but this might be where my habits with 
other spiders are different enough from MCF's that maybe there's an implicit 
"unlearn *that*!" in my near future.

I'd classify MCF as a reactive pattern matcher.  It can do almost anything 
based on what it gets back.

Whereas I was thinking more proactive: "IF you see url-A THEN GOTO 
arbitrary-url-B", where the ONLY place literal url-B exists is in the config 
screen.  In that scenario, where I can inject arbitrary new URLs via 
configuration, it looks easy to me.

In that scenario (arbitrary config injection) we solve all the problems at 
once.  A URL with ?arg=value&arg=value IS a GET, so no config there.  And I 
get to specify the args inline, in the URL.

This is inelegant as a general solution.  I can enumerate a few problems right 
here: What if it needed to be a POST after all?  What if my parameters are long 
and have spaces and need URL encoding?  Then I'd have to encode them manually.  
Editing 1.5k URLs in a 3 inch HTML web form is UGLY.  And what if I didn't know 
the exact URL, but I could calculate it based on some other state?
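
To make the "IF you see url-A THEN GOTO arbitrary-url-B" idea concrete, here is a rough sketch. The class and its methods are hypothetical; nothing like this exists in MCF as far as this thread shows, and the URLs are the ones from the scenario in the issue description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical config-driven rules: "if you see url-A, then fetch url-B". */
final class InjectionRules {
    private static final class Rule {
        final Pattern seen;    // regex for the URL the crawler landed on
        final String target;   // literal URL that lives only in configuration
        Rule(Pattern seen, String target) {
            this.seen = seen;
            this.target = target;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Registers a rule; the regex must match the whole fetched URL. */
    InjectionRules add(String seenUrlRegex, String targetUrl) {
        rules.add(new Rule(Pattern.compile(seenUrlRegex), targetUrl));
        return this;
    }

    /** Returns the configured URL to fetch next, or null when no rule matches. */
    String nextUrlFor(String fetchedUrl) {
        for (Rule r : rules) {
            if (r.seen.matcher(fetchedUrl).matches()) {
                return r.target;
            }
        }
        return null;
    }
}
```

A rule mapping the regex `http://site\.com/Main\.asp` to the validate URL would send the crawler to the login URL whenever it is redirected to Main.asp. The target is taken verbatim, so any URL encoding is still the user's job, which is exactly the inelegance listed above.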

MCF's model handles all those other items in a much more general, re-usable 
way.  Whereas the special case of "I just need it to fetch this arbitrary 200 
character URL" almost seems like a degenerate use case which coincidentally has 
an easy fix.  And my only response to that, arguing both sides of the coin 
here, is that this might be a much more common edge case than a software 
architect might assume.

Do the last few paragraphs make sense?  And did it answer your question?

BTW Karl, this is probably the most detailed (and to me interesting) 
conversation I've had with anybody about the minutiae of URLs and logins in a 
while.  Normally I'd corral an engineer in front of a whiteboard, but this is 
more like how they used to play chess, via US Mail, kinda fun!





[jira] [Resolved] (CONNECTORS-299) Crawling using Postgresql 9.1 hangs occasionally for a while

2011-12-01 Thread Karl Wright (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-299.


Resolution: Fixed



[jira] [Commented] (CONNECTORS-275) Clarify documentation as to how to set up session login for web connector

2011-12-01 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161315#comment-13161315
 ] 

Karl Wright commented on CONNECTORS-275:


r1209313 to improve web connector end user documentation.

