FWIW, the ticket is CONNECTORS-96. I've created a branch to work on it. I'll let you know when I think it's ready to try out.

Karl
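The usual fix for connections that die while sitting in a pool is to validate each connection as it is borrowed, so a server-side timeout is caught up front rather than mid-query. Below is a minimal sketch of that pattern in plain JDBC; the class and method names are illustrative only, not the actual CONNECTORS-96 code:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative sketch only -- not the CONNECTORS-96 code.
    final class ValidatingPool {
      interface ConnectionFactory {
        Connection create() throws SQLException;
      }

      private final Deque<Connection> idle = new ArrayDeque<Connection>();
      private final ConnectionFactory factory;

      ValidatingPool(ConnectionFactory factory) {
        this.factory = factory;
      }

      synchronized Connection borrow() throws SQLException {
        while (!idle.isEmpty()) {
          Connection c = idle.pop();
          try {
            // Cheap liveness probe (JDBC 4): a connection the server
            // has silently closed is discarded here, instead of
            // failing later in the middle of a result set.
            if (c.isValid(2))
              return c;
            c.close();
          } catch (SQLException e) {
            // Broken connection; drop it and try the next one.
          }
        }
        return factory.create(); // pool empty: open a fresh connection
      }

      synchronized void release(Connection c) {
        idle.push(c);
      }
    }

Connection.isValid() keeps the probe cheap; pools that skip this step tend to hand out connections the server has already dropped.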
On Mon, May 7, 2012 at 5:53 AM, Karl Wright <daddy...@gmail.com> wrote:
> Also, there has been a long-running ticket to replace the JDBC pool
> driver with something more modern for a while. Many of the
> off-the-shelf pool drivers are inadequate for various reasons, so I
> have one that I wrote myself, but it is not yet committed. So I am
> curious - which connections are timing out? The Oracle connections or
> the PostgreSQL ones?
>
> Karl
>
> On Mon, May 7, 2012 at 5:34 AM, Karl Wright <daddy...@gmail.com> wrote:
>> What database are you using? (Not the JDBC database, the underlying
>> one...) If PostgreSQL, what version? What version of ManifoldCF? If
>> you could also post some of the long-running queries, that would be
>> good as well.
>>
>> Depending on the database, ManifoldCF periodically
>> re-analyzes/reindexes the underlying database during the crawl. When
>> the table is large, this can produce warnings about long-running
>> queries, because database performance is slowed while the reindex
>> runs. That's not usually a problem, other than briefly slowing the
>> crawl. However, it's also possible that there's a point where
>> PostgreSQL's plan is poor, and we should be able to see that, because
>> the warning also dumps the plan.
>>
>> Truncating the jobqueue table is not recommended, since then
>> ManifoldCF has no idea of what it has crawled and what it hasn't, and
>> its incremental properties suffer.
>>
>> Karl
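The maintenance Karl describes boils down to refreshing planner statistics and inspecting query plans. A sketch of doing both by hand over JDBC, assuming a PostgreSQL back end; the connection URL, credentials, and the jobqueue query (including the status value) are illustrative guesses, not the exact schema:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PlanCheck {
      public static void main(String[] args) throws Exception {
        // URL/credentials are placeholders for your PostgreSQL instance.
        Connection db = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/manifoldcf", "mcf", "secret");
        Statement st = db.createStatement();
        // Refresh planner statistics for the table, which is what the
        // periodic maintenance amounts to on the PostgreSQL side.
        st.execute("ANALYZE jobqueue");
        // Dump the plan for a suspect query and compare it with the one
        // printed in the long-running-query warning. The WHERE clause
        // (and the status value) is an illustrative guess.
        ResultSet rs = st.executeQuery(
            "EXPLAIN SELECT id FROM jobqueue WHERE status = 'P'");
        while (rs.next())
          System.out.println(rs.getString(1));
        rs.close();
        st.close();
        db.close();
      }
    }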
>> On Mon, May 7, 2012 at 1:25 AM, Michael Le <michael.aaron...@gmail.com> wrote:
>>> Hello,
>>>
>>> Using a JDBC repository connection to an Oracle 11g database, I've
>>> had issues where, in the initial seeding stage, the connection to
>>> the database closes in the middle of processing the result set. The
>>> table I'm trying to index has about 10 million records, and with the
>>> original code I could never get past about 750K records.
>>>
>>> I spent some time with the pooling parameters for the bitmechanic
>>> database pooling, but its API and source no longer seem to be
>>> available; even the original author doesn't have the code or specs
>>> any more. The parameter changes let me get through the first stage
>>> of processing a 2M-row subset, but during the second stage, where it
>>> tries to obtain the documents, connections again started being
>>> closed. I ended up just replacing the connection pool code with an
>>> Oracle implementation, and it's churning through the documents
>>> happily. As a footnote, on my sample subset of about 400K documents,
>>> throughput went from about 10 docs/s to 19 docs/s, but this may just
>>> be a side effect of Oracle database load or network traffic.
>>>
>>> Has anyone else had issues processing a large Oracle repository?
>>> I've noted that the benchmarks were done with 300K documents, and
>>> even in our initial testing with about 500K documents, no issues
>>> arose.
>>>
>>> The second, more pressing issue is the jobqueue table. In the
>>> process of debugging the database connection issues, jobs were
>>> started, stopped, deleted, and aborted, and various WHERE clauses
>>> were applied to the seeding queries/jobs. MCF is now reporting
>>> long-running queries against this table. In the past I've just
>>> truncated the jobqueue table, but that had the side effect of
>>> stuffing documents into Solr (the output connector) multiple times.
>>> What API calls or SQL can I run to clean up the jobqueue table?
>>> Should I just wait for all jobs to finish and then truncate the
>>> table at that point? I've broken my data into several smaller
>>> subsets of around 1-2 million rows, but that has the side effect of
>>> a jobqueue table that is 6-8 million rows.
>>>
>>> Any support would be greatly appreciated.
>>>
>>> Thanks,
>>> -Michael Le
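One off-the-shelf way to get the behavior Michael's replacement pool achieves is Oracle's Universal Connection Pool (UCP); whether this matches his actual code is an assumption, and the URL, credentials, and pool sizes below are placeholders:

    import java.sql.Connection;
    import java.sql.SQLException;
    import oracle.ucp.jdbc.PoolDataSource;
    import oracle.ucp.jdbc.PoolDataSourceFactory;

    public class OraclePoolSetup {
      public static PoolDataSource build() throws SQLException {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
        pds.setURL("jdbc:oracle:thin:@//dbhost:1521/ORCL"); // placeholder
        pds.setUser("crawler");                             // placeholder
        pds.setPassword("secret");                          // placeholder
        pds.setInitialPoolSize(4);
        pds.setMaxPoolSize(16);
        // The key setting: probe each connection as it is borrowed, so a
        // connection the server has silently closed is replaced instead
        // of failing in the middle of a result set.
        pds.setValidateConnectionOnBorrow(true);
        // Retire connections that sit idle longer than the server allows.
        pds.setInactiveConnectionTimeout(300); // seconds
        return pds;
      }

      public static void main(String[] args) throws SQLException {
        Connection c = build().getConnection();
        // ... run seeding / document queries here ...
        c.close();
      }
    }

setValidateConnectionOnBorrow(true) is the setting that addresses connections closed mid-crawl; the idle timeout then keeps the pool from holding connections past the server's limit.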