We're already using Solr in our implementation (eg: http://me.edu.au/public/search?q=blog&category=BLOG or http://me.edu.au/public/search?q=blog&category=BLOG_ENTRY)
We have Solr deployed in an external web app, not inside Roller itself. We have an index request table which holds index requests for inserts, updates and deletes. That table is populated by database triggers and read by a batch process which sends the updates to Solr.

I have two concerns with the linked proposal. Neither is a deal-breaker, though.

1) The lack of transactional guarantees across the combination of the database update and the web service call means there is potential for inconsistency between the search index and the database. There are workarounds for that (keeping the transaction open and somehow passing the connection to the listeners so they can roll it back if something goes wrong), but they do carry performance penalties.

2) Solr's update performance is much better if you can batch your updates. We found that a batch size of 20 (Solr updates) gave up to 90% better performance than single updates, even when commits are only sent periodically. (A rough sketch of what the batching looks like is below the quoted message.)

I'd propose a slight modification of the proposal:

1) Anything which needs to be indexed also writes a row to an index request table.

2) We use the listener as proposed to fire off an index request, which reads any unindexed rows from that table, sends those updates to Solr, and updates the table to show they have been successfully indexed.

3) On startup, we use similar code to index anything in that table which hasn't already been indexed. (There's a second sketch covering steps 2 and 3 at the very end of this message.)

This doesn't address my concern (2) above (although there are some easy optimizations to cover that), but it does mean we'd avoid the transactional problems.

I'm happy to donate the code we have if people are interested. The fact that it relies on triggers will pretty much rule it out as-is as a general solution, but there might be some useful code there.

Nick

-----Original Message-----
From: Dave [mailto:[EMAIL PROTECTED]
Sent: Thursday, 11 December 2008 12:14 AM
To: [email protected]
Subject: Proposal - Clusterable Search Via Solr

Roller uses Lucene for search, and Lucene stores its search index on disk. So if you have multiple Roller instances running you can either 1) have both Lucene instances write to the same disk file, which will fail, 2) have two inconsistent search indexes, which will be extremely irritating, or 3) turn off Roller's built-in search and use some external spider.

For those who are not happy with those choices, I offer this proposal to use a) Apache Solr and b) some improvements to Roller's plug-in infrastructure to enable a clusterable search implementation in Roller.

http://cwiki.apache.org/confluence/x/_5kB

Feedback is welcome.

- Dave
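To illustrate the batching point (my concern 2 above): this isn't our production code, just roughly what batched updates look like with SolrJ. CommonsHttpSolrServer is the SolrJ 1.3 client class; the URL, class name and batch size here are placeholders, not anything from Roller or from our app.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchingIndexer {

    private static final int BATCH_SIZE = 20;

    private final SolrServer solr;

    public BatchingIndexer(String solrUrl) throws Exception {
        // e.g. "http://localhost:8983/solr" -- placeholder URL
        this.solr = new CommonsHttpSolrServer(solrUrl);
    }

    // Sends documents in batches of BATCH_SIZE, committing once at the end
    // rather than after every document.
    public void index(List<SolrInputDocument> docs) throws Exception {
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
        for (SolrInputDocument doc : docs) {
            batch.add(doc);
            if (batch.size() >= BATCH_SIZE) {
                solr.add(batch);   // one HTTP round trip for the whole batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }
        solr.commit();             // periodic commit, not one per update
    }
}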

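And a rough sketch of steps 2 and 3 of the modified proposal: read any unindexed rows from the index request table, push them to Solr in one batch, then mark them indexed. The table and column names (roller_index_request, entry_id, indexed, etc.) and the field names are purely illustrative; they are not from Roller or from our schema.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexRequestProcessor {

    private final SolrServer solr;

    public IndexRequestProcessor(SolrServer solr) {
        this.solr = solr;
    }

    // Callable from the proposed listener, and again at startup to catch
    // anything that was written to the table but never indexed.
    public void processPending(Connection db) throws Exception {
        List<Long> done = new ArrayList<Long>();
        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();

        PreparedStatement select = db.prepareStatement(
            "SELECT id, entry_id, title, text FROM roller_index_request WHERE indexed = 0");
        ResultSet rs = select.executeQuery();
        while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getString("entry_id"));
            doc.addField("title", rs.getString("title"));
            doc.addField("text", rs.getString("text"));
            docs.add(doc);
            done.add(rs.getLong("id"));
        }
        rs.close();
        select.close();

        if (!docs.isEmpty()) {
            solr.add(docs);   // batched add, then a single commit
            solr.commit();

            // only mark rows indexed once Solr has accepted them
            PreparedStatement update = db.prepareStatement(
                "UPDATE roller_index_request SET indexed = 1 WHERE id = ?");
            for (Long id : done) {
                update.setLong(1, id);
                update.addBatch();
            }
            update.executeBatch();
            update.close();
        }
    }
}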