Thank you for the responses fellas.

Solr is very fast, extremely flexible, can be deployed in a highly available manner, does not have complex requirements, and is easy to configure and maintain. On top of that, you guys own it. I am surprised that you are not already using it for all of the ASF's project documentation, and I doubt Google's CSE would ever do the job as well as Solr could with just a little elbow grease behind it. You can make it whatever you want; it is, after all, open source, unlike the Google option. Why tie yourself to a vendor when you can keep complete control, do a better job with very few resources, and never have to worry about the issues that come with depending on something outside of your control?

You have requirements outlined below for "critical services". Some of them would obviously not apply if the function were outsourced, but some would, or at least should. Does Google provide an SLA for their free CSE? Yes, I know how silly that sounds, but work with me here. It's tough to compete with Google; I'm throwing out everything I have. ;)

I'll take a stab at approximate answers to the questions below. If it ever comes to the point that you need the information with greater precision, I will be glad to help with that.

Justin Erenkrantz wrote:
On Oct 8, 2007 11:51 AM, Vincent Bray <[EMAIL PROTECTED]> wrote:
I'm very much in favour of seeing how far we can take Solr as the
search mechanism for the httpd docs.

What are the production requirements for Solr?  IOW, what do we need
to run on www.apache.org to make this happen?  How much disk space?
How much RAM?  We do not currently run Java on our main web servers,
so running and maintaining it would have to be sorted out.  I don't
know if the Solr guys are even interested in helping us maintain a
local search engine.  (Previously, the Perl guys tried and gave up.)

Solr does not have to run on the web server hosting the search page. It can live on any server reachable by your web server that meets the requirements to run it. The import/transform scripts also do not have to run on the same server as Solr, since they submit documents to be indexed via a web request to Solr.

Solr needs Java 1.5 and an application server that supports the Servlet 2.4 standard. I used Jetty for my demo.
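
For the curious, the demo setup is a one-liner, since the Solr download bundles Jetty in its example directory:

    # from the top of the stock Solr distribution; the example dir
    # ships with a preconfigured Jetty
    cd example
    java -jar start.jar
    # Solr then answers on http://localhost:8983/solr/admin/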

The import/transform script needs Perl with XML::XPath and XML::XPath::XMLParser, an XSLT tool such as Xalan or xsltproc, curl (or the curl Perl modules), and Subversion to check out a copy of the httpd docs and the build files. There needs to be enough space to check out the docs and build files, plus temporary space for the transforms: roughly 80 meg. The current Solr index with only the English version of the httpd documents is 1.7 meg. Extrapolate that to account for the number of languages supported (5 or 6?) and call it 15 meg.
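
To make the moving parts concrete, here is a rough sketch of what the pipeline boils down to. The stylesheet name is made up and the svn URL is from memory, so treat this as an outline rather than the actual script:

    # check out the httpd manual sources (URL from memory)
    svn co http://svn.apache.org/repos/asf/httpd/httpd/trunk/docs/manual httpd-docs

    # transform each doc into Solr's <add><doc>...</doc></add> update
    # format and post it; manual-to-solr.xsl is a placeholder name
    for f in httpd-docs/*.xml; do
        xsltproc manual-to-solr.xsl "$f" > /tmp/solr-doc.xml
        curl http://localhost:8983/solr/update \
             -H 'Content-Type: text/xml' --data-binary @/tmp/solr-doc.xml
    done

    # commit so the new documents become searchable
    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
         --data-binary '<commit/>'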

The Solr application itself comes in under 12 meg, including the source and Jetty. I am not sure what Tomcat or other options would require, but I will find out.

Sum that up, and disk-space-wise we have around 30 meg for a full-language Solr install, plus an additional 80 meg, either local or decoupled, for the documents, build files, and temporary transform space. 110 meg is not so bad. Call it 200 meg to be safe and allow for some terse logging.

Currently running in Jetty, with a nice full query cache but otherwise idle, it looks like this:
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
8947 arreyder  25   0  830m 102m  18m S    0  5.1   0:07.11 java

1 gig of RAM should be comfortable, but the more the better for the sake of query caching.

I have not loaded it up yet to see how it behaves under concurrent connections, but I am working on a test script to do just that (sketched below). The test script will also be a great tool for preloading the cache. I will do this and report the results if anyone is still interested.
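
Nothing fancy is needed; something along these lines (queries.txt being a made-up file of sample search terms, one URL-encoded query per line) both hammers the server and warms the cache:

    # replay sample queries against Solr; run a few copies in parallel
    # for a crude concurrency test
    while read q; do
        curl -s "http://localhost:8983/solr/select?q=$q" > /dev/null
    done < queries.txt

    # or lean on ApacheBench for proper concurrency numbers
    ab -n 1000 -c 10 'http://localhost:8983/solr/select?q=mod_rewrite'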

The ASF infrastructure team has a checklist of things that must be
satisfied before adding any new 'critical services' (which this falls
under).  See below for the current list.

So, I sort of think that just filling out a special account for a
'custom search engine' would be a *lot* less work.  =)  -- justin


You may be confusing work with fun, and most of the fun has already been had by me in getting it this far. Perhaps the work you speak of is a hint that you would be the first to volunteer to help get my Solr implementation formally going? ;) I doubt you got to where you are by passing on good things because they required a little "work", and aren't you guys under some sort of eat-your-own-dog-food directive? If you do not use it, who will?

---
This provides a list of requirements and doctrines for web applications
that wish to be deployed on the Apache Infrastructure.  It is intended to
help address many of the recurring issues we see with deployment and
maintenance of applications.

Definition of 'system': Any web application or site which will receive
traffic from public users in any manner.

Definition of 'critical systems': Any web application or site which runs
under www.apache.org, or is expected to receive a significant portion of
traffic.

1) All systems must be generally secure and robust. In cases of failure,
they should not damage the entire machine.


Since Solr is a service that is typically only called by another service, it enjoys the security advantage of being at least once removed from the end user and never directly accessed by them. You could certainly add rate limiting and other methods to keep load from ever reaching the point where it could impact other co-located services. No real security or load-management challenges here.

2) All systems must provide reliable backups, at least once a day, with
preference to incremental, real time or <1 hour snapshots.


Solr provides an easy method for off-site replication via snapshots, which could be used for backups. It is also worth mentioning that on my low-end Core 2 Duo with 2 gig of RAM it takes only around 70 seconds to transform and index the complete English httpd documents from scratch. As long as you have the documents available for checkout and the scripts to do it, you are never far from a freshly created index.
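
The snapshot end is already scripted for us; roughly like this (option letters from memory, see the Solr wiki's CollectionDistribution page for the real details):

    # take a hard-link snapshot of the live index (cheap, near-instant)
    solr/bin/snapshooter -d solr/data
    # snapshots land in the data dir as snapshot.YYYYMMDDHHMMSS;
    # rsync one off-site and that is your backup
    rsync -a solr/data/snapshot.* backuphost:/backups/solr/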

3) All systems must be maintainable by multiple active members of the
infrastructure team.


I am not a member, but I am still happy to help. Anyone else like to give me a hand? :)

4) All systems must come with a 'runbook' describing what to do in event
of failures, reboots, etc.  (If someone who has root needs to reboot the
box, what do they need to pay attention to?)


Again, no real challenge here; I'd be happy to throw this together.

5) All systems must provide at least minimal monitoring via Nagios.

I'll write a plugin to do this, or we can just use the check_http plugin that is already there. It depends on how deeply you want the service check to go.
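
For the shallow version, pointing check_http at Solr's ping handler would look something like this (hostname and port are placeholders for wherever it lands):

    # is Solr up and answering at all?
    check_http -H search.apache.org -p 8983 -u /solr/admin/ping
    # deeper: run a real query and expect a known string in the response
    check_http -H search.apache.org -p 8983 \
               -u '/solr/select?q=solr' -s '<response'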

6) All systems must be restorable and relocatable to other machines
without significant pain.


Replicating this configuration and packaging it is trivial. As I said before, even if we have to re-index the docs, it only takes a minute or so per language. I'll build a package and a deployment script.
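
Since everything can live under a single directory, the relocation story is about three commands (paths illustrative):

    # pack up the whole solr home, index and all
    tar czf solr-search.tar.gz solr/
    scp solr-search.tar.gz newhost:/opt/
    ssh newhost 'cd /opt && tar xzf solr-search.tar.gz'
    # worst case, skip the tarball entirely and just re-run the
    # indexing scripts on the new machine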

7) All systems must have some kind of critical mass.  In general we do
not want to host one offs of any system.


"If you build it they will come." Did I mention I am from Iowa? We have this baseball diamond in a cornfield that you really should come see.


8) All system configuration files must be checked into Subversion.


Delighted to check in all 5 configuration files/scripts.

9) All system source must either be checked into Subversion, be at a
well-known public location, or is provided by the base OS.  (Hosting
binary-only webapps is a non-starter.)


Since Solr is an Apache project, I am guessing you already have this part under control.

10) All systems, prior to consideration of deployment, must provide a
detailed performance impact analysis (bandwidth and CPU).  How are
techniques like HTTP caching used?  Lack of HTTP caching was MoinMoin's
initial PITA.


It does cache queries, and with mod_deflate out front bandwidth should be minimal; it's just text. I still need to get the details on CPU load and see how well it scales on a single machine. I'm working on it.
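
Once it is behind httpd, measuring the wire cost is a one-minute job; something like this (the /search URL is made up) shows what mod_deflate buys us:

    # raw size of a typical result page
    curl -s -o /dev/null -w '%{size_download} bytes plain\n' \
         'http://www.apache.org/search?q=mod_rewrite'
    # same request asking for gzip; curl reports the on-the-wire size
    curl -s -o /dev/null -H 'Accept-Encoding: gzip' \
         -w '%{size_download} bytes gzipped\n' \
         'http://www.apache.org/search?q=mod_rewrite'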

11) All systems must have clearly articulated, defined, and recorded
dependencies.


This is a very short list that I have, for the most part, already covered: Perl with XML::XPath and XML::XPath::XMLParser; an XSLT tool such as Xalan or xsltproc; curl (or the curl Perl modules); and Subversion to check out a copy of the httpd docs and the build files. For Solr itself: Java 1.5 and an application server that supports the Servlet 2.4 standard.

12) All critical systems must be replicated across multiple machines,
with preference to cross-atlantic replication.


Not a problem. Solr has a multi-server replication method built on snapshots, rsync, and hard links.
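
The slave half of the same collection distribution scripts does the pulling; on each replica, roughly (option letters from memory):

    # pull the newest snapshot from the master; rsync plus hard links
    # keep both the transfer and the disk cost small
    solr/bin/snappuller -M master.apache.org
    # swap the pulled snapshot in as the live index
    solr/bin/snapinstaller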

13) All systems must have single command operations to start, restart
and stop the system.  Support for init scripts used by the base
operating system is preferred.


You mean I need to do more than "nohup java -jar solr.jar &"? Sheesh! Seriously, though: since you are probably not planning on running this in Jetty, whatever it lands on (Tomcat or otherwise) probably already has that requirement covered. If not, I'm on it; see the sketch below.
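
If we do end up wanting our own wrapper, it is about ten lines of shell (paths hypothetical):

    #!/bin/sh
    # minimal start/stop wrapper for a Jetty-hosted Solr; a container
    # like Tomcat would bring its own real init script
    SOLR_HOME=/opt/solr/example
    PIDFILE=/var/run/solr.pid
    case "$1" in
      start)
        cd "$SOLR_HOME" && nohup java -jar start.jar >/dev/null 2>&1 &
        echo $! > "$PIDFILE"
        ;;
      stop)
        kill "$(cat $PIDFILE)" && rm -f "$PIDFILE"
        ;;
      restart)
        $0 stop; sleep 2; $0 start
        ;;
      *)
        echo "usage: $0 {start|stop|restart}"; exit 1
        ;;
    esac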

I left out a few requirements, as they are yet to be determined. I am not sure what kind of web front end you might want for the query and results pages, so I cannot speak to the requirements on that end. Updating documents in the Solr index is currently a manual process. It could be adjusted to run at an interval via crontab (one-liner sketched below), be triggered by the formal document builds, or watch svn diffs and re-import when it sees a change in a document it has been told to index. Lastly, I'm doing this Solr Apache-documents thing with or without you; you may as well take advantage of it. :)
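
The crontab route is the easy interim answer; one line does it (script name made up):

    # re-check out, transform, and re-index the docs every night at 3am
    0 3 * * *  /opt/solr/bin/reindex-httpd-docs.sh >> /var/log/solr-reindex.log 2>&1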

chris rhodes
[EMAIL PROTECTED]




