Thank you for the responses fellas.

Solr is very fast, extremely flexible, can be deployed in a highly available manner, does not have complex requirements, and is easy to configure and maintain. On top of that, you guys own it. I am surprised that you are not already using it for all of the ASF's project documentation, and I doubt Google's CSE would ever do the job as well as Solr could with just a little elbow grease behind it. You can make it whatever you want; it is, after all, open source, unlike the Google option. Why tie yourself to a vendor when you can keep complete control, do a better job with very few resources, and never have to worry about the issues that come with depending on something outside of your control?

You have requirements outlined below for "critical services". Some of them would obviously not apply if the function were outsourced, but some would, or at least should. Does Google provide an SLA for their free CSE? Yes, I know how silly that sounds, but work with me here. It's tough to compete with Google; I'm throwing out everything I have. ;)

I'll take a stab at approximate answers to the questions below. If it ever comes to the point that you need the information with greater precision, I will be glad to help with that.

Justin Erenkrantz wrote:
On Oct 8, 2007 11:51 AM, Vincent Bray <[EMAIL PROTECTED]> wrote:
I'm very much in favour of seeing how far we can take Solr as the
search mechanism for the httpd docs.

What are the production requirements for Solr?  IOW, what do we need
to run on www.apache.org to make this happen?  How much disk space?
How much RAM?  We do not currently run Java on our main web servers,
so running and maintaining it would have to be sorted out.  I don't
know if the Solr guys are even interested in helping us maintain a
local search engine.  (Previously, the Perl guys tried and gave up.)

Solr does not have to run on the web server hosting the search page. It can live on any server reachable by your web server that meets the requirements to run it. The import/transform scripts also do not have to run on the same server as Solr, since they submit documents to be indexed via a web request to Solr.

Solr needs Java 1.5 and an application server that supports the Servlet 2.4 standard. I used Jetty for my demo.
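
For the curious, the demo setup is a one-liner, since the Solr download bundles Jetty in its example directory:

    # from the top of the stock Solr distribution; the example dir
    # ships with a preconfigured Jetty
    cd example
    java -jar start.jar
    # Solr then answers on http://localhost:8983/solr/admin/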

The import/transform script needs Perl with XML::XPath and XML::XPath::XMLParser, an XSLT tool such as Xalan or xsltproc, curl (or the curl Perl modules), and Subversion to check out a copy of the httpd docs and the build files. There needs to be enough space to check out the docs and build files, plus temporary space for the transforms: roughly 80 meg. The current Solr index with only the English version of the httpd documents is 1.7 meg. Extrapolate that to account for the number of languages supported (5 or 6?) and call it 15 meg.
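
To make the moving parts concrete, here is a rough sketch of what the pipeline boils down to. The stylesheet name is made up and the svn URL is from memory, so treat this as an outline rather than the actual script:

    # check out the httpd manual sources (URL from memory)
    svn co http://svn.apache.org/repos/asf/httpd/httpd/trunk/docs/manual httpd-docs

    # transform each doc into Solr's <add><doc>...</doc></add> update
    # format and post it; manual-to-solr.xsl is a placeholder name
    for f in httpd-docs/*.xml; do
        xsltproc manual-to-solr.xsl "$f" > /tmp/solr-doc.xml
        curl http://localhost:8983/solr/update \
             -H 'Content-Type: text/xml' --data-binary @/tmp/solr-doc.xml
    done

    # commit so the new documents become searchable
    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
         --data-binary '<commit/>'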

The Solr application itself comes in under 12 meg, including the source and Jetty. I am not sure what Tomcat or other options would require, but I will find out.

Sum that up, and disk-space-wise we have around 30 meg for a full-language Solr install, plus an additional 80 meg, either local or decoupled, for the documents, build files, and temporary transform space. 110 meg is not so bad. Call it 200 meg to be safe and allow for some terse logging.

Currently running in Jetty, with a nice full query cache but otherwise idle, it looks like this:
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
8947 arreyder  25   0  830m 102m  18m S    0  5.1   0:07.11 java

1 gig of RAM should be comfortable, but the more the better for the sake of query caching.

I have not loaded it up yet to see how it behaves under concurrent connections, but I am working on a test script to do just that (sketched below). The test script will also be a great tool for preloading the cache. I will do this and report the results if anyone is still interested.
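
Nothing fancy is needed; something along these lines (queries.txt being a made-up file of sample search terms, one URL-encoded query per line) both hammers the server and warms the cache:

    # replay sample queries against Solr; run a few copies in parallel
    # for a crude concurrency test
    while read q; do
        curl -s "http://localhost:8983/solr/select?q=$q" > /dev/null
    done < queries.txt

    # or lean on ApacheBench for proper concurrency numbers
    ab -n 1000 -c 10 'http://localhost:8983/solr/select?q=mod_rewrite'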

The ASF infrastructure team has a checklist of things that must be
satisfied before adding any new 'critical services' (which this falls
under).  See below for the current list.

So, I sort of think that just filling out a special account for a
'custom search engine' would be a *lot* less work.  =)  -- justin


You may be confusing work with fun, and most of the fun has already been had by me in getting it this far. Perhaps the work you speak of is a hint that you would be the first to volunteer to help get my Solr implementation formally going? ;) I doubt you got to where you are by passing on good things because they required a little "work", and aren't you guys under some sort of eat-your-own-dog-food directive? If you do not use it, who will?

---
This provides a list of requirements and doctrines for web applications
that wish to be deployed on the Apache Infrastructure.  It is intended to
help address many of the recurring issues we see with deployment and
maintenance of applications.

Definition of 'system': Any web application or site which will receive
traffic from public users in any manner.

Definition of 'critical systems': Any web application or site which runs
under www.apache.org, or is expected to receive a significant portion of
traffic.

1) All systems must be generally secure and robust. In cases of failure,
they should not damage the entire machine.


Since Solr is a service that is typically only called by another service, it enjoys the security advantage of being at least once removed from the end user and never directly accessed by them. You could certainly add rate limiting and other methods to keep load from ever reaching the point where it could impact other co-located services. No real security or load-management challenges here.

2) All systems must provide reliable backups, at least once a day, with
preference to incremental, real time or <1 hour snapshots.


Solr provides an easy method for off-site replication via snapshots, which could be used for backups. It is also worth mentioning that on my low-end Core 2 Duo with 2 gig of RAM it takes only around 70 seconds to transform and index the complete English httpd documents from scratch. As long as you have the documents available for checkout and the scripts to do it, you are never far from a freshly created index.
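
The snapshot end is already scripted for us; roughly like this (option letters from memory, see the Solr wiki's CollectionDistribution page for the real details):

    # take a hard-link snapshot of the live index (cheap, near-instant)
    solr/bin/snapshooter -d solr/data
    # snapshots land in the data dir as snapshot.YYYYMMDDHHMMSS;
    # rsync one off-site and that is your backup
    rsync -a solr/data/snapshot.* backuphost:/backups/solr/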

3) All systems must be maintainable by multiple active members of the
infrastructure team.


I am not a member, but I am still happy to help. Anyone else like to give me a hand? :)

4) All systems must come with a 'runbook' describing what to do in event
of failures, reboots, etc.  (If someone who has root needs to reboot the
box, what do they need to pay attention to?)


Again, no real challenge here; I'd be happy to throw this together.

5) All systems must provide at least minimal monitoring via Nagios.

I'll write a plugin to do this, or we can just use the check_http plugin that is already there. It depends on how deeply you want the service check to go.
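
For the shallow version, pointing check_http at Solr's ping handler would look something like this (hostname and port are placeholders for wherever it lands):

    # is Solr up and answering at all?
    check_http -H search.apache.org -p 8983 -u /solr/admin/ping
    # deeper: run a real query and expect a known string in the response
    check_http -H search.apache.org -p 8983 \
               -u '/solr/select?q=solr' -s '<response'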

6) All systems must be restorable and relocatable to other machines
without significant pain.


Replicating this configuration and packaging it is trivial. As I said before, even if we have to re-index the docs, it only takes a minute or so per language. I'll build a package and a deployment script.
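
Since everything can live under a single directory, the relocation story is about three commands (paths illustrative):

    # pack up the whole solr home, index and all
    tar czf solr-search.tar.gz solr/
    scp solr-search.tar.gz newhost:/opt/
    ssh newhost 'cd /opt && tar xzf solr-search.tar.gz'
    # worst case, skip the tarball entirely and just re-run the
    # indexing scripts on the new machine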

7) All systems must have some kind of critical mass.  In general we do
not want to host one offs of any system.


"If you build it they will come." Did I mention I am from Iowa? We have this baseball diamond in a cornfield that you really should come see.


8) All system configuration files must be checked into Subversion.


Delighted to check in all 5 configuration files/scripts.

9) All system source must either be checked into Subversion, be at a
well-known public location, or is provided by the base OS.  (Hosting
binary-only webapps is a non-starter.)


Since Solr is an Apache project, I am guessing you already have this part under control.

10) All systems, prior to consideration of deployment, must provide a
detailed performance impact analysis (bandwidth and CPU).  How are
techniques like HTTP caching used?  Lack of HTTP caching was MoinMoin's
initial PITA.


It does cache queries, and with mod_deflate out front bandwidth should be minimal; it's just text. I still need to get the details on CPU load and see how well it scales on a single machine. I'm working on it.
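
Once it is behind httpd, measuring the wire cost is a one-minute job; something like this (the /search URL is made up) shows what mod_deflate buys us:

    # raw size of a typical result page
    curl -s -o /dev/null -w '%{size_download} bytes plain\n' \
         'http://www.apache.org/search?q=mod_rewrite'
    # same request asking for gzip; curl reports the on-the-wire size
    curl -s -o /dev/null -H 'Accept-Encoding: gzip' \
         -w '%{size_download} bytes gzipped\n' \
         'http://www.apache.org/search?q=mod_rewrite'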

11) All systems must have clearly articulated, defined, and recorded
dependencies.


This is a very short list that I have, for the most part, already covered: Perl with XML::XPath and XML::XPath::XMLParser; an XSLT tool such as Xalan or xsltproc; curl (or the curl Perl modules); and Subversion to check out a copy of the httpd docs and the build files. For Solr itself: Java 1.5 and an application server that supports the Servlet 2.4 standard.

12) All critical systems must be replicated across multiple machines,
with preference to cross-atlantic replication.


Not a problem. Solr has a multi-server replication method built on snapshots, rsync, and hard links.
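
The slave half of the same collection distribution scripts does the pulling; on each replica, roughly (option letters from memory):

    # pull the newest snapshot from the master; rsync plus hard links
    # keep both the transfer and the disk cost small
    solr/bin/snappuller -M master.apache.org
    # swap the pulled snapshot in as the live index
    solr/bin/snapinstaller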

13) All systems must have single command operations to start, restart
and stop the system.  Support for init scripts used by the base
operating system is preferred.


You mean I need to do more than "nohup java -jar solr.jar &"? Sheesh! Seriously, though: since you are probably not planning on running this in Jetty, whatever it lands on (Tomcat or otherwise) probably already has that requirement covered. If not, I'm on it; see the sketch below.
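
If we do end up wanting our own wrapper, it is about ten lines of shell (paths hypothetical):

    #!/bin/sh
    # minimal start/stop wrapper for a Jetty-hosted Solr; a container
    # like Tomcat would bring its own real init script
    SOLR_HOME=/opt/solr/example
    PIDFILE=/var/run/solr.pid
    case "$1" in
      start)
        cd "$SOLR_HOME" && nohup java -jar start.jar >/dev/null 2>&1 &
        echo $! > "$PIDFILE"
        ;;
      stop)
        kill "$(cat $PIDFILE)" && rm -f "$PIDFILE"
        ;;
      restart)
        $0 stop; sleep 2; $0 start
        ;;
      *)
        echo "usage: $0 {start|stop|restart}"; exit 1
        ;;
    esac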

I left out a few requirements, as they are yet to be determined. I am not sure what kind of web front end you might want for the query and results pages, so I cannot speak to the requirements on that end. Updating documents in the Solr index is currently a manual process. It could be adjusted to run at an interval via crontab (one-liner sketched below), be triggered by the formal document builds, or watch svn diffs and re-import when it sees a change in a document it has been told to index. Lastly, I'm doing this Solr Apache-documents thing with or without you; you may as well take advantage of it. :)
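
The crontab route is the easy interim answer; one line does it (script name made up):

    # re-check out, transform, and re-index the docs every night at 3am
    0 3 * * *  /opt/solr/bin/reindex-httpd-docs.sh >> /var/log/solr-reindex.log 2>&1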

chris rhodes
[EMAIL PROTECTED]




