Awesome! This looks very interesting - I'll take a look over the next few weeks.
-Mark

-----Original Message-----
From: arkadi.kosmy...@csiro.au [mailto:arkadi.kosmy...@csiro.au]
Sent: 17 March 2010 13:59
To: nutch-user@lucene.apache.org
Subject: Announcing release of Arch - an extension of Nutch for intranet search

Hello,

I have been reading this list for quite a while. This was frustrating at times, because very often I thought, "If only I could release Arch now, I could help this..., and this..., and this..." But it was not ready. Now it is ready, and I am more than happy to release it. I hope it will be useful in more than one way. A few examples:

- People often asked how to avoid a complete re-crawl when a crawl fails. With Arch, you can do it. You can split your web site into areas and crawl them separately as needed. Then they are combined into a single index. If a crawl fails and you restart Arch, it will start with the area that failed, skipping the areas that are already indexed.
- People asked how to use Nutch classes from Java. Arch does that; see the sources.
- People had issues with updating pages in the index. Arch does not have this problem.

Arch has a lot more than the above. For me, as a webmaster, it has everything that I can ask for: document-level security, easy support for multiple web sites, modular pluggable authentication, an automatic dynamic site directory, and scheduled, cheap index updates.

A very important feature is an improved document weighting scheme. It works fantastically on intranets. No more user complaints about finding junk instead of what they expect to find.

Arch has a dual (PHP and JSP) interface. For those of you who prefer PHP to Java, the PHP interface will be easier to customise.

More information, sources, screenshots and binaries are available here:

http://www.atnf.csiro.au/computing/software/arch/index.html

Sorry, no demo is available, as Arch runs behind the firewall at ATNF. I hope to get it out in the open in a few days.

Regards,

Arkadi Kosmynin
CSIRO Astronomy and Space Science