Hi guys,

I have been working on NUTCH-61 (Adaptive re-fetch interval. Detecting
unmodified content), applying it to Nutch 0.8.1. Here are some points:

1. This feature is great for Nutch to have, as it differentiates between
   modified and unmodified content and therefore does not index a document
   twice even if its fetch time has arrived.

   a. There are some performance issues here. Even with this patch, Nutch
      still fetches the content and then checks its status against the last
      modified time in the database. If it has to check 1000 files before
      indexing the following 10 files, this will cause a real problem for
      those who are after real-time indexing.

2. Since I applied this patch to Nutch 0.8.1, when I try to parse XML files
   with our modified version of the xmlparser/indexer plugin, the fetcher
   throws the following exception:

   WARN fetcher.Fetcher - Error parsing: file:/C:/880254/8802_583254_20051006_12.xml: failed(2,200): java.lang.IllegalStateException: Root element not set

   The system does not hang or crash, but the XML file is indexed without
   any generated fields. The plugin works fine without the patch. I have
   another parser, which handles graphics and other formats, that also fails
   when used with the patch. So far this problem occurs when using the file
   protocol.

3. The patch works fine when indexing web sites using the http protocol.

I am willing to work with Andrzej to make it stable, as I understand he is
the architect of this patch. I have the possibility of testing it in a mixed
environment in our computer lab. This patch can be the stepping stone for
other features such as real-time indexing and a fetch queue for index
updating, as opposed to creating a new index each time.

Best Regards,

Armel

-------------------------------------------------
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com

-----Original Message-----
From: Enis Soztutar [mailto:[EMAIL PROTECTED]
Sent: 17 January 2007 15:39
To: [email protected]
Subject: Re: Next Nutch release

Sami Siren wrote:
> 2007/1/17, Enis Soztutar <[EMAIL PROTECTED]>:
>>
>> Hi all, for NUTCH-251:
>>
>> I suppose that NUTCH-251 is relatively a significant issue by the votes.
>> Stafan has written a good plugin for the admin gui and i have updated it
>> to work with nutch-0.8, hadoop 0.4.
>
>
> Good to hear someone is working on that! Why not target it to
> trunk version of Nutch?

It is targeted to the trunk already. The previous version was targeted to
nutch-0.8, hadoop 0.4, since back then those versions were the latest in the
trunk.

>
>> - a web server to serve plugin jsp's
>
> Why not make it regular war? also please consider making a clean
> separation of view/logic when you implement the web ui.

As Stafan's version used an embedded Jetty server, I continued this way. But
I will consider that possibility also.

>
> --
> Sami Siren
>
