Hi guys,

 

I have been working on NUTCH-61 (Adaptive re-fetch interval. Detecting
unmodified content), applying it to Nutch 0.8.1. Here are some points:

 

1.    This feature is great for Nutch to have, as it differentiates between
modified and unmodified content and therefore does not index a document
twice even if its fetch time has arrived.

a.    There is a performance issue here. Even with this patch, Nutch
still fetches the content and only then checks its status against the last
modified time in the database. If it has to check 1000 files before
indexing the following 10 files, this will cause a real problem for those
who are after real-time indexing.
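
To illustrate the point above for the file protocol, here is a minimal
sketch of checking the modification time before reading any content. The
class UnmodifiedFilter and its methods are hypothetical names made up for
this example; they are not part of Nutch or of the NUTCH-61 patch, and the
map of recorded times merely stands in for whatever the database stores.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Nutch or NUTCH-61.
class UnmodifiedFilter {

    // True when the document must be fetched again: either it was never
    // seen before, or its current mtime is newer than the recorded one.
    static boolean isModified(long currentMillis, Long recordedMillis) {
        return recordedMillis == null || currentMillis > recordedMillis;
    }

    // Keeps only the files that need re-fetching. Reads file metadata only;
    // the content is never opened. If the mtime cannot be read, the file is
    // kept so that it still gets fetched (the conservative choice).
    static List<Path> needsRefetch(List<Path> candidates,
                                   Map<Path, Long> recordedMillis) {
        List<Path> changed = new ArrayList<>();
        for (Path p : candidates) {
            try {
                long current = Files.getLastModifiedTime(p).toMillis();
                if (isModified(current, recordedMillis.get(p))) {
                    changed.add(p);
                }
            } catch (IOException e) {
                changed.add(p);
            }
        }
        return changed;
    }
}
```

With something along these lines, the 1000 unchanged files would cost one
metadata lookup each instead of a full fetch. For the http protocol the
closest equivalent would presumably be a conditional request using the
If-Modified-Since header.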

 

2.    Since I applied this patch to Nutch 0.8.1, when I try to parse XML
files with our modified version of the xmlparser/indexer plugin, the
fetcher throws the following exception:

 

WARN  fetcher.Fetcher - Error parsing:
file:/C:/880254/8802_583254_20051006_12.xml: failed(2,200):
java.lang.IllegalStateException: Root element not set

 

The system does not hang or crash, but the XML file is indexed without
any generated fields. The plugin works fine without the patch. I have
another parser, which parses graphics and other formats, that also fails
when used with the patch. So far this problem occurs only when using the
file protocol.

 

3.    The patch works fine when indexing web sites using the http protocol.

 

I am willing to work with Andrzej to make it stable, as I understand he is
the architect of this patch. I have the possibility of testing it in a mixed
environment in our computer lab. This patch can be the stepping stone for
other features such as real-time indexing and a fetch queue for index
updating, as opposed to creating a new index each time.

 

Best Regards,

 

Armel

 

-------------------------------------------------

Armel T. Nene

iDNA Solutions

Tel: +44 (207) 257 6124

Mobile: +44 (788) 695 0483 

http://blog.idna-solutions.com

-----Original Message-----
From: Enis Soztutar [mailto:[EMAIL PROTECTED] 
Sent: 17 January 2007 15:39
To: nutch-dev@lucene.apache.org
Subject: Re: Next Nutch release

 

Sami Siren wrote:

> 2007/1/17, Enis Soztutar <[EMAIL PROTECTED]>:

>> 

>> Hi all, for NUTCH-251:

>> 

>> I suppose that NUTCH-251 is relatively a significant issue by the votes.

>> Stafan has written a good plugin for the admin gui and i have updated it

>> to work with nutch-0.8, hadoop 0.4.

> 

> 

> Good to hear someone is working on that! Why not target it to

> trunk version of Nutch?

It is targeted to the trunk already. The previous one was targeted to
nutch-0.8, hadoop 0.4, since back then those versions were the latest in
the trunk.

> 

>>  - a web server to serve plugin jsp's

> 

> Why not make it regular war? also please consider making a clean

> separation of view/logic when you implement the web ui.

As Stafan's version used an embedded Jetty server, I continued this way.
But I will consider that possibility also.

 

> 

> -- 

> Sami Siren

> 

 

 
