FYI, I have not used DeleteDuplicates, as in our system the concept of
duplicates is a little different. But if it works within the context of the
Nutch bean, you should be fine. If not (i.e., it is called as an external
program), see below.

You could get away with using it, as the segment still exists. The problem
now is that if the index actually needs to get the summary for a hit from
that segment and does not find it, it will throw a null pointer exception
and you'll see nothing. Again, this is simple to fix by changing the Nutch
code -- catch the exception and simply skip over that hit.
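
To make the "catch and skip" idea concrete, here is a rough sketch in plain
Java. It is not the actual Nutch code: SafeSummaries and SummaryLookup are
names I made up for illustration, and the lookup stands in for whatever call
fetches the summary from the segment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the Nutch code base.
final class SafeSummaries {

    /** Stand-in for the call that fetches a hit's summary from its segment. */
    interface SummaryLookup {
        String summaryFor(String hitId);
    }

    private SafeSummaries() {}

    /**
     * Collects a summary for every hit and drops the hits whose lookup fails,
     * whether it throws (e.g. a NullPointerException) or returns null.
     */
    static List<String> summarize(List<String> hitIds, SummaryLookup lookup) {
        List<String> summaries = new ArrayList<String>();
        for (String hitId : hitIds) {
            try {
                String summary = lookup.summaryFor(hitId);
                if (summary != null) {
                    summaries.add(summary);
                }
            } catch (RuntimeException e) {
                // The segment data for this hit is missing: skip this one hit
                // instead of letting the whole results page fail.
                System.err.println("Skipping hit " + hitId + ": " + e);
            }
        }
        return summaries;
    }
}
```

The point is only that one missing summary costs you one hit, not the whole
page of results.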

Though, to keep stuff from getting corrupted, I would suggest doing all
external manipulation offline -- from my understanding, Nutch currently
supports reading/writing from one JVM, so if you can delete within that
context, it's fine -- anything done externally should be done offline.

This *may* change in the future. See the map and reduce email exchanges over
the last day or so. There is talk about allowing multiple crawlers to write
to one segment.

Here's a simple strategy that we initially used when we did not use multiple
computers and replication -- run the external tools to figure out the
deletions, but instead of deleting, write them to a log file. Then call
another program that marks them as deletable in a safe manner.
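
A minimal sketch of that two-step approach, assuming one document id per line
in the log. DeletionLog and DocDeleter are hypothetical names, not Nutch
classes, and the DocDeleter is a placeholder for whatever actually marks a
document in the index from inside the one JVM that owns it.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: DeletionLog and DocDeleter are made-up names.
final class DeletionLog {

    /** Placeholder for the code that marks a document in the index. */
    interface DocDeleter {
        void markDeleted(String docId) throws IOException;
    }

    private DeletionLog() {}

    /** Step 1 (external tool): append the id to the log, leave the index alone. */
    static void record(File log, String docId) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(log, true)); // true = append
        try {
            out.println(docId);
            if (out.checkError()) { // PrintWriter hides write errors, so ask explicitly
                throw new IOException("could not write to " + log);
            }
        } finally {
            out.close();
        }
    }

    /** Step 2 (safe context): replay the log once per distinct id; returns the count. */
    static int apply(File log, DocDeleter deleter) throws IOException {
        Set<String> ids = new LinkedHashSet<String>(); // keeps order, drops repeats
        BufferedReader in = new BufferedReader(new FileReader(log));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String id = line.trim();
                if (id.length() > 0) {
                    ids.add(id);
                }
            }
        } finally {
            in.close();
        }
        for (String id : ids) {
            deleter.markDeleted(id);
        }
        return ids.size();
    }
}
```

The external tools only ever call record(); apply() runs later, at a time
when nothing else is touching the index.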


-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Michael
Sashnikov
Sent: Saturday, January 15, 2005 10:35 PM
To: [EMAIL PROTECTED]
Subject: RE: [Nutch-dev] SegmentMergeTool without stopping Tomcat?

Chirag,

Thanks a lot for your fast response. It was very helpful.

In case I decide to delete duplicates using DeleteDuplicates instead of
SegmentMergeTool, should it be done externally as well?

Thanks,
Michael


>From: "Chirag Chaman" <[EMAIL PROTECTED]>
>Reply-To: [EMAIL PROTECTED]
>To: <[EMAIL PROTECTED]>, <[email protected]>
>Subject: RE: [Nutch-dev] SegmentMergeTool without stopping Tomcat?
>Date: Sat, 15 Jan 2005 20:18:10 -0500
>
>Michael:
>
>First, you should always do a merge externally and then add it to the 
>server -- doing otherwise can lead to unexpected results. While some 
>have said that you can do a touch on the web.xml file and that should 
>restart Tomcat, we have not even been able to see that work reliably.
>
>We added functionality that would reset the bean -- so that the Nutch 
>bean looks for the new segments when the bean is reset. We did this by 
>writing a servlet that is called whenever a new instance of the bean 
>needs to be created (which pings the servers for the new segments). On 
>Linux the segments are still deletable; on Windows you may need to 
>write a function that goes and manually removes the segment name from 
>the array (I don't have the code in front of me now -- if the above 
>does not make sense, send me an email and I'll get you the exact place 
>to make the change next week).
>
>There is another (easier) alternative if you're using a client/server 
>config and can invest in a load balancer that sits between the web 
>client and segment server.
>
>Hope this helps
>
>CC-
>
>
>
>-----Original Message-----
>From: [EMAIL PROTECTED]
>[mailto:[EMAIL PROTECTED] On Behalf Of 
>Michael Sashnikov
>Sent: Saturday, January 15, 2005 6:46 PM
>To: [email protected]
>Subject: [Nutch-dev] SegmentMergeTool without stopping Tomcat?
>
>Is it possible to merge segments using SegmentMergeTool without 
>stopping Tomcat? It looks like if Tomcat runs it locks the old segment 
>files and SegmentMergeTool cannot delete them. As a result after 
>merging the search result includes both old and new versions of documents.
>
>Config info: Windows XP + Nutch 0.6 + Tomcat 5.5 + JDK 1.5.0.
>
>Thanks
>
>
>
>
>-------------------------------------------------------
>The SF.Net email is sponsored by: Beat the post-holiday blues Get a 
>FREE limited edition SourceForge.net t-shirt from ThinkGeek.
>It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt
>_______________________________________________
>Nutch-developers mailing list
>[email protected]
>https://lists.sourceforge.net/lists/listinfo/nutch-developers
>
>
>
>







