FYI, I have not used DeleteDuplicates, as the concept of duplicates is a little different in our system. But if it works within the context of the Nutch bean, you should be fine. If not (i.e., if it is called as an external program), see below.
You could get away with using it, as the segment still exists. The problem is that if the index actually needs to get the summary for a hit from that segment and does not find it, it will throw a null pointer exception and you'll see nothing. Again, this is simple to fix by changing the Nutch code -- catch the exception and simply skip over that hit.

Though, to keep things from getting corrupted, I would suggest doing all external manipulation offline. From my understanding, Nutch currently supports reading/writing from one JVM, so if you can delete within that context, it's fine -- external manipulation should be external.

This *may* change in the future. See the map and reduce email exchanges over the last day or so; there is talk about allowing multiple crawlers to write to one segment.

Here's a simple strategy that we initially used when we did not use multiple computers and replication: run the external tools to figure out the deletions, but instead of deleting, write to a log file. Then call another program that marks them deletable in a safe manner.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michael Sashnikov
Sent: Saturday, January 15, 2005 10:35 PM
To: [EMAIL PROTECTED]
Subject: RE: [Nutch-dev] SegmentMergeTool without stopping Tomcat?

Chirag,

Thanks a lot for your fast response. It was very helpful. In case I decide to delete duplicates using DeleteDuplicates instead of SegmentMergeTool, should it be done externally as well?

Thanks,
Michael

>From: "Chirag Chaman" <[EMAIL PROTECTED]>
>Reply-To: [EMAIL PROTECTED]
>To: <[EMAIL PROTECTED]>, <[email protected]>
>Subject: RE: [Nutch-dev] SegmentMergeTool without stopping Tomcat?
>Date: Sat, 15 Jan 2005 20:18:10 -0500
>
>Michael:
>
>First, you should always do a merge externally and then add it to the
>server -- it can lead to unexpected results otherwise.
>While some have said that you can do a touch on the web.xml file and
>that should restart Tomcat, we have not even been able to see that work
>reliably.
>
>We added functionality that would reset the bean, so that the Nutch
>bean looks for the new segments when the bean is reset. We did this by
>writing a servlet that is called whenever a new instance of the bean
>needs to be created (which pings the servers for the new segments). On
>Linux the segments are still deletable; on Windows you may need to
>write a function that goes and manually removes the segment name from
>the array (I don't have the code in front of me now -- if the above
>does not make sense, send me an email and I'll get you the exact place
>to make the change next week).
>
>There is another (easier) alternative if you're using a client/server
>config and can invest in a load balancer that sits between the web
>client and segment server.
>
>Hope this helps
>
>CC-
>
>-----Original Message-----
>From: [EMAIL PROTECTED]
>[mailto:[EMAIL PROTECTED] On Behalf Of Michael Sashnikov
>Sent: Saturday, January 15, 2005 6:46 PM
>To: [email protected]
>Subject: [Nutch-dev] SegmentMergeTool without stopping Tomcat?
>
>Is it possible to merge segments using SegmentMergeTool without
>stopping Tomcat? It looks like if Tomcat runs it locks the old segment
>files and SegmentMergeTool cannot delete them. As a result, after
>merging, the search results include both old and new versions of documents.
>
>Config info: Windows XP + Nutch 0.6 + Tomcat 5.5 + JDK 1.5.0.
>
>Thanks
>
>-------------------------------------------------------
>The SF.Net email is sponsored by: Beat the post-holiday blues Get a
>FREE limited edition SourceForge.net t-shirt from ThinkGeek.
>It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt
>_______________________________________________
>Nutch-developers mailing list
>[email protected]
>https://lists.sourceforge.net/lists/listinfo/nutch-developers
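
For anyone who wants to try the "write to a log file, then mark deletable later" strategy from the reply at the top of this thread, here is a rough sketch of the idea in plain Java. It is not the code we ran and it does not use any Nutch classes: DeletionLog, IndexWriterHandle, recordDeletion and applyDeletions are all hypothetical names made up for illustration, and the log is assumed to hold one document id per line. The external tool only ever calls recordDeletion; the one process that is allowed to write calls applyDeletions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Two-phase deletion: external tools append to a log; a single writer applies it later. */
class DeletionLog {

    /** Hypothetical stand-in for whatever owns the index. This is NOT a Nutch interface. */
    interface IndexWriterHandle {
        void markDeleted(String docId) throws IOException;
    }

    /** Phase 1 (external tool): record the decision and touch nothing else. */
    static void recordDeletion(Path log, String docId) throws IOException {
        Files.write(log, List.of(docId), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Phase 2 (the single writing process): apply the log and return the ids applied. */
    static Set<String> applyDeletions(Path log, IndexWriterHandle index) throws IOException {
        Set<String> applied = new TreeSet<>();
        if (!Files.exists(log)) {
            return applied; // nothing was recorded
        }
        for (String line : Files.readAllLines(log, StandardCharsets.UTF_8)) {
            String docId = line.trim();
            if (docId.isEmpty() || !applied.add(docId)) {
                continue; // skip blank lines and ids already handled in this run
            }
            index.markDeleted(docId);
        }
        Files.delete(log); // the log is removed only after every entry was applied
        return applied;
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("deletions", ".log");
        recordDeletion(log, "doc-17");
        recordDeletion(log, "doc-42");
        recordDeletion(log, "doc-17"); // a repeat, applied only once

        List<String> deleted = new ArrayList<>(); // pretend index: just collects ids
        Set<String> applied = applyDeletions(log, deleted::add);
        System.out.println("applied " + applied + ", index saw " + deleted);
    }
}
```

The point of the split is that the external tool never touches the segment or index files, so it can run while the server is up. If applyDeletions fails partway, the log is still on disk and can be replayed, provided the real markDeleted is safe to call twice for the same id (an assumption of this sketch).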
