Hi all,
While working on a domain-specific search engine I ran into a problem: a sophisticated automatic classification algorithm was unable to categorize a page correctly. On top of that, the page could be considered offensive by many people, so there was strong pressure to remove it from the index. Sometimes, after a quick look at the data, one can also identify "spammer" sites - sites containing many pages that should not be part of an index despite being domain-related. I had a look at PruneIndexTool but could not find a way to use it the way I wanted. Besides, writing a small program with exactly the functionality described above is quite easy.
So some time ago I wrote a small utility program that takes a segment and a simple text file with URLs or host names as parameters and removes the given URLs or sites from the Lucene index in that segment. It works in a very simple way:
1) load all URL or site information into a HashSet in memory
2) iterate over all Lucene documents, removing unwanted ones.
It has a drawback: the unwanted site/URL information must fit into JVM memory. Usually this is not a big problem, since one typically removes special cases found manually; if it does become a problem, one can split the URL file into several files and process them one by one. A rough sketch of the approach is below.
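To give an idea, here is a minimal sketch of the two steps (not my actual code): it assumes the Lucene 1.4-era IndexReader API, that the indexed URL is stored in a field named "url", and that host matching via java.net.URL is acceptable - all three are assumptions of the sketch.

import java.io.BufferedReader;
import java.io.FileReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

// Sketch only: deletes documents whose "url" field, or its host,
// appears in a plain-text list (one URL or host name per line).
public class DeleteUrls {
  public static void main(String[] args) throws Exception {
    String indexDir = args[0];   // path to the segment's index
    String listFile = args[1];   // text file with URLs/host names

    // 1) load all URL or site information into a HashSet in memory
    HashSet entries = new HashSet();
    BufferedReader in = new BufferedReader(new FileReader(listFile));
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() > 0) entries.add(line);
    }
    in.close();

    // 2) iterate over all Lucene documents, removing unwanted ones
    IndexReader reader = IndexReader.open(indexDir);
    int deleted = 0;
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;
      Document doc = reader.document(i);
      String url = doc.get("url");   // assumed field name
      if (url == null) continue;
      String host = null;
      try {
        host = new URL(url).getHost();
      } catch (MalformedURLException e) {
        // no host to match; still check the full URL below
      }
      if (entries.contains(url) || (host != null && entries.contains(host))) {
        reader.delete(i);            // mark the document as deleted
        deleted++;
      }
    }
    reader.close();
    System.out.println("Deleted " + deleted + " documents.");
  }
}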
If anyone finds such an addition useful and worth committing to the Nutch SVN, I can update my source code (ASF license / package names / remove JDK 1.5 dependencies) and send it to the list.
Regards,
Piotr
