As I wrote, I had a look at PruneIndexTool before writing my own. As I understand it, PruneIndexTool removes data from the index based on a Lucene query. It was not obvious to me how to construct a Lucene query that removes
www.abc.com but not (www.def.abc.com or www.def.com/abc or www.aabc.com/def), taking into account all the strange things the Nutch tokenizer does with such addresses.
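To make the matching rule concrete, here is a minimal sketch of the kind of check I mean. It is only an illustration: the class HostPruneSketch and the method shouldPrune are hypothetical names, not part of Nutch or PruneIndexTool, and it assumes every URL carries a scheme such as http://. It compares the parsed host exactly instead of relying on tokenized text.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch or PruneIndexTool.
final class HostPruneSketch {

    /**
     * Returns true only when the host of the URL equals targetHost exactly
     * (case-insensitive). Subdomains, other hosts, and path fragments that
     * merely contain the same letters do not match.
     */
    static boolean shouldPrune(String url, String targetHost) {
        try {
            // Assumption: the URL includes a scheme; without one getHost() returns null.
            String host = new URI(url).getHost();
            return host != null
                && host.toLowerCase(Locale.ROOT).equals(targetHost.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            return false; // URLs that cannot be parsed are left in the index
        }
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.abc.com/index.html",
            "http://www.def.abc.com/",
            "http://www.def.com/abc",
            "http://www.aabc.com/def"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + shouldPrune(url, "www.abc.com"));
        }
    }
}
```

A tool built this way has to visit every document and test its stored URL, which is why it cannot simply be expressed as a single query.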
I also looked at the possibility of integration, but found that your tool does not go through all documents; it runs a full-text search instead.
So the control flow is very different in the two cases, but of course we can try to create a common facade for both tools.
I think the two tools serve slightly different purposes. One removes all documents found by a given Lucene query, which is excellent for removing all pages containing a given bad word or phrase (let's take "mortgage refinancement" as an example). In the second case one wants to remove some specific pages or sites from an index. I agree your tool is more general, and the same result can probably be achieved using phrase queries and converting the queries with NutchDocumentAnalyzer, but I think the use case for a simpler tool is quite common (especially for domain-restricted search engines), and for Nutch users who may not be Lucene experts such a tool might be of some value.
I am not sure whether it should be added to Nutch, which is why I wrote an email rather than starting to port it before receiving comments.
I hope this explains my thinking behind reinventing the wheel :)
I am not planning to do anything with it right now; if nobody else finds it useful, I can live with using it on my own. I am just going one by one through the features I have implemented and checking whether they might be of interest to the Nutch community. We have benefited a lot just by using Nutch, so giving back my small fixes and tools is our small attempt to help others and push the whole thing forward.
Regards,
Piotr
Andrzej Bialecki wrote:
Piotr Kosiorowski wrote:
If anyone would find such addition useful and worth committing to nutch SVN I can update my source code (ASF license/ package names/ remove JDK 1.5 dependencies) and send it to the list.
You're a bit late ;-) Please take a look at PruneIndexTool. If you think your tool solves this or that in a better way, no problem - we can merge these two.
