Jason Boss wrote:

> Hey all,
>
> First time for this one... can anyone shed some light on this?
>
> This guy called and was really crazy on the phone. I told him to send us an
> email. First, how do you deal with issues like this?

I'm not a lawyer, but their content is already cached by many other search engines, so why don't they go and sue Google... It would be funny, if not for the fact that such people like to sue everybody in sight, especially if the target is not likely to have a huge budget for legal spending...

> Second, how do we pull individual sites from an index and cache?

First, you need to remove them from the webDB, otherwise their pages will pop up again when you generate fetchlists. You can do it with something like the following (currently untested, and it looks needlessly complicated...):


        WebDBReader reader = new WebDBReader(..);
        WebDBWriter writer = new WebDBWriter(..);
        // enumerate all pages in the reader
        for (Enumeration e = reader.pages(); e.hasMoreElements(); ) {
            Page p = (Page) e.nextElement();
            String url = p.getURL().toString();
            if (url.indexOf(forbiddenURL) != -1) {
                // schedule the page itself for deletion...
                writer.deletePage(url);
                // ...together with the links associated with it
                Link[] links = reader.getLinks(p.getURL());
                for (int i = 0; i < links.length; i++) {
                    writer.deleteLink(MD5Hash.digest(links[i].getURL().toString()));
                }
            }
        }
        reader.close();
        // now commit the changes to the webDB
        writer.close();

Then, you need to make sure that they won't be added to the webDB again. You can do this by specifying appropriate exclusion patterns in the RegexURLFilter configuration file.
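
As a rough sketch (assuming the stock regex-urlfilter setup - the file name is whatever your urlfilter.regex.file property points to, and "forbidden.example.com" is just a placeholder): each line starts with '+' or '-' followed by a Java regex, and the first matching pattern decides whether a URL is accepted, so the exclusion has to come before the catch-all rule:

        # skip everything from the offending site
        -^http://([a-z0-9]*\.)*forbidden\.example\.com/

        # accept anything else
        +.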

And finally, you can delete their entries from the current indexes, using something like this:

// IndexReader has no public constructor - use the static open() method
org.apache.lucene.index.IndexReader reader =
        org.apache.lucene.index.IndexReader.open(pathToLuceneIndexDir);
// if you used the plugin for "site:", this is simple:
reader.delete(new Term("site", forbiddenSiteHostName));
// deletions take effect when the reader is closed
reader.close();
...


// alternatively, you can construct a PrefixQuery and delete
// all matching documents (note: Hits fetches results lazily, so for
// very large result sets it is safer to collect the doc ids first)
org.apache.lucene.index.IndexReader reader =
        org.apache.lucene.index.IndexReader.open(pathToLuceneIndexDir);
PrefixQuery pq = new PrefixQuery(new Term("url", "http://" + forbiddenHostName));
IndexSearcher s = new IndexSearcher(reader);
Hits h = s.search(pq);
for (int i = 0; i < h.length(); i++) {
        reader.delete(h.id(i));
}
s.close();
reader.close();
...


--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


