The best way to get this included is to submit a JIRA ticket and attach
your patch there. One or more of the committers, time allowing, will
then take a look at your patch for inclusion.
Dennis Kubes
misc wrote:
Hi all-
I asked about this before but got no answer, so I will try again.
I have included an svn diff with a small proposed change that would allow
users to track content that is found but filtered out during the crawl. This
is useful both as a diagnostic tool (to see what we are skipping) and as a
way to discover content and links pointed to by a page or site without having
to actually download that content.
I have set the log level to info, perhaps it should be debug.
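For anyone who finds the new message too chatty at INFO, a sketch of how it could be silenced per-class via log4j. This assumes the stock Nutch conf/log4j.properties layout and the ParseOutputFormat class name from the patch; adjust to your setup.

```properties
# Hypothetical tweak: raise the threshold for just this class so the new
# "filtering ..." messages are suppressed while everything else stays at INFO.
log4j.logger.org.apache.nutch.parse.ParseOutputFormat=WARN
```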
I think this would be a useful addition for many users.
Could someone make this change? If I am misunderstanding something and
there are already better ways to do this, what are they?
see you
-Jim
Index: ParseOutputFormat.java
===================================================================
--- ParseOutputFormat.java (revision 593619)
+++ ParseOutputFormat.java (working copy)
@@ -193,17 +193,20 @@
         toHost = null;
       }
       if (toHost == null || !toHost.equals(fromHost)) { // external links
+        LOG.info("filtering externalLink " + toUrl + " linked to by " + fromUrl);
+
         continue;                                       // skip it
       }
     }
     try {
       toUrl = normalizers.normalize(toUrl,
                 URLNormalizers.SCOPE_OUTLINK);          // normalize the url
-      toUrl = filters.filter(toUrl);                    // filter the url
-      if (toUrl == null) {
-        continue;
-      }
-    } catch (Exception e) {
+
+      if (filters.filter(toUrl) == null) {              // filter the url
+        LOG.info("filtering content " + toUrl + " linked to by " + fromUrl);
+        continue;
+      }
+    } catch (Exception e) {
       continue;
     }
     CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
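To make the intent of the hunk above concrete, here is a minimal, self-contained sketch of the pattern it implements: run each outlink through a filter chain, and log and skip the link when the chain rejects it. The UrlFilter interface, the sample rules, and the class name are hypothetical stand-ins, not the real Nutch URLFilters API.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkFilterSketch {

    /** Stand-in for a Nutch URLFilter: returns the url, or null to reject it. */
    interface UrlFilter {
        String filter(String url);
    }

    static final List<UrlFilter> FILTERS = new ArrayList<>();
    static {
        // Hypothetical rule: reject anything that is not plain http.
        FILTERS.add(url -> url.startsWith("http://") ? url : null);
        // Hypothetical rule: reject image links.
        FILTERS.add(url -> url.endsWith(".jpg") ? null : url);
    }

    /** Returns true if the outlink survives the chain; logs rejects as the patch does. */
    static boolean accept(String toUrl, String fromUrl) {
        String filtered = toUrl;
        for (UrlFilter f : FILTERS) {
            filtered = f.filter(filtered);
            if (filtered == null) {
                // Same idea as the patch's LOG.info(...) call.
                System.out.println("filtering content " + toUrl
                        + " linked to by " + fromUrl);
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        accept("http://example.com/page.html", "http://a.com/"); // kept, no output
        accept("http://example.com/img.jpg", "http://a.com/");   // logged and skipped
        accept("ftp://example.com/file", "http://a.com/");       // logged and skipped
    }
}
```

Note the one behavioral difference from the pre-patch code: the patched version tests `filters.filter(toUrl)` without assigning the result back to `toUrl`, so any filter that rewrites (rather than just accepts or rejects) a URL would no longer take effect.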