Hello All-
Two suggested (small) changes:
Change 1
Use case: I want a list of all ".mov" files found during a crawl, but I
don't want to actually download them and store them in the content
database (too much bandwidth and space!).
Partial solution: filter them out with regex-urlfilter. The problem is
that no record of the filtered URL is stored anywhere.
Full proposed solution: Change the code in ParseOutputFormat from
(line 173)

    toUrl = filters.filter(toUrl); // filter the url
    if (toUrl == null) {
      continue;
    }

to (the new line 173)

    if (filters.filter(toUrl) == null) // filter the url
    {
      LOG.debug("filtering out " + toUrl);
      continue;
    }

This way, all filtered-out URLs can be recorded by changing the log level
to debug. This is also useful for verifying that nothing is accidentally
getting thrown away in a parse.
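
For anyone who wants to try the idea outside the crawler, here is a minimal, self-contained sketch of the same pattern. It is not the actual ParseOutputFormat code: the class name, the SKIP_MOV filter, and the "rejected" list are all made up for illustration, and java.util.logging (LOG.fine) stands in for whatever logger the real class uses. It also keeps the filter's return value in a separate variable, on the assumption that a filter might hand back a changed string, while still logging the original URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.logging.Logger;

class FilterLoggingSketch {
    // Stand-in logger; the real class would use its own LOG.
    private static final Logger LOG =
        Logger.getLogger(FilterLoggingSketch.class.getName());

    // Hypothetical filter: returns the URL to keep it, or null to reject it.
    static final UnaryOperator<String> SKIP_MOV =
        url -> url.toLowerCase().endsWith(".mov") ? null : url;

    // Applies the filter to each outlink, logging and collecting rejected URLs.
    static List<String> keep(List<String> outlinks,
                             UnaryOperator<String> filters,
                             List<String> rejected) {
        List<String> kept = new ArrayList<>();
        for (String toUrl : outlinks) {
            // Keep the original URL so it is still available for the log message.
            String filtered = filters.apply(toUrl);
            if (filtered == null) {
                LOG.fine("filtering out " + toUrl);
                rejected.add(toUrl);
                continue;
            }
            kept.add(filtered);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> rejected = new ArrayList<>();
        List<String> kept = keep(
            Arrays.asList("http://example.com/a.html", "http://example.com/b.mov"),
            SKIP_MOV, rejected);
        System.out.println("kept: " + kept);
        System.out.println("rejected: " + rejected);
    }
}
```

The "rejected" list is only there to make the behavior easy to check; in the proposed change the debug log itself is the record.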
Change 2
Add pdf to the default regex-urlfilter removal list. There doesn't
seem to be any pdf parser (yet), and my output logs are filled with errors
about this.
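
A sketch of what such an entry might look like, assuming the usual regex-urlfilter syntax in which a leading "-" excludes any URL matching the pattern. The exact line is my guess, and in practice pdf could just as well be added to the existing suffix rule rather than given its own line.

```text
# skip PDF documents until a pdf parser is available
-\.(pdf|PDF)$
```

Since the first matching rule decides, this would need to sit above any catch-all accept rule.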
thanks
-Jim