Hello All-

   Two suggested (small) changes:

Change 1

Use case: I want a list of all ".mov" files found during a crawl, but I don't want to actually download them and store them in the content database (too much bandwidth and space!).

Partial solution: filter them out with regex-urlfilter. The problem is that no record of the URL having been found during the parse is stored anywhere.

   Full proposed solution: Change code in ParseOutputFormat from

(line 173)

   toUrl = filters.filter(toUrl);   // filter the url
   if (toUrl == null) {
     continue;
   }

to (the new line 173)

   if (filters.filter(toUrl) == null) {   // filter the url
     LOG.debug("filtering out " + toUrl);
     continue;
   }

This way, all filtered-out URLs can be saved by changing the log level to debug. It is also useful for verifying that nothing is accidentally getting thrown away in a parse.
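
For anyone who wants to try the idea outside of the crawler, here is a minimal standalone sketch of the same pattern. It is not the actual ParseOutputFormat code: the class name, the filter() method, and the REJECT pattern are all made up for illustration, and it uses java.util.logging (where FINE is roughly the debug level) instead of the project's logger. It also keeps the filter result in a separate variable, on the assumption that a filter might hand back a rewritten URL that should not be lost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;
import java.util.regex.Pattern;

class FilterLoggingSketch {
    private static final Logger LOG =
        Logger.getLogger(FilterLoggingSketch.class.getName());

    // Hypothetical rejection rule standing in for the configured URL filters.
    private static final Pattern REJECT =
        Pattern.compile("\\.(mov|pdf)$", Pattern.CASE_INSENSITIVE);

    // Stand-in for the filter chain: returns the URL to keep, or null to drop it.
    static String filter(String url) {
        return REJECT.matcher(url).find() ? null : url;
    }

    // Returns the URLs that pass the filter. Each dropped URL is logged at
    // FINE level and also appended to 'rejected' so the caller can inspect it.
    static List<String> filterOutlinks(List<String> outlinks, List<String> rejected) {
        List<String> kept = new ArrayList<>();
        for (String toUrl : outlinks) {
            String filtered = filter(toUrl);   // keep the original for logging
            if (filtered == null) {
                LOG.fine("filtering out " + toUrl);
                rejected.add(toUrl);
                continue;
            }
            kept.add(filtered);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> rejected = new ArrayList<>();
        List<String> kept = filterOutlinks(Arrays.asList(
            "http://example.com/index.html",
            "http://example.com/clip.mov"), rejected);
        System.out.println("kept: " + kept);
        System.out.println("rejected: " + rejected);
    }
}
```

With the default java.util.logging setup the FINE messages are not printed, which mirrors the point above: the record of dropped URLs only shows up once the log level is turned up.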

Change 2

Add pdf to the default regex-urlfilter removal list. There doesn't seem to be a pdf parser (yet), and my output logs are filled with errors about this.

                       thanks
                           -Jim
