Two suggestions

misc Fri, 05 Oct 2007 18:27:15 -0700


Hello All-


   Two suggested (small) changes:

Change 1

Use case: Want a list of all ".mov" files found during the crawl, but don't want to actually download them and store them in the content database (too much bandwidth and space!).

Partial solution: filter them out with regex-urlfilter. The problem is that no record of the URL having been parsed is stored anywhere.


   Full proposed solution: Change code in ParseOutputFormat from

(line 173)

   toUrl = filters.filter(toUrl);   // filter the url
   if (toUrl == null) {
     continue;
   }

to (the new line 173)

   if (filters.filter(toUrl) == null) {   // filter the url
     LOG.debug("filtering out " + toUrl);
     continue;
   }

This way, all filtered-out URLs can be saved if the log level is changed to debug. This is also useful for verifying that nothing is accidentally thrown away during a parse.
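
To illustrate the idea outside of the crawler itself, here is a minimal, self-contained sketch of the log-then-skip pattern. It is not the ParseOutputFormat code: the FILTER function is a hypothetical stand-in for the filter chain (it rejects ".mov" URLs by returning null), the class and method names are made up for the example, and java.util.logging at FINE level stands in for LOG.debug.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.logging.Level;
import java.util.logging.Logger;

class FilterLoggingSketch {
    private static final Logger LOG = Logger.getLogger("FilterLoggingSketch");

    // Hypothetical stand-in for the URL filter chain:
    // returns the URL if it is accepted, or null if it is rejected.
    static final UnaryOperator<String> FILTER =
        url -> url.toLowerCase().endsWith(".mov") ? null : url;

    // Keeps accepted URLs and logs each rejected one at debug (FINE) level.
    static List<String> keepAccepted(List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String toUrl : outlinks) {
            // Keep the original URL in toUrl so it is still available for logging.
            String filtered = FILTER.apply(toUrl);
            if (filtered == null) {
                LOG.log(Level.FINE, "filtering out {0}", toUrl);
                continue;
            }
            kept.add(filtered);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
            "http://example.com/index.html",
            "http://example.com/clip.mov");
        System.out.println(keepAccepted(outlinks));
    }
}
```

The sketch holds the filter result in a separate variable, so the rejected URL can still be logged and the accepted URL is whatever the filter returned.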


Change 2

Add pdf to the default regex-urlfilter removal list. There doesn't seem to be any pdf parser (yet), and my output logs are filled with errors about this.
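
As a rough check of what such a rule would catch, the sketch below tests a suffix pattern with java.util.regex. The pattern \.(pdf|PDF)$ is my assumption about how the rule might be written, and the class and method names are hypothetical; the exact syntax of the filter file is not shown here.

```java
import java.util.regex.Pattern;

class PdfSuffixSketch {
    // Assumed exclusion pattern: URLs ending in ".pdf" or ".PDF".
    private static final Pattern PDF_SUFFIX = Pattern.compile("\\.(pdf|PDF)$");

    // Returns true if the URL would be rejected by the assumed pattern.
    static boolean isRejected(String url) {
        return PDF_SUFFIX.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(isRejected("http://example.com/paper.pdf"));
        System.out.println(isRejected("http://example.com/index.html"));
    }
}
```

Note that a suffix pattern like this would not catch pdf URLs that carry a query string after the extension.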


                       thanks
                           -Jim
