I just realized that if I exclude HTML files from a job, the links in those files are not followed. Is this the desired behaviour? Should links be followed regardless of the exclude filter?
I discovered this issue when I tried to crawl only PDFs and the job ended without finding any documents at all. I think I had something like this in my include list:
http://foreninger.uio.no/.*\.pdf$
http://folk.uio.no/.*\.pdf$

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
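P.S. To illustrate what I think is happening, here is a minimal sketch in Python. It is not the crawler's actual code. The toy link graph, the `crawl` function and the `follow_filtered` switch are all hypothetical, and I am assuming that the seed URL is always fetched. Only the two include patterns are taken from my job.

```python
import re
from collections import deque

# The two include patterns from the job, copied as written.
INCLUDE = [re.compile(p) for p in (
    r"http://foreninger.uio.no/.*\.pdf$",
    r"http://folk.uio.no/.*\.pdf$",
)]

# Hypothetical toy site: URL -> links found on that page.
LINKS = {
    "http://folk.uio.no/": ["http://folk.uio.no/a/index.html"],
    "http://folk.uio.no/a/index.html": ["http://folk.uio.no/a/paper.pdf"],
    "http://folk.uio.no/a/paper.pdf": [],
}


def included(url):
    """True if the URL matches at least one include pattern."""
    return any(p.match(url) for p in INCLUDE)


def crawl(seed, follow_filtered):
    """Breadth-first crawl over LINKS, returning the indexed URLs.

    follow_filtered=False: a URL that fails the filter is neither
    indexed nor parsed for links (the behaviour I seem to observe).
    follow_filtered=True: such a URL is still parsed for links, but
    it is not indexed (the behaviour I am asking about).
    """
    indexed, seen, queue = [], {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        matches = included(url)
        if matches:
            indexed.append(url)
        # Assumption: the seed is always fetched, whatever the filter says.
        if not (matches or follow_filtered or url == seed):
            continue
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


print(crawl("http://folk.uio.no/", follow_filtered=False))
print(crawl("http://folk.uio.no/", follow_filtered=True))
```

In this sketch the first call indexes nothing, because the only path to the PDF goes through an HTML page that the include list rejects. The second call reaches the PDF because filtering is applied to indexing only, not to link extraction.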
