Excluding html files and following links

Erlend Garåsen Mon, 20 Jun 2011 07:55:17 -0700

I just realized that if I exclude html files for a job, links in these files will not be followed. Is this a desirable behaviour? Should links be followed regardless of the exclude filter?

I discovered this issue when I was going to crawl only pdfs and realized that the job ended without finding any documents at all. I think I had something like this in my include list:

http://foreninger.uio.no/.*\.pdf$
http://folk.uio.no/.*\.pdf$
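
To illustrate what I think happens, here is a minimal Python sketch. It is not the crawler's actual code: the helper is_included() and the two example URLs are hypothetical, and only the two patterns are taken from my include list. It assumes that a URL which does not match any include pattern is never fetched, so its links are never seen.

```python
import re

# The two include patterns from the job, used exactly as written above.
INCLUDE_PATTERNS = [
    re.compile(r"http://foreninger.uio.no/.*\.pdf$"),
    re.compile(r"http://folk.uio.no/.*\.pdf$"),
]


def is_included(url: str) -> bool:
    """Hypothetical filter: True if the URL matches any include pattern."""
    return any(pattern.search(url) for pattern in INCLUDE_PATTERNS)


# Hypothetical URLs: an html page that links to a pdf.
html_page = "http://folk.uio.no/index.html"
pdf_file = "http://folk.uio.no/docs/report.pdf"

print(is_included(html_page))  # the html page is filtered out
print(is_included(pdf_file))   # the pdf itself would be accepted
```

If the filter is applied before fetching, the html page is rejected, so the pdf it links to is never discovered even though the pdf URL itself matches.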

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
