Hi Erlend,

The inclusions and exclusions are based solely on URL, and block the connector from fetching the file. Otherwise you would easily wind up fetching the entire web.
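
To make that concrete, here is a rough sketch of the behaviour as a toy crawler. This is not the connector's actual code: the `PAGES` link graph, the `crawl` function, and the example URLs under folk.uio.no are all made up for illustration, and it assumes the include patterns are applied to the seed URL as well. The point is only that a URL rejected by the filter is never fetched, so its links are never extracted.

```python
import re
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found in that document.
PAGES = {
    "http://folk.uio.no/index.html": [
        "http://folk.uio.no/a.pdf",
        "http://folk.uio.no/sub.html",
    ],
    "http://folk.uio.no/sub.html": ["http://folk.uio.no/b.pdf"],
    "http://folk.uio.no/a.pdf": [],
    "http://folk.uio.no/b.pdf": [],
}


def crawl(seeds, include_patterns):
    """Breadth-first crawl where the include filter is checked before fetching."""
    includes = [re.compile(p) for p in include_patterns]
    fetched = []
    seen = set()
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        # Filter on URL only: a rejected URL is not fetched at all,
        # so the links inside it are never discovered.
        if not any(p.search(url) for p in includes):
            continue
        fetched.append(url)
        queue.extend(PAGES.get(url, []))
    return fetched


seed = ["http://folk.uio.no/index.html"]
pdf_only = crawl(seed, [r"http://folk\.uio\.no/.*\.pdf$"])
everything = crawl(seed, [r"http://folk\.uio\.no/.*"])
```

With the pdf-only pattern the HTML seed is rejected before it is fetched, so the crawl ends with no documents, which matches what you saw. With the broader pattern the HTML pages are fetched and the pdfs behind them are found.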

However, this raises an interesting issue as to whether there's a way in the web connector to do what you are trying to do, which is to filter based on URL after links have been extracted. The current inclusions/exclusions work fine for any URLs without links but do not allow for the case you are looking for.

Can you create a ticket? The suggestion would be to introduce post-extraction inclusions and exclusions into the connector.

Karl

On Mon, Jun 20, 2011 at 10:53 AM, Erlend Garåsen <[email protected]> wrote:
>
> I just realized that if I exclude html files for a job, links in these files
> will not be followed. Is this a desirable behaviour? Should links be
> followed regardless of the exclude filter?
>
> I discovered this issue when I was going to crawl only pdfs and realized
> that the job ended without finding any documents at all. I think I had
> something like this in my include list:
> http://foreninger.uio.no/.*\.pdf$
> http://folk.uio.no/.*\.pdf$
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
