Have there been any further developments on this thread?

Karl

On Tue, Jun 21, 2011 at 6:08 AM, Karl Wright <[email protected]> wrote:
> Sure. But you've already convinced me we need a new feature. ;-)
>
> Karl
>
> On Tue, Jun 21, 2011 at 3:50 AM, Erlend Garåsen <[email protected]>
> wrote:
>>
>> Sure, I can create a ticket. But first I want to discuss this issue with the
>> two search consultants we have hired.
>>
>> I decided to post to the dev list in order to get some feedback on this
>> issue.
>>
>> Erlend
>>
>> On 20.06.11 18.00, Karl Wright wrote:
>>>
>>> Hi Erlend,
>>>
>>> The inclusions and exclusions are based solely on URL, and block the
>>> connector from fetching the file. Otherwise you would easily wind up
>>> fetching the entire web.
>>>
>>> However, this raises an interesting issue as to whether there's a way
>>> in the web connector to do what you are trying to do, which is to
>>> filter based on URL after links have been extracted. The current
>>> inclusions/exclusions work fine for any URLs without links but do not
>>> allow for the case you are looking for.
>>>
>>> Can you create a ticket? The suggestion would be to introduce
>>> post-extraction inclusions and exclusions into the connector.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Jun 20, 2011 at 10:53 AM, Erlend Garåsen
>>> <[email protected]> wrote:
>>>>
>>>> I just realized that if I exclude html files for a job, links in these
>>>> files will not be followed. Is this a desirable behaviour? Should links
>>>> be followed regardless of the exclude filter?
>>>>
>>>> I discovered this issue when I was going to crawl only pdfs and realized
>>>> that the job ended without finding any documents at all. I think I had
>>>> something like this in my include list:
>>>> http://foreninger.uio.no/.*\.pdf$
>>>> http://folk.uio.no/.*\.pdf$
>>>>
>>>> Erlend
>>>>
>>>> --
>>>> Erlend Garåsen
>>>> Center for Information Technology Services
>>>> University of Oslo
>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>>>> 31050
>>>>
>>
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>
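
For illustration only, the difference between the current fetch-time filtering and the proposed post-extraction filtering can be sketched in a few lines of Python. This is not the web connector's code. The in-memory link graph PAGES, the function crawl, and the parameters fetch_include and index_include are all hypothetical names made up for this sketch, and the seed is assumed to be subject to the same include patterns as every other URL.

```python
import re
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
PAGES = {
    "http://folk.uio.no/index.html": [
        "http://folk.uio.no/a.pdf",
        "http://folk.uio.no/sub/page.html",
    ],
    "http://folk.uio.no/sub/page.html": [
        "http://folk.uio.no/sub/b.pdf",
        "http://example.org/other.html",
    ],
    "http://folk.uio.no/a.pdf": [],
    "http://folk.uio.no/sub/b.pdf": [],
    "http://example.org/other.html": ["http://example.org/c.pdf"],
    "http://example.org/c.pdf": [],
}


def matches_any(url, patterns):
    """Return True if any regular expression in patterns matches the URL."""
    return any(re.search(p, url) for p in patterns)


def crawl(seeds, fetch_include, index_include=None):
    """Breadth-first crawl over PAGES.

    fetch_include: patterns deciding which URLs are fetched at all. A URL
        that fails this check is never read, so its links are never seen.
    index_include: optional patterns applied after link extraction. They
        decide which fetched documents are kept, without blocking the crawl.
        If None, every fetched document is kept.
    """
    queue = deque(seeds)
    seen = set(seeds)
    indexed = []
    while queue:
        url = queue.popleft()
        if not matches_any(url, fetch_include):
            continue  # blocked before fetching: no links extracted
        if index_include is None or matches_any(url, index_include):
            indexed.append(url)
        for link in PAGES.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(indexed)


seed = ["http://folk.uio.no/index.html"]

# Fetch-time filter only: the HTML seed is rejected, so no PDF is discovered.
pdf_only = crawl(seed, [r"http://folk\.uio\.no/.*\.pdf$"])

# Fetch everything on the site, then keep only PDFs after link extraction.
two_stage = crawl(seed, [r"^http://folk\.uio\.no/"], [r"\.pdf$"])
```

In this toy model the first call returns an empty list, which mirrors the job that ended without finding any documents, while the second call returns the two PDFs on folk.uio.no and still never fetches anything from example.org. The fetch-time pattern keeps the crawl bounded; the post-extraction pattern only decides what is kept.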
