Have there been any further developments on this thread?
Karl

On Tue, Jun 21, 2011 at 6:08 AM, Karl Wright <[email protected]> wrote:
> Sure.  But you've already convinced me we need a new feature. ;-)
>
> Karl
>
> On Tue, Jun 21, 2011 at 3:50 AM, Erlend Garåsen <[email protected]> wrote:
>>
>> Sure, I can create a ticket. But first I want to discuss this issue with the
>> two search consultants we have hired.
>>
>> I decided to post to the dev list in order to get some feedback on this
>> issue.
>>
>> Erlend
>>
>> On 20.06.11 18.00, Karl Wright wrote:
>>>
>>> Hi Erlend,
>>>
>>> The inclusions and exclusions are based solely on URL, and block the
>>> connector from fetching the file.  Otherwise you would easily wind up
>>> fetching the entire web.
>>>
>>> However, this raises an interesting issue as to whether there's a way
>>> in the web connector to do what you are trying to do, which is to
>>> filter based on URL after links have been extracted.  The current
>>> inclusions/exclusions work fine for any URLs without links but do not
>>> allow for the case you are looking for.
>>>
>>> Can you create a ticket?  The suggestion would be to introduce
>>> post-extraction inclusions and exclusions into the connector.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Jun 20, 2011 at 10:53 AM, Erlend Garåsen <[email protected]> wrote:
>>>>
>>>> I just realized that if I exclude HTML files for a job, links in these
>>>> files will not be followed. Is this the desired behaviour? Should links
>>>> be followed regardless of the exclude filter?
>>>>
>>>> I discovered this issue when I tried to crawl only PDFs and realized
>>>> that the job ended without finding any documents at all. I think I had
>>>> something like this in my include list:
>>>> http://foreninger.uio.no/.*\.pdf$
>>>> http://folk.uio.no/.*\.pdf$
>>>>
>>>> Erlend
>>>>
>>>> --
>>>> Erlend Garåsen
>>>> Center for Information Technology Services
>>>> University of Oslo
>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>>
>>
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>
>
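
For reference, the distinction discussed in the quoted messages (filtering on URL before a document is fetched, versus filtering only after its links have been extracted) can be sketched roughly as follows. This is a toy illustration, not the web connector's actual code: the in-memory PAGES map, the crawl() function, and the separate fetch_patterns/index_patterns parameters are all hypothetical names made up for the example, and it assumes that seed URLs are subject to the same filter as discovered links.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> outgoing links (stands in for real fetching).
PAGES = {
    "http://folk.uio.no/index.html": [
        "http://folk.uio.no/a.pdf",
        "http://folk.uio.no/sub/page.html",
    ],
    "http://folk.uio.no/sub/page.html": [
        "http://folk.uio.no/sub/b.pdf",
        "http://example.org/",
    ],
    "http://folk.uio.no/a.pdf": [],
    "http://folk.uio.no/sub/b.pdf": [],
    "http://example.org/": ["http://example.org/c.pdf"],
    "http://example.org/c.pdf": [],
}

def matches(url, patterns):
    """True if the URL matches at least one regular expression."""
    return any(re.search(p, url) for p in patterns)

def crawl(seeds, fetch_patterns, index_patterns=None):
    """Breadth-first crawl over PAGES.

    fetch_patterns decide which URLs are fetched, and therefore which
    documents have their links extracted. index_patterns (assumed
    post-extraction filter) decide which fetched documents are kept.
    With index_patterns=None, one filter does both jobs.
    """
    if index_patterns is None:
        index_patterns = fetch_patterns
    queue = deque(seeds)
    seen = set(seeds)
    indexed = []
    while queue:
        url = queue.popleft()
        if not matches(url, fetch_patterns):
            continue  # blocked before fetching, so its links are never seen
        if matches(url, index_patterns):
            indexed.append(url)
        for link in PAGES.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed

SEEDS = ["http://folk.uio.no/index.html"]
PDF_ONLY = [r"http://folk.uio.no/.*\.pdf$"]  # include pattern from the thread
SITE = [r"^http://folk\.uio\.no/"]           # assumed crawl scope

# Single URL filter: the HTML seed is blocked, so no PDF is ever discovered.
print(crawl(SEEDS, PDF_ONLY))        # []
# Separate post-extraction filter: HTML is fetched for links, only PDFs are kept.
print(crawl(SEEDS, SITE, PDF_ONLY))
# ['http://folk.uio.no/a.pdf', 'http://folk.uio.no/sub/b.pdf']
```

Note that the second call still needs a fetch-time scope (SITE here); without one, the crawl would follow links off the site, which is the "fetching the entire web" problem mentioned above.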
