Re: why does nutch interpret directory as URL

arpit khurdiya Thu, 29 Apr 2010 09:31:56 -0700

I am also facing the same problem.

I thought of developing a plugin that returns null when such a URL is
encountered. As a result, that URL won't be
indexed.


But I am still wondering what the criterion should be for deciding
which URLs to discard.

I hope my approach is correct.
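
One possible criterion, as a rough sketch only: treat any URL that ends
in "/" as a directory, which is the same test as the ^.*/$ expression
suggested in the quoted reply below. The class and method names here
(DirectoryUrlFilter, filter) are hypothetical and this is plain Java, not
tied to any Nutch plugin interface; it only shows the "return null to
discard" idea.

```java
import java.util.regex.Pattern;

// Hypothetical standalone helper; not part of Nutch.
public class DirectoryUrlFilter {

    // Assumption: a URL ending in "/" points at a directory.
    // Same expression as the crawl-urlfilter.txt rule, without the leading "-".
    private static final Pattern DIRECTORY_URL = Pattern.compile("^.*/$");

    /** Returns null to discard the URL, or the URL unchanged to keep it. */
    public static String filter(String url) {
        if (url == null || DIRECTORY_URL.matcher(url).matches()) {
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(filter("file:/C:/temp/html/"));           // prints "null"
        System.out.println(filter("file:/C:/temp/html/index.html")); // prints the URL
    }
}
```

If a check like this is applied before fetching, the crawler may never
read the directory listing and so may not discover the files inside it,
so it probably belongs at the indexing stage rather than the fetch stage.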

On Thu, Apr 29, 2010 at 9:59 AM, xiao yang <yangxiao9...@gmail.com> wrote:
> Because it's a URL indeed.
> You can either filter this kind of URL by configuring
> crawl-urlfilter.txt (-^.*/$ may help, but I'm not sure about the
> regular expression) or filter the search results (you would need to
> develop a Nutch plugin).
> Thanks!
>
> Xiao
>
> On Thu, Apr 29, 2010 at 4:33 AM, BK <bk4...@gmail.com> wrote:
>> While indexing files on local file system, why does NUTCH interpret the
>> directory as a URL - fetching file:/C:/temp/html/
>> This causes the index page of this directory to show up on search results.
>> Any solutions for this issue??
>>
>>
>> Bharteesh Kulkarni
>>
>



-- 
Regards,
Arpit Khurdiya
