I'm also facing the same problem. I thought of developing a plugin that returns null when such a URL is encountered, so that the URL won't be indexed.
But I was wondering what the criterion should be for discarding a URL. I hope my approach is correct.

On Thu, Apr 29, 2010 at 9:59 AM, xiao yang <yangxiao9...@gmail.com> wrote:
> Because it's a URL indeed.
> You can either filter this kind of URL by configuring
> crawl-urlfilter.txt (-^.*/$ may helps, but I'm not sure about the
> regular expression) or filter the search result (you need to develop a
> nutch plugin).
> Thanks!
>
> Xiao
>
> On Thu, Apr 29, 2010 at 4:33 AM, BK <bk4...@gmail.com> wrote:
>> While indexing files on local file system, why does NUTCH interpret the
>> directory as a URL - fetching file:/C:/temp/html/
>> This causes the index page of this directory to show up on search results.
>> Any solutions for this issue??
>>
>>
>> Bharteesh Kulkarni
>

--
Regards,
Arpit Khurdiya
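
One possible criterion, taken from the -^.*/$ pattern mentioned in this thread, is simply "the URL ends with a slash". Below is a minimal sketch of that check in plain Java. The class name DirectoryUrlCheck and its filter method are hypothetical and not part of Nutch; they only mirror the "return null to discard" idea, and the check would still need to be wired into a real plugin. Applying it at index time rather than in crawl-urlfilter.txt should, as far as I can tell, still let the crawler fetch the directory listing and discover the files inside it.

```java
// Hypothetical helper, not a Nutch class: it only illustrates the criterion.
final class DirectoryUrlCheck {

    private DirectoryUrlCheck() {
        // static helper, no instances needed
    }

    /**
     * Returns the URL unchanged if it should be kept, or null if it should
     * be discarded. Assumption: a URL ending in "/" is a directory listing,
     * which is the same test as the regular expression ^.*/$ from the thread.
     */
    static String filter(String url) {
        if (url == null || url.endsWith("/")) {
            return null; // discard: missing URL or directory-style URL
        }
        return url; // keep: looks like a regular file
    }
}
```

With this check, file:/C:/temp/html/ is discarded, while a file under that directory such as file:/C:/temp/html/index.html is kept.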