You are right. I had to add a custom plugin, InvalidUrlIndexFilter, which filters out all the invalid URLs while indexing the pages/files. Check out this blog: http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html
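For what it's worth, the core of such a filter is small. Here is a rough sketch, assuming the Nutch 1.0 IndexingFilter API (the NutchDocument-based one). The package name is made up, and the endsWith("/") check is just the criterion that matches the directory-index URLs in this thread, so adjust it to whatever "invalid" means for your crawl:

package org.myorg.nutch.indexer;  // illustrative package name

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class InvalidUrlIndexFilter implements IndexingFilter {

  private Configuration conf;

  // Returning null from filter() tells Nutch to drop the document
  // from the index; returning the document keeps it.
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    if (url == null || url.toString().endsWith("/")) {
      // Directory index page (URL ends in a slash): skip it.
      return null;
    }
    return doc;
  }

  // No extra index backend options are needed for this filter.
  public void addIndexBackendOptions(Configuration conf) {
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

You also need to register the class under the org.apache.nutch.indexer.IndexingFilter extension point in the plugin's plugin.xml and add the plugin id to the plugin.includes property in conf/nutch-site.xml; the wiki example below walks through both steps.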
Just follow the process of creating/adding a new custom plugin: http://wiki.apache.org/nutch/WritingPluginExample-0.9

After adding this plugin, I was able to index the files while skipping the directory index pages. Hope this helps.

On Thu, Apr 29, 2010 at 12:31 PM, arpit khurdiya <arpitkhurd...@gmail.com> wrote:

> I am also facing the same problem.
>
> I thought of developing a plugin that returns null when such a URL is
> encountered, so that the URL won't be indexed.
>
> But I was wondering what criteria to use for deciding which URLs to
> discard.
>
> I hope my approach is correct.
>
> On Thu, Apr 29, 2010 at 9:59 AM, xiao yang <yangxiao9...@gmail.com> wrote:
> > Because it's a URL indeed.
> > You can either filter this kind of URL by configuring
> > crawl-urlfilter.txt (-^.*/$ may help, but I'm not sure about the
> > regular expression) or filter the search result (you need to develop a
> > Nutch plugin).
> > Thanks!
> >
> > Xiao
> >
> > On Thu, Apr 29, 2010 at 4:33 AM, BK <bk4...@gmail.com> wrote:
> >> While indexing files on the local file system, why does Nutch
> >> interpret the directory as a URL, fetching file:/C:/temp/html/ ?
> >> This causes the index page of this directory to show up in search
> >> results. Any solutions for this issue?
> >>
> >> Bharteesh Kulkarni
>
> --
> Regards,
> Arpit Khurdiya
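P.S. On Xiao's crawl-urlfilter.txt suggestion above: a skip rule along these lines should do it, though I have not verified the exact pattern, so test it against your URLs first. Rules in conf/crawl-urlfilter.txt are applied top to bottom and the first match wins:

# skip any URL that ends in a slash, i.e. directory index pages
-^.*/$

# accept everything else
+.

Note that this filters the URL at fetch time, so Nutch never fetches the directory listing at all, and files that are only linked from it will not be discovered. That is why an indexing filter, which drops the page only after it has been crawled, can be the better fit for local file system crawls.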