Thanks, Lyndon, for your quick reply.

The URLs mentioned above, i.e. http://servername/mac and http://servername/mac/index.html, represent the same resource, but Nutch will index both of them, i.e. it will index the same page twice. So I want to convert the first form of the URL to the second form, i.e. append /index.html.

I am really bad at regular expressions, although I am trying to write one myself. Can anyone send me a regular expression that appends /index.html to any URL that does not end with an extension like .jsp or .html?

Regards,
Raj

On Tue, Apr 22, 2008 at 10:50 AM, Lyndon Maydwell <[EMAIL PROTECTED]> wrote:
> regex-normalize.xml
>
> This allows you to transform URLs based on regular expressions.
>
> So you could make one appear to be the other, or vice versa, or both
> appear to be a third.
>
> Rules are written like so:
>
> <regex-normalize>
>   <regex>
>     <pattern>(https?://)www\.(.*)</pattern>
>     <substitution>$1$2</substitution>
>   </regex>
>   ...
>
> This example removes the "www." prefix from URLs.
>
> On 4/22/08, Raj Malhotra <[EMAIL PROTECTED]> wrote:
> > Hi
> >
> > I have two URLs: 1) http://servername/mac and 2)
> > http://servername/mac/index.html. Is it possible to tell Nutch, through
> > configuration, that these two URLs are the same? If anybody knows how to
> > tackle this, please explain how to do it.
> >
> > Regards,
> >
> > Raj
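
For the case asked about at the top of this message, a rule along the following lines might work. It is only a sketch and has not been tested against a live crawl. It assumes that the regex URL normalizer is enabled, that the server really does serve index.html as the directory index, and that "no extension" can be approximated as "the last path segment contains no dot". The pattern is my own suggestion and does not come from the Nutch distribution.

```xml
<regex-normalize>
  <!-- Hypothetical rule: when the last path segment contains no dot
       (so no extension such as .jsp or .html), drop any trailing slash
       and append /index.html. -->
  <regex>
    <pattern>^(https?://[^?#]*/[^/.?#]+)/?$</pattern>
    <substitution>$1/index.html</substitution>
  </regex>
</regex-normalize>
```

With this pattern, http://servername/mac and http://servername/mac/ would both be rewritten to http://servername/mac/index.html, while http://servername/mac/index.html does not match and is left alone. URLs with a query string or fragment, and URLs with no path at all, are also left alone. One limitation: a directory whose name contains a dot (for example /v1.2) would be treated as if it had an extension and skipped.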
