Thanks, Lyndon, for your quick reply. The URLs mentioned above, i.e.
http://servername/mac and http://servername/mac/index.html, represent the
same resource. But Nutch will index both of them, i.e. it will index the same
page twice. So I want to convert the first form of the URL to the second
form, i.e. append /index.html.
I am really bad at regular expressions, although I am trying to write one
myself. Can anyone send me a regular expression that appends /index.html to
any URL that does not end with an extension such as .jsp or .html?
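A rule of roughly the following shape might be a starting point. It is an
untested sketch in the regex-normalize.xml format quoted below. The pattern
itself is an assumption, not something taken from the Nutch documentation. It
treats "no extension" as "the last path segment contains no dot", and it
leaves alone URLs with a query string or fragment, as well as the bare site
root.

```xml
<regex-normalize>
<!-- Hypothetical rule: append /index.html when the last path segment
     has no dot (no extension such as .jsp or .html). -->
<regex>
  <!-- $1 captures scheme, host and path up to the last segment;
       an optional trailing slash is left outside the group so it is dropped. -->
  <pattern>^(https?://[^/?#]+/(?:[^?#]*/)?[^/.?#]+)/?$</pattern>
  <substitution>$1/index.html</substitution>
</regex>
</regex-normalize>
```

The intent is that http://servername/mac and http://servername/mac/ are both
rewritten to http://servername/mac/index.html, while
http://servername/mac/index.html and URLs ending in .jsp are left unchanged.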
regards
Raj
On Tue, Apr 22, 2008 at 10:50 AM, Lyndon Maydwell <[EMAIL PROTECTED]>
wrote:

> regex-normalize.xml
>
> This allows you to transform urls based on regular expressions.
>
> So you could make one appear to be the other, or vice versa, or both
> appear to be a third.
>
> Rules are written like so:
>
> <regex-normalize>
> <regex>
>  <pattern>(https?://)www\.(.*)</pattern>
>  <substitution>$1$2</substitution>
> </regex>
> ...
>
> This example removes the "www." prefix from URLs.
>
> On 4/22/08, Raj Malhotra <[EMAIL PROTECTED]> wrote:
> > Hi,
> >  I have two URLs: 1) http://servername/mac and 2)
> >  http://servername/mac/index.html. Is it possible to tell Nutch, through
> >  configuration, that these two URLs are the same? If anybody knows how to
> >  handle this, please explain how to do it.
> >
> >  regards
> >
> > Raj
> >
>
