Raj,

I didn't test it but this should work:

<regex-normalize>
<regex>
<pattern>(.*)/index\.(html|htm|jsp)$</pattern>
<substitution>$1</substitution>
</regex>
</regex-normalize>

It *removes* the index.html part from the end of the URL, which is a better approach, because *adding* index.html will often result in a dead URL.

You can add more extensions into the parentheses if required.
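
For anyone who wants to try the rule before putting it into regex-normalize.xml, here is a minimal Java sketch. It assumes the normalizer applies each rule roughly the way Matcher.replaceAll from java.util.regex does; the class and method names (IndexNormalizeSketch, normalize) are made up for illustration and are not part of Nutch. The pattern in the sketch is anchored with $ so that only a trailing index page is stripped.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check of the rule; not Nutch code.
public class IndexNormalizeSketch {

    // Same pattern as the <pattern> element, anchored to the end of the URL.
    private static final Pattern INDEX_PAGE =
            Pattern.compile("(.*)/index\\.(html|htm|jsp)$");

    // Applies the substitution "$1", i.e. keeps everything before /index.xxx.
    static String normalize(String url) {
        return INDEX_PAGE.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://servername/mac/index.html",
            "http://servername/mac",
            "http://servername/mac/page.html"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + normalize(url));
        }
    }
}
```

URLs that do not end in one of the listed index pages should pass through unchanged, so both forms of the URL end up as http://servername/mac.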

(As this is my first post to the list, a short introduction: I'm working at the Digital Enterprise Research Institute in Galway, Ireland. We are using Nutch to crawl RDF and microformats for an experimental “Web of Data” search engine called Sindice.)

Best,
Richard


On 22 Apr 2008, at 11:50, Raj Malhotra wrote:
Thanks, Lyndon, for your quick reply. The URLs mentioned above, i.e.
http://servername/mac and http://servername/mac/index.html, represent the same resource. But Nutch will index both of them, i.e. it will index the same page twice. So I wanted to convert the first form of the URL to the second form, i.e. to
append /index.html.
I am really bad at regular expressions. I am trying to write one myself, but can anyone send me a regular expression that appends /index.html to a URL if it does not end with an extension like .jsp or .html?
regards
Raj
On Tue, Apr 22, 2008 at 10:50 AM, Lyndon Maydwell <[EMAIL PROTECTED]>
wrote:

regex-normalize.xml

This allows you to transform URLs based on regular expressions.

So you could make one appear to be the other, or vice versa, or both
appear to be a third.

Rules are written like so:

<regex-normalize>
<regex>
<pattern>(https?://)www\.(.*)</pattern>
<substitution>$1$2</substitution>
</regex>
...

This example removes the www. prefix from URLs.
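
A quick way to check a rule like this outside Nutch is a small Java sketch, assuming the normalizer behaves like Matcher.replaceAll from java.util.regex. The class and method names (WwwStripSketch, stripWww) are hypothetical. The substitution refers to groups 1 and 2 because the pattern has only two capture groups: the scheme and the rest of the URL after www.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check of the www-removal rule; not Nutch code.
public class WwwStripSketch {

    // Group 1 is the scheme (http:// or https://), group 2 is everything after "www.".
    private static final Pattern WWW_PREFIX =
            Pattern.compile("(https?://)www\\.(.*)");

    // Rebuilds the URL from the scheme and the remainder, dropping "www.".
    static String stripWww(String url) {
        return WWW_PREFIX.matcher(url).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(stripWww("http://www.example.com/a"));
        System.out.println(stripWww("https://example.com/a"));
    }
}
```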

On 4/22/08, Raj Malhotra <[EMAIL PROTECTED]> wrote:
Hi,
I have two URLs: 1) http://servername/mac and 2)
http://servername/mac/index.html. Is it possible to tell Nutch through
configuration that these two URLs are the same? If anybody knows how to
tackle this, please explain how to do it.

regards

Raj

