Re: Can two different urls be configured as same ?

Richard Cyganiak Tue, 22 Apr 2008 04:56:10 -0700

Raj, as I said, doing it the other way round (adding index.html if the URL doesn't end in .html or .jsp) will often create dead links, while removing index.html is practically always safe. Any specific reason why you would prefer the first approach?


Richard



On 22 Apr 2008, at 12:44, Raj Malhotra wrote:

Hi Richard
Welcome to the community. I got your point, and I think this is a somewhat
simpler approach. Still, if somebody can give the other way round, we can
document it in this post itself.

regards
Raj
On Tue, Apr 22, 2008 at 4:52 PM, Richard Cyganiak <[EMAIL PROTECTED]>
wrote:
Raj,

I didn't test it but this should work:

<regex-normalize>
<regex>
<pattern>(.*)/index\.(html|htm|jsp)</pattern>
<substitution>$1</substitution>
</regex>
</regex-normalize>

It *removes* the index.html part from the end of the URL, which is a
better approach, because *adding* index.html will often result in a dead
URL.

You can add more extensions into the parentheses if required.
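
To see what the rule above does to the two URLs from the original question, here is a minimal sketch in Python. It is only an illustration, not Nutch code: Nutch uses Java regular expressions, so the `$1` in the substitution is written as `\1` here, and the function name `normalize` is made up for this example.

```python
import re

# Pattern copied from the rule above. In the Nutch config the substitution
# is "$1"; Python's re module spells the same backreference r"\1".
PATTERN = re.compile(r"(.*)/index\.(html|htm|jsp)")
SUBSTITUTION = r"\1"


def normalize(url: str) -> str:
    """Apply the rule: drop /index.html, /index.htm or /index.jsp.

    The pattern is not anchored, so it would also match in the middle of
    a URL (for example before a query string). Appending "$" to the
    pattern would restrict it to the very end of the URL.
    """
    return PATTERN.sub(SUBSTITUTION, url)


for url in ("http://servername/mac", "http://servername/mac/index.html"):
    print(url, "->", normalize(url))
```

Both URLs come out as http://servername/mac, so a crawler applying this rule would treat them as one page.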
(As this is my first post to the list, a short introduction: I'm working
at the Digital Enterprise Research Institute in Galway, Ireland. We are
using Nutch to crawl RDF and microformats for an experimental "Web of Data"
search engine called Sindice.)

Best,
Richard



On 22 Apr 2008, at 11:50, Raj Malhotra wrote:
Thanks, Lyndon, for your quick reply. The URLs mentioned above, i.e.
http://servername/mac and http://servername/mac/index.html, represent the
same resource. But Nutch will index both of them, i.e. it will index the
same page twice. So I wanted to convert the first form of the URL to the
second form, i.e. to append /index.html.
I am really bad at regular expressions, although I am trying to create one
myself. Can anyone send me a regular expression that appends /index.html to
a URL if it does not end with an extension like .jsp or .html?
regards
Raj
On Tue, Apr 22, 2008 at 10:50 AM, Lyndon Maydwell <[EMAIL PROTECTED]>
wrote:

regex-normalize.xml
This allows you to transform URLs based on regular expressions.
So you could make one appear to be the other, or vice versa, or both
appear to be a third.

Rules are written like so:

<regex-normalize>
<regex>
<pattern>(https?://)www\.(.*)</pattern>
<substitution>$1$2</substitution>
</regex>
...

This example removes the "www." prefix from URLs.
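
As a quick check of how such a rule behaves, here is a small Python sketch of the same idea. It is an illustration only, not Nutch code, and the function name `strip_www` is made up for this example. The pattern has exactly two capture groups (the scheme and everything after "www."), so the substitution refers to groups 1 and 2; Nutch's Java-style `$1$2` is written as `\1\2` in Python.

```python
import re

# Group 1 captures the scheme ("http://" or "https://"),
# group 2 captures everything after the "www." prefix.
PATTERN = re.compile(r"(https?://)www\.(.*)")
SUBSTITUTION = r"\1\2"  # Java/Nutch equivalent: $1$2


def strip_www(url: str) -> str:
    """Return url without a leading 'www.' in the host name."""
    return PATTERN.sub(SUBSTITUTION, url)


for url in ("http://www.servername/mac", "https://servername/mac"):
    print(url, "->", strip_www(url))
```

A URL without the "www." prefix does not match the pattern and is returned unchanged.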

On 4/22/08, Raj Malhotra <[EMAIL PROTECTED]> wrote:
Hi,
I have two URLs: 1) http://servername/mac and 2)
http://servername/mac/index.html. Is it possible to tell Nutch through
configuration that these two URLs are the same? If anybody knows how to
tackle this, please explain how to do it.

regards

Raj
