My application uses OpenCms functionality that internally converts the first form of the URL into the second form. The problem was that the same page is reachable through a URL of the first form and a URL of the second form from different pages, so I wanted to convert them into one form. The second approach, the one you suggested, is better and I am implementing it. I was not sure whether my manager would agree to it. I think he should, as there is a valid reason. If he does not, I will ask for the first approach.
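For anyone who finds this thread later, here is a minimal Java sketch for trying the index.html removal rule outside Nutch. It assumes the normalizer applies each rule with java.util.regex replace-all semantics, which I have not verified against the Nutch source. The class name IndexUrlNormalizer and the sample URLs other than the two from this thread are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying the rule outside Nutch; not part of Nutch itself.
final class IndexUrlNormalizer {

    // Pattern and substitution copied from the regex-normalize rule quoted in this thread.
    private static final Pattern INDEX_PAGE =
            Pattern.compile("(.*)/index\\.(html|htm|jsp)");
    private static final String SUBSTITUTION = "$1";

    // Assumption: the rule is applied with replace-all semantics.
    static String normalize(String url) {
        return INDEX_PAGE.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://servername/mac",
            "http://servername/mac/index.html",
            "http://servername/mac/index.jsp",
            "http://servername/mac/page.jsp"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + normalize(sample));
        }
    }
}
```

Both http://servername/mac and http://servername/mac/index.html come out as http://servername/mac, so only one form is left to index, while a URL such as page.jsp is left untouched.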

On Tue, Apr 22, 2008 at 5:25 PM, Richard Cyganiak <[EMAIL PROTECTED]> wrote:

> Raj, as I said, doing it the other way round (adding index.html if the URL
> doesn't end in .html or .jsp) will often create dead links, while removing
> index.html is practically always safe. Any specific reason why you would
> prefer the first approach?
>
> Richard
>
> On 22 Apr 2008, at 12:44, Raj Malhotra wrote:
>
> > Hi Richard
> > welcome to the community. I got your point; I think this is a somewhat
> > simpler approach. Still, it would be good if somebody could give the
> > other way round, so that we can document it in this post itself.
> >
> > regards
> > Raj
> >
> > On Tue, Apr 22, 2008 at 4:52 PM, Richard Cyganiak <[EMAIL PROTECTED]> wrote:
> >
> > > Raj,
> > >
> > > I didn't test it but this should work:
> > >
> > > <regex-normalize>
> > >   <regex>
> > >     <pattern>(.*)/index\.(html|htm|jsp)</pattern>
> > >     <substitution>$1</substitution>
> > >   </regex>
> > > </regex-normalize>
> > >
> > > It *removes* the index.html part from the end of the URL, which is a
> > > better approach, because *adding* index.html will often result in a
> > > dead URL.
> > >
> > > You can add more extensions into the parentheses if required.
> > >
> > > (As this is my first post to the list, a short introduction: I'm working
> > > at the Digital Enterprise Research Institute in Galway, Ireland. We are
> > > using Nutch to crawl RDF and microformats for an experimental "Web of
> > > Data" search engine called Sindice.)
> > >
> > > Best,
> > > Richard
> > >
> > > On 22 Apr 2008, at 11:50, Raj Malhotra wrote:
> > >
> > > > Thanks Lyndon for your quick reply. The above mentioned URLs, i.e.
> > > > http://servername/mac and http://servername/mac/index.html, represent
> > > > the same resource. But Nutch will index both of them, i.e. it will
> > > > index the same page two times. So I wanted to convert the first form
> > > > of URL to the second form, i.e. to append /index.html.
> > > > I am really bad at regular expressions, although I am trying to create
> > > > one myself. Can anyone send me a regular expression with which a URL
> > > > that does not end with an extension like .jsp or .html will have
> > > > /index.html appended to it?
> > > >
> > > > regards
> > > > Raj
> > > >
> > > > On Tue, Apr 22, 2008 at 10:50 AM, Lyndon Maydwell <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > regex-normalize.xml
> > > > >
> > > > > This allows you to transform urls based on regular expressions.
> > > > >
> > > > > So you could make one appear to be the other, or vice versa, or both
> > > > > appear to be a third.
> > > > >
> > > > > Rules are written like so:
> > > > >
> > > > > <regex-normalize>
> > > > >   <regex>
> > > > >     <pattern>(https?://)www\.(.*)</pattern>
> > > > >     <substitution>$1$2</substitution>
> > > > >   </regex>
> > > > > ...
> > > > >
> > > > > This example removes (www) from urls.
> > > > >
> > > > > On 4/22/08, Raj Malhotra <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Hi
> > > > > > I have two URLs: 1) http://servername/mac and 2)
> > > > > > http://servername/mac/index.html . Is it possible to tell Nutch,
> > > > > > through configuration, that these two URLs are the same? If
> > > > > > anybody knows how to tackle this, please explain to me how to do it.
> > > > > >
> > > > > > regards
> > > > > >
> > > > > > Raj
