There is a URL normalizing feature in

        conf/regex-normalize.xml

For example, I used the following patterns on a JasperSoft forum to say that pages with &limit and &limitstart are the same as the page without them:

<!-- Jasper normalize jaspersoft URLs:
     (1) &limit=6&limitstart=0 means the same as the page w/o any limit
     (2) catid=10&id=NNN means the same as id=NNN&catid=10  -->
<regex-normalize>
  <regex>
    <pattern>(\?|\&amp;|\&amp;amp;)limit=6(\&amp;|\&amp;amp;)limitstart=0$</pattern>
    <substitution></substitution>
  </regex>
  <regex>
    <pattern>(\?|\&amp;|\&amp;amp;)(id=[0-9]+)(\&amp;|\&amp;amp;)(catid=10)(.*)</pattern>
    <substitution>$1$4$3$2$5</substitution>
  </regex>
</regex-normalize>
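
The same mechanism could cover the www.xyz.de example below. Purely as a sketch, and assuming those three URL forms really do return the same page on that particular site (the host and paths are just taken from the question), two extra rules inside <regex-normalize> might look like:

<!-- Sketch: site-specific rules for the www.xyz.de example
     (assumes /information, /information/ and /information/index
      all return the same page on that host):
     (1) drop a trailing /index
     (2) drop a trailing slash  -->
<regex>
  <pattern>^(http://www\.xyz\.de/.+)/index$</pattern>
  <substitution>$1</substitution>
</regex>
<regex>
  <pattern>^(http://www\.xyz\.de/.+)/$</pattern>
  <substitution>$1</substitution>
</regex>

Both rules would collapse the three variants to www.xyz.de/information; as the note below points out, this only helps where you know such a rewrite is safe for that particular site.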


On Oct 8, 2008, at 4:23 AM, Detlef Müller-Solger wrote:

Hi,

in Germany it is reported that one big show stopper for Nutch is the fact that identical web pages can often be addressed by different URLs. For example, by requesting

www.xyz.de/information
or by
www.xyz.de/information/
or by
www.xyz.de/information/index

From my point of view, because of the different URLs Nutch unfortunately indexes such a page three times. Is there a method to avoid indexing these duplicates? For example, by comparing all the information of the web page excluding the URL.

Note: A filter such as "always strip '/index' from the URL" is no solution, because in other cases within the same run "/index" may be needed, or the same web page can also be addressed by yet another URL syntax.

Thanx

Detlef Müller-Solger


