Hi Otis,

On Sat, Apr 26, 2008 at 2:09 AM, <[EMAIL PROTECTED]> wrote:
> Hello,
>
> How are people dealing with avoiding page duplication where URLs are
> similar but the content is identical? I know there is page content
> fingerprinting and shingling (MD5Signature and TextProfileSignature), but
> that assumes you already fetched the content. I am wondering if it's
> possible to detect things earlier than that, even if it's not 100% reliable.
>
> Concretely, imagine the following URLs:
> http://example.com
> http://www.example.com
> http://www1.example.com
> http://www2.example.com
>
> They are all, very likely, pointing to the same page. One person may link
> to www.example.com and another person may link to just example.com, so we
> parse 2 different URLs when ideally we'd want just a single URL for each
> page. Similarly, the example site may have multiple web servers (e.g. for
> load balancing), each with a slightly different name (e.g. www1..., www2...)
> pointing to the same site.
>
> What's the best way to treat www1 and www2 as just www?
> Are people using regex-normalize.xml for that?

I remember a similar discussion a while ago. IIRC, Andrzej suggested that we
could use a form of host analysis to detect when different hosts are mirrors
of each other. That is, we run an extra job that checks whether two different
hosts serve a high enough number of identical pages (identical meaning the
same fingerprint) under the same path. If they do, we choose the overall
higher-scoring host (or, if the two hosts are under the same domain, the
shorter name or the one starting with www., etc.) as the representative host
and merge all URLs from the other hosts onto it.
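Roughly, the job could look something like the sketch below. This is not
actual Nutch code, just an in-memory illustration of the idea; the class
name, the minSharedPages threshold and the "shorter host wins" tie-break are
made up here, and in practice this would be a MapReduce job over the
segments/CrawlDb, picking the representative host by score as described
above.

import java.util.*;

/**
 * Sketch only: group fetched pages by (path, fingerprint), count how many
 * such pages each pair of hosts has in common, and map every host in a
 * mirror pair to a single representative host.
 */
public class MirrorHostDetector {

  /** Very rough host/path split; assumes plain http(s) URLs. */
  private static String[] hostAndPath(String url) {
    String rest = url.replaceFirst("^https?://", "");
    int slash = rest.indexOf('/');
    String host = slash < 0 ? rest : rest.substring(0, slash);
    String path = slash < 0 ? "/" : rest.substring(slash);
    return new String[] { host.toLowerCase(Locale.ROOT), path };
  }

  /**
   * pages maps url -> content fingerprint (e.g. an MD5Signature value).
   * Returns a map from a host to the representative host it should be
   * merged into.
   */
  public static Map<String, String> detectMirrors(Map<String, String> pages,
                                                  int minSharedPages) {
    // Hosts that serve the same fingerprint under the same path are
    // mirror candidates.
    Map<String, Set<String>> hostsByPage = new HashMap<>();
    for (Map.Entry<String, String> e : pages.entrySet()) {
      String[] hp = hostAndPath(e.getKey());
      hostsByPage.computeIfAbsent(hp[1] + "\t" + e.getValue(),
          k -> new TreeSet<>()).add(hp[0]);
    }

    // Count shared (path, fingerprint) pages for every host pair.
    Map<String, Integer> shared = new HashMap<>();
    for (Set<String> hosts : hostsByPage.values()) {
      List<String> list = new ArrayList<>(hosts);
      for (int i = 0; i < list.size(); i++)
        for (int j = i + 1; j < list.size(); j++)
          shared.merge(list.get(i) + " " + list.get(j), 1, Integer::sum);
    }

    // Pairs above the threshold are treated as mirrors; here the shorter
    // host name wins, as a stand-in for the score-based choice.
    Map<String, String> representative = new HashMap<>();
    for (Map.Entry<String, Integer> e : shared.entrySet()) {
      if (e.getValue() < minSharedPages) continue;
      String[] pair = e.getKey().split(" ");
      String keep = pair[0].length() <= pair[1].length() ? pair[0] : pair[1];
      String drop = keep.equals(pair[0]) ? pair[1] : pair[0];
      representative.put(drop, keep);
    }
    return representative;
  }
}

The resulting host -> representative-host map could then be applied when
merging the CrawlDb, or turned into URL normalization rules, so that all URLs
from the mirrored hosts collapse onto one host.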
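As for regex-normalize.xml: for the wwwN case specifically, a rule for the
regex URL normalizer along these lines should do it (only a sketch, and it
blindly assumes that every wwwN host really is a mirror of www, so it is not
something to enable globally without checking):

<?xml version="1.0"?>
<regex-normalize>
  <!-- Fold www1.example.com, www2.example.com, ... into www.example.com.
       Only safe if the numbered hosts serve the same content. -->
  <regex>
    <pattern>://www\d+\.</pattern>
    <substitution>://www.</substitution>
  </regex>
</regex-normalize>

The example.com vs. www.example.com case is harder to handle with a blanket
rule, since some sites serve different content on the bare domain; that is
where the host analysis above would help.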
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

--
Doğacan Güney