I found that the data I'm searching contains a lot of duplicated content. The only difference is the URL, e.g., one says http://localhost/sample.html and the other http://localhost/sample2.html. However, sample.html and sample2.html are different files; there is no redirection or linking or anything like that involved. They are two different pages copied on different dates but with exactly the same content.
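One way to detect these duplicates is to fingerprint the page content (e.g., with a SHA-256 digest) and keep a map from fingerprint to the URLs that carried it; only the first URL per fingerprint needs to be indexed. This is a minimal sketch of that idea in plain Java; the class and method names (`ContentDedup`, `fingerprint`, `add`) are my own, not Lucene APIs. Inside Lucene you could, I believe, store the digest as an untokenized keyword field on each document and check for it with a term lookup before adding a new document.

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: collapse identical pages fetched under different URLs into
// one "content" entry keyed by a digest of the page body.
public class ContentDedup {
    // content fingerprint -> every URL that carried that exact content
    private final Map<String, List<String>> byContent = new HashMap<>();

    // Hex-encoded SHA-256 of the page content.
    static String fingerprint(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(content.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** Records the URL; returns true only if this content is new
     *  and therefore still needs to be indexed. */
    public boolean add(String url, String content) {
        List<String> urls = byContent.computeIfAbsent(
                fingerprint(content), k -> new ArrayList<>());
        urls.add(url);
        return urls.size() == 1; // first copy wins, later copies are skipped
    }
}
```

With this in front of the indexer, sample.html would be indexed and sample2.html (same bytes) would only be recorded as an extra URL for the same content entry.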
I think this accounts for something like 20% of the cases, so it seems worthwhile to avoid indexing all of it. I'm thinking of building a pair of databases, link/location + content: one holds the list of links/URLs and the other only the content, giving me a star structure around each piece of content... But I wonder if there is a smarter way to do this in the current Lucene 1.4 codebase...

--
Mario Alejandro Montoya
http://sourceforge.net/projects/mutis
MUTIS: The open source Delphi search engine
AnyNET: Convert from ANY .NET assembly to Delphi code
