Mario,

Lucene != web indexer, so Lucene doesn't know anything about files or URLs, 
etc.  It just indexes what it's told.  You should check how Nutch does it, and 
I believe it does it by comparing "fingerprints" of web pages.  Fingerprints 
are MD5 checksums, but I believe the recent changes there allow you to define 
your own mechanism.
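A fingerprint in this sense is just a digest of the page body, so the same content yields the same fingerprint no matter which URL it was fetched from. A minimal sketch in plain Java (this is an illustration, not Nutch's actual signature API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Fingerprint {
    // MD5 fingerprint of page content, as lowercase hex.
    // Identical content yields identical fingerprints regardless of URL.
    public static String of(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is guaranteed to be present in every JDK
            throw new RuntimeException(e);
        }
    }
}
```

Before indexing a page you would look its fingerprint up in a set of already-seen fingerprints and skip the Lucene add if it is a duplicate.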

In any case, this is not really a question for [EMAIL PROTECTED]  nutch-user@ 
may be a better place to ask.

Otis

----- Original Message ----
From: Mario Alejandro M. <[EMAIL PROTECTED]>
To: Lucene Developers List <[email protected]>
Sent: Fri 20 Jan 2006 05:27:01 PM EST
Subject: Indexing Urls pointing to same content

I found that in the data I'm searching I have a lot of duplicated content.
The only difference is that the URL changes, i.e., one says
http://localhost/sample.html and the other http://localhost/sample2.html.
However, sample1 and sample2 are different files; that is, there is no
redirection or linking or anything like that involved. Sample1 and Sample2
are two different pages copied on different dates but with the exact same
content.

I think this accounts for something like 20% of the cases, so I think it is
valuable to avoid indexing all of this. So I'm thinking of building
link/location + content databases: in one I put the list of links/URLs, and
in the other only the content, so I have a star structure around the content...
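The link/location + content split described above amounts to keying content by a fingerprint and letting many URLs point at the same entry. A minimal in-memory sketch, assuming a fingerprint has already been computed for each page (class and method names here are illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dedup store: each content fingerprint maps to the
// list of URLs that share that content.
public class DedupStore {
    private final Map<String, List<String>> urlsByFingerprint = new HashMap<>();

    // Records the URL under its content fingerprint.  Returns true if
    // this content was seen for the first time, i.e. only then should
    // the content itself be added to the Lucene index.
    public boolean add(String url, String fingerprint) {
        List<String> urls = urlsByFingerprint.get(fingerprint);
        boolean firstTime = (urls == null);
        if (firstTime) {
            urls = new ArrayList<>();
            urlsByFingerprint.put(fingerprint, urls);
        }
        urls.add(url);
        return firstTime;
    }

    // All URLs known to carry the content with this fingerprint.
    public List<String> urlsFor(String fingerprint) {
        List<String> urls = urlsByFingerprint.get(fingerprint);
        return urls == null ? Collections.<String>emptyList() : urls;
    }
}
```

With this shape, the content index stores each distinct page body once, and a search hit can be expanded back into every URL that carries it.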

But I'm wondering if there is a smart way to do this in the current Lucene
1.4 codebase....

--
Mario Alejandro Montoya
http://sourceforge.net/projects/mutis
MUTIS: The Open source Delphi search engine
AnyNET: Convert from ANY .NET assembly to Delphi code

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
