Re: Nutch/Lucene unique ID for every item crawled?

Sagar Naik Sun, 21 Oct 2007 08:38:57 -0700

hey

CRAWL 1:
       url: http://foo.com
       doc id = X
CRAWL 2:
       url: http://foo.com
       doc id = Y
X may be equal to Y

And yes, the segment ID is different for different crawls. It is a timestamp
value: the time when the Generator is executed.
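
To illustrate the segment ID + doc ID idea, here is a minimal sketch in plain Java. CrawlKey is a hypothetical helper, not a Nutch or Lucene class, and the yyyyMMddHHmmss segment-name layout is an assumption on my part, so check it against the segment directories your crawl actually produces.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Nutch or Lucene.
final class CrawlKey {
    // Assumed segment-name layout: the generation time as yyyyMMddHHmmss.
    private static final DateTimeFormatter SEGMENT_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    private CrawlKey() {}

    /** Formats the time the Generator ran as a segment name. */
    static String segmentName(LocalDateTime generatedAt) {
        return generatedAt.format(SEGMENT_FORMAT);
    }

    /** Joins the segment name and the doc id into one composite key. */
    static String of(LocalDateTime generatedAt, int docId) {
        return segmentName(generatedAt) + "/" + docId;
    }
}
```

Two crawls generated at different times get different keys even when the doc id happens to be the same in both.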

Maybe if you could tell us about your ultimate aim, we might be able to help you appropriately.





Sagar Vibhute wrote:

Hash value of the url does sound useful. Thanks! :-)

But well, is the segment ID different for every crawl? In which case the
segment ID + Doc Id can become a unique mapping. Trouble is, I don't know
how to extract the doc id of a particular document while it is being
crawled. I found a method which, given a doc Id gives the document, but
that's not what I need, I kinda need the opposite.

Any leads?

- Sagar


On 10/21/07, Sagar Naik <[EMAIL PROTECTED]> wrote:

Hey,
The Lucene document ID, an integer, may not be the same for two different
crawls.
I am not sure if this is what you are looking for, but you can store a hash
value of the crawled URL ;)
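
A minimal sketch of the URL-hash idea, using only the JDK. UrlId is a hypothetical helper name, not something Nutch or Lucene provides, and MD5 is just one possible choice of hash. It assumes the URL string is normalized the same way on every crawl, since two spellings of the same address would hash differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch or Lucene.
final class UrlId {
    private UrlId() {}

    /** Returns the MD5 digest of the URL as a 32-character lowercase hex string. */
    static String of(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so each byte prints as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to supply MD5.
            throw new IllegalStateException(e);
        }
    }
}
```

The same URL gives the same ID in every crawl, so the value could be stored with the document as an extra field.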

- Sagar

Sagar Vibhute wrote:

Hello,

Does Nutch/Lucene provide a unique ID for every item that it has
crawled?

I checked the Lucene docid, but from what I understood, the Lucene docid is
not unique for every item crawled. Is that so?

How can I get this unique ID, if it is available?

Thanks.

- Sagar

--
This message has been scanned for viruses and
dangerous content and is believed to be clean.


