Hey,

CRAWL 1:
       url: http://foo.com
       doc id = X
CRAWL 2:
       url: http://foo.com
       doc id = Y
X may or may not be equal to Y; the doc id is not guaranteed to be the same across crawls.

And yes, the segment id is different for different crawls. It is a timestamp value: the time at which the
Generator was executed.

Maybe if you could tell us about your ultimate aim, we might be able to help you more appropriately.




Sagar Vibhute wrote:
Hash value of the url does sound useful. Thanks! :-)

But well, is the segment ID different for every crawl? In which case the
segment ID + Doc Id can become a unique mapping. Trouble is, I don't know
how to extract the doc id of a particular document while it is being
crawled. I found a method which, given a doc Id gives the document, but
that's not what I need, I kinda need the opposite.

Any leads?

- Sagar


On 10/21/07, Sagar Naik <[EMAIL PROTECTED]> wrote:
Hey,
The Lucene document id, an integer, may not be the same for 2 different
crawls.
I am not sure if this is what you are looking for, but you can store a hash
value of the crawled url ;)
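
A minimal sketch of the idea, in Python for brevity (Nutch itself is Java, so
this is only an illustration, not Nutch code). It assumes MD5 as the hash; any
stable hash would do. The function names url_id and fetch_id and the segment id
values are made up for the example.

```python
import hashlib


def url_id(url: str) -> str:
    """Return a hex MD5 digest of the URL, usable as a crawl-independent ID."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def fetch_id(segment_id: str, url: str) -> str:
    """Combine a segment ID with the URL hash to identify one fetch of one URL."""
    return f"{segment_id}:{url_id(url)}"


# Hypothetical segment IDs for two crawls of the same URL.
crawl_1 = fetch_id("20071020101500", "http://foo.com")
crawl_2 = fetch_id("20071021101500", "http://foo.com")

# The URL part is identical across crawls; only the segment part differs.
print(crawl_1)
print(crawl_2)
```

The hash depends only on the url string, so it stays the same across crawls
even when the Lucene doc id changes. Note that "http://foo.com" and
"http://foo.com/" hash differently, so the url should be normalized the same
way before hashing each time.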

- Sagar

Sagar Vibhute wrote:
Hello,

Does nutch/lucene provide for a unique ID for every item that it has
crawled?

I checked the Lucene docid, but from what I understood, the Lucene docid
is not unique for every item crawled. Is that so?

How can I get this unique ID, if it is available?

Thanks.

- Sagar


--
This message has been scanned for viruses and
dangerous content and is believed to be clean.




