[EMAIL PROTECTED] wrote:
> I am using nutch to crawl & index an intranet consisting of an initial
> fixed set of urls (approx. 3000). For my application I need to reference
> some metadata (stored in a database) for each of the original 3000 urls.
>
> Does nutch assign a unique integer id for each starting url in the
> crawldb? If so, does the API allow me to get it? When a search is
> performed can/is this id returned for each 'hit'?
>   

Nutch uses the full URL as a unique identifier.

If your collection is relatively small (on the order of a few million
docs or less), you can use MD5Hash.digest(url).halfDigest(), which
returns a long value that, with pretty good confidence, should be unique.
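
As a rough illustration of the idea, the sketch below derives a 64-bit ID from a URL using only the JDK. It is not the MD5Hash class itself: the class name UrlHalfDigest, the UTF-8 encoding, and the big-endian packing of the first 8 digest bytes are assumptions chosen for the example, and the sample URL is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: turns a URL into a 64-bit ID
// by taking the first half of its MD5 digest.
final class UrlHalfDigest {

    /** Packs the first 8 bytes of the URL's MD5 digest into a long (big-endian). */
    static long halfDigest(String url) {
        try {
            // Assumption: URLs are encoded as UTF-8 before hashing.
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            long value = 0;
            for (int i = 0; i < 8; i++) {
                value |= (digest[i] & 0xffL) << (8 * (7 - i));
            }
            return value;
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String url = "http://intranet.example.com/docs/index.html"; // hypothetical URL
        System.out.println(url + " -> " + halfDigest(url));
    }
}
```

The same URL always yields the same long, so the value can be stored next to the metadata row in the database and recomputed from the URL of each hit at search time.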

> I want my 'display search results' page to return the nutch results for
> each 'hit' as well as the metadata for the hit url if it is one of the
> original 3000. I'd rather use an integer ID than have to match on the url
> string itself.
>   

Nutch doesn't number the URLs, so you will need to somehow map URLs to
integers. You could do this sequentially, but each time you add or remove
URLs from the crawldb you will get different numbers for the same URLs.
You could also use a perfect hash function which maps String to Integer,
but even in this case there is a small probability that existing
URLs will be re-numbered. The space of int is too small to use random
hashing and hope there are no collisions.
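
To put rough numbers on the int versus long point, the sketch below uses the standard birthday approximation p = 1 - exp(-n(n-1)/(2m)) for n keys hashed uniformly at random into a space of m values. The class and method names are made up for the example, and the uniform-hash assumption is an idealisation.

```java
// Hypothetical helper: estimates the chance of at least one collision
// when n keys are hashed uniformly at random into a space of m values.
final class CollisionEstimate {

    /** Birthday approximation: 1 - exp(-n(n-1) / (2m)). */
    static double collisionProbability(long n, double space) {
        double exponent = -(double) n * (n - 1) / (2.0 * space);
        // expm1 keeps precision when the exponent is very close to zero.
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        double intSpace = Math.pow(2, 32);   // number of distinct int values
        double longSpace = Math.pow(2, 64);  // number of distinct long values
        long[] sizes = {3_000L, 3_000_000L}; // the 3000 seed URLs, and "a few million" docs
        for (long n : sizes) {
            System.out.println(n + " URLs, int:  " + collisionProbability(n, intSpace));
            System.out.println(n + " URLs, long: " + collisionProbability(n, longSpace));
        }
    }
}
```

Under these assumptions, a few million URLs hashed into an int are practically guaranteed to collide, while the same number hashed into a long collide with a probability well below one in a million.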

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
