Hi,
Using a search for "http" is a bad idea, since you get many but not all pages.
Just write a Hadoop MapReduce job that processes the fetched content
in your segments; that should be easy.
Storing images in a file system will be very slow as soon as you have
too many.
I personally don't like databases, since compared to Nutch they are as slow
as a snail.
For another project, also related to images, I created my own
ImageWritable that contained the binary data of a compressed image
combined with some metadata.
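A rough sketch of what such a class could look like follows. It is not the original class: the field names (url, width, height) and the toBytes/fromBytes helpers are assumptions made up for illustration. To keep it runnable without Hadoop on the classpath it does not declare the interface, but write() and readFields() have the signatures that org.apache.hadoop.io.Writable requires, so in a real job you would only add "implements Writable".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: compressed image bytes plus a little metadata.
// In a Hadoop job this class would declare "implements Writable".
class ImageWritable {
    private String url = "";            // assumed metadata: source URL of the image
    private int width;                  // assumed metadata: thumbnail width in pixels
    private int height;                 // assumed metadata: thumbnail height in pixels
    private byte[] data = new byte[0];  // compressed image bytes (e.g. a JPEG)

    // A no-argument constructor is needed so an empty instance can be
    // created first and then filled by readFields().
    ImageWritable() {}

    ImageWritable(String url, int width, int height, byte[] data) {
        this.url = url;
        this.width = width;
        this.height = height;
        this.data = data;
    }

    // Serialize all fields; the byte array is prefixed with its length.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);   // note: writeUTF is limited to 65535 encoded bytes
        out.writeInt(width);
        out.writeInt(height);
        out.writeInt(data.length);
        out.write(data);
    }

    // Read the fields back in exactly the order write() emitted them.
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        width = in.readInt();
        height = in.readInt();
        data = new byte[in.readInt()];
        in.readFully(data);
    }

    String getUrl() { return url; }
    int getWidth() { return width; }
    int getHeight() { return height; }
    byte[] getData() { return data; }

    // Illustration-only helper: serialize to a byte array.
    static byte[] toBytes(ImageWritable w) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            w.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Illustration-only helper: deserialize from a byte array.
    static ImageWritable fromBytes(byte[] bytes) {
        try {
            ImageWritable w = new ImageWritable();
            w.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return w;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the image compressed and storing only a few small metadata fields next to it keeps each record compact, which matters once you have millions of them.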
If you use a MapFile, finding an image based on a key should be very
fast. I think it is much faster than a database with binary content.
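To show why a key lookup can be fast, here is a small sketch of the general idea behind a MapFile: the entries are kept sorted by key, and a much smaller index holding every Nth key is kept in memory. This is not Hadoop code and does not use the MapFile API; the class name SparseIndexLookup and its in-memory arrays (standing in for the data file on disk) are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a sorted data file with a sparse index.
class SparseIndexLookup {
    private final String[] keys;    // all keys, sorted; stands in for the data file
    private final byte[][] values;  // values[i] belongs to keys[i]
    private final int interval;     // every interval-th key goes into the index
    private final List<String> indexKeys = new ArrayList<>();
    private final List<Integer> indexPositions = new ArrayList<>();

    SparseIndexLookup(String[] sortedKeys, byte[][] values, int interval) {
        this.keys = sortedKeys;
        this.values = values;
        this.interval = interval;
        // Build the small in-memory index from every interval-th entry.
        for (int i = 0; i < sortedKeys.length; i += interval) {
            indexKeys.add(sortedKeys[i]);
            indexPositions.add(i);
        }
    }

    // Returns the value stored under key, or null if the key is absent.
    byte[] get(String key) {
        // Binary search for the last index entry whose key is <= the target.
        int lo = 0;
        int hi = indexKeys.size() - 1;
        int start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(key) <= 0) {
                start = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // target sorts before the first key
        }
        // Scan forward from that position, at most one interval of entries.
        int pos = indexPositions.get(start);
        int end = Math.min(pos + interval, keys.length);
        for (int i = pos; i < end; i++) {
            int cmp = keys[i].compareTo(key);
            if (cmp == 0) {
                return values[i];
            }
            if (cmp > 0) {
                break; // passed the place where the key would be
            }
        }
        return null;
    }
}
```

With the image URL (or a hash of it) as the key and the serialized image as the value, a lookup costs one binary search in memory plus a short sequential read, instead of a query against a database holding binary content.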
HTH
Stefan
On 02.06.2006 at 21:10, Marco Pereira wrote:
Hi Everybody,
I've got Nutch to index images by searching their URLs and their alt and title
tags.
But the problem comes when storing the thumbnails.
I've indexed 3 million images for a national search engine.
I was in doubt whether to use a file system scheme or a database to
store the thumbnails.
The thumbnails are created with a script that gets the image URLs from
the Nutch index by doing a search for http (search.jsp?query=http).
Do you have any tips or ideas on this?
Thank you,
Marco
---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com