An image search component for Nutch would be nice.
However, I think we need to implement it as a separate tool outside the Nutch code itself, since it cannot be fully integrated into the Nutch code.
(E.g., Nutch defines one URL == one index document.)
Maybe this would be a nice project for a Nutch sandbox.
If you like, you can open an issue to request a Nutch sandbox project "image search". If enough people vote for the issue, we may have a chance of getting it created.

Stefan

On 03.06.2006 at 10:38, TDLN wrote:

I am interested in developing such a solution as well.

I am currently storing the thumbnails on the file system under a
system-generated name. My indexing plugin stores the filename in the
index. Thumbnails are later served to the client by a separate Apache
HTTP server. This required some changes but is otherwise pretty
straightforward and performs very well for my current 300,000+
images, around 15 KB each.
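
As a rough illustration of the file-system approach described above, here is a minimal sketch of how a generated thumbnail name and path could be derived. The message does not say how the names are actually generated; hashing the image URL with MD5 and spreading files over two directory levels are assumptions made for this example, and `ThumbnailPaths`, `generatedName`, and `pathFor` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: maps an image URL to a thumbnail path on disk. */
final class ThumbnailPaths {
    private ThumbnailPaths() {}

    /** Hex-encoded MD5 of the URL, used as the generated file name (assumption). */
    static String generatedName(String imageUrl) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(imageUrl.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** root/ab/cd/abcd....jpg: two directory levels keep each directory small. */
    static Path pathFor(Path root, String imageUrl) {
        String name = generatedName(imageUrl);
        return root.resolve(name.substring(0, 2))
                   .resolve(name.substring(2, 4))
                   .resolve(name + ".jpg");
    }
}
```

The generated name is what an indexing plugin could store in the index, so that a separate HTTP server can serve the file directly from the resulting path.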

If you are developing the more "Nutch-like" solution, I could
contribute to that. For instance, I have some code that generates the
thumbs using ImageJ and yields very good results.
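
The ImageJ-based code mentioned above is not shown in this thread. As a stand-in, here is a minimal sketch of thumbnail scaling that uses only the standard `java.awt` classes instead of ImageJ; the class name `Thumbnailer`, the method `scaleToFit`, and the choice of bilinear interpolation are assumptions made for this example.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Hypothetical helper: scales an image down to thumbnail size. */
final class Thumbnailer {
    private Thumbnailer() {}

    /** Scales src to fit inside maxSide x maxSide, keeping the aspect ratio. */
    static BufferedImage scaleToFit(BufferedImage src, int maxSide) {
        int longest = Math.max(src.getWidth(), src.getHeight());
        // Never enlarge: images that already fit keep their size.
        double factor = Math.min(1.0, (double) maxSide / longest);
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));

        BufferedImage thumb = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = thumb.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                               RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return thumb;
    }
}
```

The result could then be written out with `javax.imageio.ImageIO.write(thumb, "jpg", file)`. Single-step bilinear scaling may look worse than what ImageJ produces for large reductions.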

But I would definitely need some guidance in writing the Hadoop
MapReduce job. We could even contribute this back and base a small
tutorial on this work.

What do you think?

Rgrds, Thomas

On 6/2/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
Hi,
Searching for "http" is a bad idea, since you get many but not all pages.
Just write a Hadoop MapReduce job that processes the fetched content
in your segments; that should be easy.
Storing images in a file system will be very slow as soon as you have
too many.
I personally don't like databases, since compared to Nutch they are as
slow as a snail.
For another project, also related to images, I created my own
ImageWritable that contained the binary data of a compressed image
together with some metadata.
If you use a MapFile, finding an image based on a key should be very
fast, I think much faster than a database with binary content.
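
The ImageWritable mentioned above is not shown in this thread. The following is only a sketch of what such a record could look like. It mirrors the `write(DataOutput)` / `readFields(DataInput)` method pair of Hadoop's `Writable` interface but uses plain `java.io` so that it stands alone; the class name `ImageRecord`, its fields, and the `toBytes` / `fromBytes` helpers are all assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical record: compressed image bytes plus a little metadata. */
final class ImageRecord {
    String sourceUrl = "";            // assumed metadata field
    int width;                        // assumed metadata field
    int height;                       // assumed metadata field
    byte[] compressed = new byte[0];  // e.g. JPEG-encoded thumbnail bytes

    /** Same shape as Writable.write: serialize all fields in a fixed order. */
    void write(DataOutput out) throws IOException {
        out.writeUTF(sourceUrl);
        out.writeInt(width);
        out.writeInt(height);
        out.writeInt(compressed.length);
        out.write(compressed);
    }

    /** Same shape as Writable.readFields: read fields back in the same order. */
    void readFields(DataInput in) throws IOException {
        sourceUrl = in.readUTF();
        width = in.readInt();
        height = in.readInt();
        compressed = new byte[in.readInt()];
        in.readFully(compressed);
    }

    static byte[] toBytes(ImageRecord record) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            record.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static ImageRecord fromBytes(byte[] bytes) {
        try {
            ImageRecord record = new ImageRecord();
            record.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real Hadoop job the class would presumably declare `implements org.apache.hadoop.io.Writable` and be stored as the value in a MapFile, with something like the image URL as the sorted key; that key choice is also an assumption.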

HTH
Stefan




On 02.06.2006 at 21:10, Marco Pereira wrote:

> Hi Everybody,
>
> I've got Nutch to index images, searching on their URL and their alt
> and title tags.
> But the problem comes when storing the thumbnails.
> I've indexed 3 million images for a national search engine.
> I am unsure whether to use a file system scheme or a database to
> store the thumbnails.
> The thumbnails are created by a script that gets the image URLs from
> the Nutch index by doing a search for http (search.jsp?query=http).
>
> Do you have any tips, ideas on this?
>
> Thank you,
> Marco

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com




