Having an image search component for Nutch would be nice.
However, I think we need to implement this as a kind of separate tool
outside of the Nutch code itself, since it cannot be fully integrated
into the Nutch code.
(E.g., Nutch defines one URL == one index document.)
Maybe this would be a nice project for a Nutch sandbox.
If you like, you can open an issue to request a Nutch sandbox project
"image search".
If we get enough people to vote for this issue, we may have a chance to
get it created.

Stefan

On 03.06.2006 at 10:38, TDLN wrote:

> I am interested in developing such a solution as well.
>
> I am currently storing the thumbnails on the file system under a
> system-generated name. My indexing plugin stores the filename in the
> index. Thumbnails are later served to the client by a separate Apache
> HTTP server. This required some changes but is otherwise pretty
> straightforward and performs very well for my current 300,000+
> images, around 15 KB each.
>
> If you are developing the more "Nutch-like" solution, I could
> contribute to that. For instance, I have some code that generates the
> thumbs using ImageJ, which yields very good results.
>
> But I would definitely need some guidance in writing the Hadoop
> MapReduce job. We could even contribute this back and base a small
> tutorial on this work.
>
> What do you think?
>
> Rgrds, Thomas
>
> On 6/2/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>> Hi,
>> searching for "http" is a bad idea, since you get many but not all
>> pages.
>> Just write a Hadoop MapReduce job that processes the fetched content
>> in your segments; that should be easy.
>> Storing images in a file system will be very slow as soon as you have
>> too many.
>> Personally, I don't like databases, since compared to Nutch they are
>> as slow as a snail.
>> For another project, also related to images, I created my own
>> ImageWritable that contained the binary data of a compressed image
>> combined with some metadata.
>> If you use a MapFile, finding an image based on a key should be very
>> fast, I think much faster than a database with binary content.
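
A minimal sketch of the ImageWritable idea described above, under stated assumptions: the original class is not shown in this thread, so the name ImageRecord and its fields (url, mimeType, data) are hypothetical. To stay self-contained it does not depend on Hadoop; write() and readFields() only mirror the method shape of Hadoop's Writable interface, and a TreeMap stands in for the sorted key lookup that a MapFile would provide on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.TreeMap;

// Hypothetical stand-in for the "ImageWritable" mentioned in the thread.
// In a real job this class would implement org.apache.hadoop.io.Writable.
class ImageRecord {
    private String url = "";
    private String mimeType = "";
    private byte[] data = new byte[0];

    ImageRecord() {}

    ImageRecord(String url, String mimeType, byte[] data) {
        this.url = url;
        this.mimeType = mimeType;
        this.data = data;
    }

    // Serialize the metadata first, then the length-prefixed image bytes.
    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeUTF(mimeType);
        out.writeInt(data.length);
        out.write(data);
    }

    // Read the fields back in exactly the order write() emitted them.
    void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        mimeType = in.readUTF();
        data = new byte[in.readInt()];
        in.readFully(data);
    }

    byte[] toBytes() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        write(new DataOutputStream(buffer));
        return buffer.toByteArray();
    }

    static ImageRecord fromBytes(byte[] bytes) throws IOException {
        ImageRecord record = new ImageRecord();
        record.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        return record;
    }

    String getUrl() { return url; }
    String getMimeType() { return mimeType; }
    byte[] getData() { return data; }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in (assumption) for a MapFile: sorted keys, lookup by key.
        TreeMap<String, byte[]> store = new TreeMap<>();
        ImageRecord thumb = new ImageRecord(
                "http://example.com/a.jpg", "image/jpeg", new byte[] {1, 2, 3});
        store.put(thumb.getUrl(), thumb.toBytes());

        ImageRecord found = ImageRecord.fromBytes(store.get("http://example.com/a.jpg"));
        System.out.println(found.getMimeType() + " " + found.getData().length + " bytes");
    }
}
```

Keeping the metadata and the compressed bytes in one value means a single keyed lookup returns everything needed to serve a thumbnail; whether the URL is the right key for a given setup is left open here.
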
>>
>> HTH
>> Stefan
>>
>>
>>
>>
>> On 02.06.2006 at 21:10, Marco Pereira wrote:
>>
>> > Hi Everybody,
>> >
>> > I've got Nutch to index images, searching their URL, alt, and title
>> > tags.
>> > But the problem comes when storing the thumbnails.
>> > I've indexed 3 million images for a national search engine.
>> > I am unsure whether to use a file system scheme or a database to
>> > store the thumbnails.
>> > The thumbnails are created with a script that gets the image URLs
>> > from the Nutch index by doing a search for http
>> > (search.jsp?query=http).
>> >
>> > Do you have any tips, ideas on this?
>> >
>> > Thank you,
>> > Marco
>>
>> ---------------------------------------------
>> blog: http://www.find23.org
>> company: http://www.media-style.com
>>
>>
>>
>



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
