Re: [Nutch-general] Image Search

TDLN Sat, 03 Jun 2006 01:44:22 -0700

BTW: the thumbnails are generated and stored in the
ParseFilter. It is quite easy to pick the URLs of image files out of
the Outlinks using regular expressions. The generated file name is
then added to the metadata, to be retrieved later by the
IndexingFilter. No need for any separate scripts.
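
A rough sketch of the regular-expression step, in plain Java. It works
on URL strings rather than on Nutch's Outlink objects, and the class
name, method names and extension list are illustrative assumptions,
not the actual plugin code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: it only looks at URL strings.
class ImageLinkFilter {

    // Matches http(s) URLs whose path ends in a common image extension,
    // optionally followed by a query string or fragment.
    // The extension list is an assumption; adjust it as needed.
    private static final Pattern IMAGE_URL = Pattern.compile(
            "^https?://[^?#]+\\.(?:jpe?g|png|gif|bmp)(?:[?#].*)?$",
            Pattern.CASE_INSENSITIVE);

    static boolean isImageUrl(String url) {
        return url != null && IMAGE_URL.matcher(url).matches();
    }

    // Keeps only the URLs that look like image files, preserving order.
    static List<String> imageUrls(List<String> outlinkUrls) {
        List<String> result = new ArrayList<>();
        for (String url : outlinkUrls) {
            if (isImageUrl(url)) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
                "http://example.com/index.html",
                "http://example.com/a/photo.JPG",
                "http://example.com/pic.png?size=2");
        for (String url : imageUrls(links)) {
            System.out.println(url);
        }
    }
}
```

In a real ParseFilter the input list would be filled from the URLs of
the page's Outlinks. Matching on the extension alone will miss images
served from URLs without one, so this is a first filter at best.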


Rgrds Thomas

On 6/3/06, TDLN <[EMAIL PROTECTED]> wrote:
> I am interested in developing such a solution as well.
>
> I am currently storing the thumbnails on the file system under a
> system-generated name. My indexing plugin stores the filename in the
> index. Thumbnails are later served to the client by a separate Apache
> HTTP server. This required some changes but is otherwise pretty
> straightforward and performs very well for my current 300,000+
> images, around 15 KB each.
>
> If you are developing the more "Nutch-like" solution, I could
> contribute to that. For instance, I have some code that generates the
> thumbs using ImageJ and yields very good results.
>
> But I would definitely need some guidance in writing the Hadoop
> MapReduce job. We could even contribute this back and base a small
> tutorial on this work.
>
> What do you think?
>
> Rgrds, Thomas
>
> On 6/2/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> > Hi,
> > Using a search for "http" is a bad idea, since you get many but not
> > all pages.
> > Just write a Hadoop MapReduce job that processes the fetched content
> > in your segments; that should be easy.
> > Storing images in a file system will be very slow as soon as you have
> > too many.
> > Personally, I don't like databases, since compared to Nutch they are
> > as slow as a snail.
> > For another project, also related to images, I created my own
> > ImageWritable that contained the binary data of a compressed image
> > combined with some metadata.
> > If you use a MapFile, finding an image based on a key should be very
> > fast. I think much faster than a database with binary content.
> >
> > HTH
> > Stefan
> >
> >
> >
> >
> > On 02.06.2006 at 21:10, Marco Pereira wrote:
> >
> > > Hi Everybody,
> > >
> > > I've got Nutch to index images, searching their URL and their alt
> > > and title tags.
> > > But the problem comes when storing the thumbnails.
> > > I've indexed 3 million images for a national search engine.
> > > I am unsure whether to use a file system scheme or a database to
> > > store the thumbnails.
> > > The thumbnails are created by a script that gets the image URLs from
> > > the Nutch index by doing a search for "http" (search.jsp?query=http).
> > >
> > > Do you have any tips or ideas on this?
> > >
> > > Thank you,
> > > Marco
> >
> > ---------------------------------------------
> > blog: http://www.find23.org
> > company: http://www.media-style.com
> >
> >
> >
>
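
For reference, the ImageWritable idea Stefan mentions above could be
sketched roughly like this. The write/readFields pair mirrors the two
methods of Hadoop's Writable interface, but the class deliberately does
not depend on Hadoop so that it runs on its own. The class name and the
choice of fields are assumptions for illustration, not Stefan's code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record: compressed image bytes plus a little metadata.
class ImageRecord {
    String sourceUrl = "";
    int width;
    int height;
    byte[] data = new byte[0];   // e.g. the JPEG bytes of the thumbnail

    // Serializes the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(sourceUrl);
        out.writeInt(width);
        out.writeInt(height);
        out.writeInt(data.length);   // length prefix for the byte array
        out.write(data);
    }

    // Reads the fields back in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        sourceUrl = in.readUTF();
        width = in.readInt();
        height = in.readInt();
        data = new byte[in.readInt()];
        in.readFully(data);
    }

    // Convenience helpers for an in-memory round trip.
    static byte[] toBytes(ImageRecord record) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            record.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static ImageRecord fromBytes(byte[] bytes) {
        try {
            ImageRecord record = new ImageRecord();
            record.readFields(
                    new DataInputStream(new ByteArrayInputStream(bytes)));
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With Hadoop on the classpath, a class like this would implement
Writable and be stored as the value in a MapFile, keyed by something
like the image URL or the generated thumbnail name.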


