Hey guys. Thanks for the replies. 

> -----Original Message-----
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 27, 2007 3:52 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Image Search Engine Input
> 
> Steve Severance wrote:
> > So now that I have spent a few hours looking into how this works a lot
> > more deeply, I am in even more of a conundrum. The fetcher passes the
> > contents of the page to the parsers and assumes that text will be output
> > from them. For instance, even the SWF parser returns text. For all
> > binary data - images, videos, music, etc. - this is problematic.
> > Potentially confounding the problem even further, in the case of music,
> > text and binary data can come from the same file. Even if that is a
> > problem, I am not going to tackle it.
> 
> 
> Well, Nutch was originally intended as a text search engine. Lucene is a
> text search library, too - so all it knows is plain text. If you want to
> use Nutch/Lucene for searching you will need to bring your data to a
> plain text format - at least the parts that you want to search against.
> 
> Now, when it comes to metadata, or other associated binary data, I'm
> sure we can figure out a way to store it outside the Lucene index, in a
> similar way the original content and parseData are already stored
> outside Lucene indexes.

I am not really looking to make an image retrieval engine. During indexing, 
referencing documents will be analyzed and their text content will be associated 
with the image. For now I want to keep this in a separate index. So despite the 
fact that images will be returned, the search itself will run against text data.
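To make the idea concrete, here is a rough stdlib-only sketch of what I mean - this is not Lucene or Nutch API, and all names are made up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch: text from referencing docs (anchor text, alt attributes,
// surrounding words) is indexed per term, and a hit on that text returns
// the image URLs it was associated with. Class and method names invented.
class ImageTextIndex {
    // term -> image URLs whose associated text contained the term
    private final Map<String, List<String>> postings =
        new HashMap<String, List<String>>();

    void associate(String imageUrl, String textFromReferencingDoc) {
        for (String term : textFromReferencingDoc.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, k -> new ArrayList<>()).add(imageUrl);
        }
    }

    // Searching is purely against the associated text; the image is only
    // what comes back in the result list.
    List<String> search(String term) {
        List<String> hits = postings.get(term.toLowerCase());
        return hits == null ? new ArrayList<String>() : hits;
    }
}
```

In the real thing this would of course be a Lucene index rather than a hash map, but the division of labor is the same: text goes in, image references come out.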

> 
> -------
> 
> I've been thinking about an extension to the current "segment" format,
> which would allow arbitrary parts to be created (and retrieved) - this
> is actually needed to support a real-life application. It's a simple
> extension of the current model. Currently segments consist of a fixed
> number of pre-defined parts (content, crawl_generate, crawl_fetch,
> parse_data, parse_text). But it shouldn't be too difficult to extend
> segment tools and NutchBean to handle segments consisting of these
> basic parts plus other arbitrary parts.
> 
> In your case: you could have an additional segment part that stores
> post-processed images in binary format (you already have the original
> ones in content/). Another example: we could convert PDF/DOC/PPT files
> to HTML, and store this output in the "HTML preview" part.
> 

It would then be possible for plugins to write to (and read from) additional 
directories. That would be great.
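Just to illustrate how I picture such arbitrary parts living on disk next to the fixed ones - a plain Java sketch, not the actual segment API, with invented names:

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a segment directory holds the five fixed parts
// plus any number of extra parts (e.g. "images_thumbs"), all resolved
// the same way. None of this is real Nutch code.
class Segment {
    static final List<String> CORE_PARTS = Arrays.asList(
        "content", "crawl_generate", "crawl_fetch", "parse_data", "parse_text");

    // A plugin asks for a part directory by name; core parts and
    // arbitrary extras are handled uniformly.
    static File partDir(File segmentRoot, String partName) {
        File dir = new File(segmentRoot, partName);
        dir.mkdirs();
        return dir;
    }
}
```

The point is just that segment tools and NutchBean would treat `partDir(root, "images_thumbs")` no differently from `partDir(root, "parse_text")`.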

> 
> >
> > So there are 3 choices for moving forward with an image search:
> >
> > 1. All image data can be encoded as strings. I really don't like that
> > choice, since the indexer will index huge amounts of junk.
> > 2. The fetcher can be modified to allow another output for binary
> > data. I think this is the better choice, although it will be a lot
> > more work. I am not sure that this is possible with MapReduce, since
> > MapRunnable has only one output.
> 
> No, not really - the number of output files is defined in the
> implementation of OutputFormat - but it's true that you can only set a
> single output location (and then you have to figure out how you want to
> put various stuff relative to that single location). There are existing
> OutputFormat implementations that create more than one file at the same
> time - see ParseOutputFormat.

Yeah, I got that. I just don't want there to be another implementation that has 
to be maintained, or to add images directly into the output format. What happens 
when someone wants to do music or videos? Are we going to add those as well? I 
don't think we should go down that road, but if I am wrong, let me know.
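For the archive, here is how I understand the ParseOutputFormat pattern: one logical output location fanning out to several files. This is a toy sketch in plain java.io, not the Hadoop OutputFormat API, and all names are invented:

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

// Toy sketch of the "one output path, multiple files" idea: a single
// write() fans out to two files under the same segment directory, the
// way ParseOutputFormat writes parse_text and parse_data side by side.
// Not real Hadoop/Nutch code; names are illustrative only.
class SplitOutput {
    private final PrintWriter textOut;
    private final PrintWriter dataOut;

    SplitOutput(File segmentDir) throws IOException {
        new File(segmentDir, "parse_text").mkdirs();
        new File(segmentDir, "parse_data").mkdirs();
        textOut = new PrintWriter(new File(segmentDir, "parse_text/part-00000"));
        dataOut = new PrintWriter(new File(segmentDir, "parse_data/part-00000"));
    }

    // The MapReduce job still sees a single output location; the fan-out
    // happens inside the writer.
    void write(String url, String text, String metadata) {
        textOut.println(url + "\t" + text);
        dataOut.println(url + "\t" + metadata);
    }

    void close() {
        textOut.close();
        dataOut.close();
    }
}
```

My worry stands, though: every new media type baked into such a writer is another file layout to maintain.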

> 
> 
> > 3. Images can be written into another directory for processing. This
> > would need more work to automate but is probably a non-issue.
> >
> > I want to do the right thing so that the image search can eventually
> > be in the trunk. I don't want to have to change the way a lot of
> > things work in the process. Let me know what you all think.
> 
> I think we should work together on proposed API changes to this
> "extensible part" interface, plus probably some changes to the Parse
> API. I can create a JIRA issue and provide some initial patches.
> 

I like Mathijs's suggestion of using a DB for holding thumbnails. I just want 
access to be constant time, since I will probably need to grab at least 10 and 
maybe 50 thumbnails per query. That could be kept in the plugin as an option or 
something like that. Does that have any ramifications for running on Hadoop?
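The access pattern I care about is simply this (an in-memory stand-in, obviously not the DB Mathijs suggested; names are made up):

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the thumbnail store idea: constant-time lookup
// keyed by image URL. A real deployment would put this behind a DB or an
// on-disk keyed store; this only illustrates the access pattern.
class ThumbnailStore {
    private final Map<String, byte[]> thumbs = new HashMap<String, byte[]>();

    void put(String imageUrl, byte[] jpegBytes) {
        thumbs.put(imageUrl, jpegBytes);
    }

    // O(1) expected lookup: fetching the 10-50 thumbnails for a result
    // page costs 10-50 probes, independent of collection size.
    byte[] get(String imageUrl) {
        return thumbs.get(imageUrl);
    }
}
```

Whatever backing store we pick just needs that keyed-get behavior at query time.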

To sum up, I think we are going to make an extensible interface that allows 
parse plugins to write to directories other than the ones that currently 
exist. Please correct me if that is wrong.

Regards,

Steve

> 
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
