Re: Image Search Engine Input

Andrzej Bialecki Mon, 26 Mar 2007 23:53:07 -0800

Steve Severance wrote:

So now that I have spent a few hours looking into how this works a lot more
deeply I am in even more of a conundrum. The fetcher passes the contents of the
page to the parsers. It assumes that text will be output from the parsers.
For instance even the SWF parser returns text. For all binary data, images,
videos, music, etc... this is problematic. Potentially confounding the
problem even further in the case of music is that text and binary data can
come from the same file. Even if that is a problem I am not going to tackle
it.

Well, Nutch was originally intended as a text search engine. Lucene is a text search library, too - so all it knows is the plain text. If you want to use Nutch/Lucene for searching you will need to bring your data to a plain text format - at least the parts that you want to search against.

Now, when it comes to metadata, or other associated binary data, I'm sure we can figure out a way to store it outside the Lucene index, similar to the way the original content and parseData are already stored outside Lucene indexes.


-------

I've been thinking about an extension to the current "segment" format, which would allow arbitrary parts to be created (and retrieved) - this is actually needed to support a real-life application. It's a simple extension of the current model. Currently segments consist of a fixed number of pre-defined parts (content, crawl_generate, crawl_fetch, parse_data, parse_text). But it shouldn't be too difficult to extend segment tools and NutchBean to handle segments consisting of these basic parts plus other arbitrary parts.

In your case: you could have an additional segment part that stores post-processed images in binary format (you already have the original ones in content/). Another example: we could convert PDF/DOC/PPT files to HTML, and store this output in the "HTML preview" part.
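
To make the "basic parts plus arbitrary parts" idea concrete, here is a minimal sketch in plain Java. It is not Nutch code and does not use the real segment tools or NutchBean: the class name ExtensibleSegment, the part name "image_thumbs", and the key "doc-0001" are all hypothetical, and each part is modelled as a plain directory of files rather than the on-disk formats Nutch actually uses.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: NOT the Nutch segment API.
public class ExtensibleSegment {
    // The pre-defined parts listed in the message above.
    static final List<String> BASIC_PARTS = List.of(
        "content", "crawl_generate", "crawl_fetch", "parse_data", "parse_text");

    private final Path segmentDir;
    private final Set<String> parts = new LinkedHashSet<>(BASIC_PARTS);

    public ExtensibleSegment(Path segmentDir) {
        this.segmentDir = segmentDir;
    }

    // Register an additional, arbitrary part next to the basic ones.
    public void addPart(String name) {
        if (name.isEmpty() || name.contains("/")) {
            throw new IllegalArgumentException("invalid part name: " + name);
        }
        parts.add(name);
    }

    // Store a binary value under a key in the given part.
    // Assumption: the key is usable as a file name (so not a raw URL).
    public void put(String part, String key, byte[] value) throws IOException {
        if (!parts.contains(part)) {
            throw new IllegalArgumentException("unknown part: " + part);
        }
        Path dir = segmentDir.resolve(part);
        Files.createDirectories(dir);
        Files.write(dir.resolve(key), value);
    }

    // Retrieve the binary value stored under a key in the given part.
    public byte[] get(String part, String key) throws IOException {
        return Files.readAllBytes(segmentDir.resolve(part).resolve(key));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("segment");
        ExtensibleSegment segment = new ExtensibleSegment(dir);
        segment.addPart("image_thumbs");
        segment.put("image_thumbs", "doc-0001", new byte[] {1, 2, 3});
        System.out.println(segment.get("image_thumbs", "doc-0001").length);
    }
}
```

The point of the sketch is only that a new part is registered by name and then read and written like the basic ones; how the real tools would discover and merge such parts is left open here.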


So there are 3 choices for moving forward with an image search,

1. All image data can be encoded as strings. I really don't like that choice
since the indexer will index huge amounts of junk.
2. The fetcher can be modified to allow another output for binary data. This
I think is the better choice although it will be a lot more work. I am not
sure that this is possible with MapReduce since MapRunnable has only 1
output.

No, not really - the number of output files is defined in the implementation of OutputFormat - but it's true that you can only set a single output location (and then you have to figure out how you want to put various stuff relative to that single location). There are existing implementations of OutputFormat-s that create more than 1 file at the same time - see ParseOutputFormat.
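
As a rough analogy for "one output location, several files", the following plain-Java sketch shows a single writer that fans each record out to two files under one directory. It deliberately avoids the Hadoop OutputFormat and RecordWriter interfaces, so the class name MultiPartWriter and its methods are hypothetical, and the two outputs are simple tab-separated text files rather than the formats ParseOutputFormat really writes.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only: NOT the Hadoop OutputFormat API.
public class MultiPartWriter implements AutoCloseable {
    private final BufferedWriter textOut;
    private final BufferedWriter dataOut;

    // A single output location; both files are placed relative to it.
    public MultiPartWriter(Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        textOut = Files.newBufferedWriter(
            outputDir.resolve("parse_text"), StandardCharsets.UTF_8);
        dataOut = Files.newBufferedWriter(
            outputDir.resolve("parse_data"), StandardCharsets.UTF_8);
    }

    // One logical record is split across the two outputs.
    public void write(String key, String text, String metadata) throws IOException {
        textOut.write(key + "\t" + text);
        textOut.newLine();
        dataOut.write(key + "\t" + metadata);
        dataOut.newLine();
    }

    @Override
    public void close() throws IOException {
        try {
            textOut.close();
        } finally {
            dataOut.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("multiout");
        try (MultiPartWriter writer = new MultiPartWriter(out)) {
            writer.write("http://example.com/a.png", "a picture", "width=10");
        }
        System.out.println(Files.readAllLines(out.resolve("parse_text")));
    }
}
```

A third output for binary image data would be one more stream opened relative to the same directory, which is the sense in which a single output location does not limit the number of files.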

3. Images can be written into another directory for processing. This would
need more work to automate but is probably a non-issue.

I want to do the right thing so that the image search can eventually be in
the trunk. I don't want to have to change the way a lot of things work in
the process. Let me know what you all think.

I think we should work together on proposed API changes for this "extensible part" interface, plus probably some changes to the Parse API. I can create a JIRA issue and provide some initial patches.



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
