Re: [Nutch-dev] Image Search Engine Input

Mathijs Homminga Mon, 26 Mar 2007 23:15:15 -0800

Hi Steve,

Good point.
We are also working on an image search. For the time being, we store the 
parsed content (a downscaled version of the image) by replacing the 
original content during parsing. Not an ideal solution, I know!


My first reaction is that your 2nd suggestion is the way to go.

On the other hand, we prefer to keep our images outside the segments so 
we can access and modify them more easily (fast retrieval at search time 
is a must for presentation). So we were thinking of some kind of image 
db using Berkeley DB (formerly Sleepycat, now Oracle).
Our indexer doesn't need the actual images themselves; it works on a 
fingerprint which is computed at parse time and stored in the document's 
metadata as a string.
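
As an illustration of keeping a fingerprint as a metadata string, here is a
minimal sketch. It assumes a simple 64-bit average hash, which is only a
stand-in: this message does not say which fingerprint algorithm is used. The
class name ImageFingerprint, the key "image.fingerprint", and the plain Map
standing in for the document's metadata are all hypothetical.

```java
import java.awt.image.BufferedImage;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; not part of Nutch. */
final class ImageFingerprint {
    /** Assumed metadata key, chosen only for this example. */
    static final String METADATA_KEY = "image.fingerprint";

    private ImageFingerprint() {}

    /** Computes a 64-bit average hash and returns it as 16 hex characters. */
    static String averageHash(BufferedImage image) {
        final int size = 8; // 8 x 8 grid gives 64 bits
        int w = image.getWidth();
        int h = image.getHeight();
        int[] gray = new int[size * size];
        long sum = 0;
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                // Sample the pixel at the centre of each grid cell.
                int sx = (2 * x + 1) * w / (2 * size);
                int sy = (2 * y + 1) * h / (2 * size);
                int rgb = image.getRGB(sx, sy);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int lum = (r * 299 + g * 587 + b * 114) / 1000;
                gray[y * size + x] = lum;
                sum += lum;
            }
        }
        long mean = sum / gray.length;
        long bits = 0L;
        for (int value : gray) {
            // One bit per cell: 1 if brighter than the mean, else 0.
            bits <<= 1;
            if (value > mean) {
                bits |= 1L;
            }
        }
        return String.format("%016x", bits);
    }

    /** Stores the fingerprint string under the assumed metadata key. */
    static void store(Map<String, String> metadata, BufferedImage image) {
        metadata.put(METADATA_KEY, averageHash(image));
    }

    /** Number of differing bits between two fingerprints (smaller is more similar). */
    static int distance(String a, String b) {
        return Long.bitCount(Long.parseUnsignedLong(a, 16) ^ Long.parseUnsignedLong(b, 16));
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(16, 16, BufferedImage.TYPE_INT_RGB);
        Map<String, String> metadata = new HashMap<>();
        store(metadata, img);
        System.out.println(metadata);
    }
}
```

Because the fingerprint is a short string, the indexer only has to read the
metadata and never needs to touch the image bytes.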

Mathijs

Steve Severance wrote:
> So now that I have spent a few hours looking into how this works a lot more
> deeply I am in even more of a conundrum. The fetcher passes the contents of the
> page to the parsers. It assumes that text will be output from the parsers.
> For instance even the SWF parser returns text. For all binary data, images,
> videos, music, etc... this is problematic. Potentially confounding the
> problem even further in the case of music is that text and binary data can
> come from the same file. Even if that is a problem I am not going to tackle
> it. 
>
> So there are 3 choices for moving forward with an image search:
>
> 1. All image data can be encoded as strings. I really don't like that choice
> since the indexer will index huge amounts of junk.
> 2. The fetcher can be modified to allow another output for binary data. This
> I think is the better choice although it will be a lot more work. I am not
> sure that this is possible with MapReduce since MapRunnable has only 1
> output.
> 3. Images can be written into another directory for processing. This would
> need more work to automate but is probably a non-issue.
>
> I want to do the right thing so that the image search can eventually be in
> the trunk. I don't want to have to change the way a lot of things work in
> the process. Let me know what you all think.
>
> Steve
>
>   
>> -----Original Message-----
>> From: Steve Severance [mailto:[EMAIL PROTECTED]
>> Sent: Monday, March 26, 2007 4:04 PM
>> To: [email protected]
>> Subject: Image Search Engine Input
>>
>> Hey all,
>> I am working on the basics of an image search engine. I want to ask for
>> feedback on something.
>>
>> Should I create a new directory in a segment parse_image? And then put the
>> images there? If not where should I put them, in the parse_text? I created a
>> class ImageWritable just like the Jira task said. This class contains image
>> meta data as well as two BytesWritable for the original image and the
>> thumbnail.
>>
>> One more question, what ramifications does that have for the type of Parse
>> that I am returning? Do I need to create a ParseImage class to hold it? The
>> actual parsing infrastructure is something that I am still studying so any
>> ideas here would be great. Thanks,
>>
>> Steve
>>     
>
>   
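
The ImageWritable described in the quoted original message (image metadata
plus the original image and a thumbnail) might be sketched roughly as below.
This is not Steve's class: the name ImageRecordSketch, its fields, and the
toBytes/fromBytes helpers are hypothetical. So that it compiles without
Hadoop on the classpath, it uses plain byte[] where the message mentions
BytesWritable and does not implement Hadoop's Writable interface, although
write(DataOutput) and readFields(DataInput) follow the same method shape.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical record: metadata plus original and thumbnail bytes. */
final class ImageRecordSketch {
    final Map<String, String> metadata = new TreeMap<>();
    byte[] original = new byte[0];
    byte[] thumbnail = new byte[0];

    /** Serializes metadata first, then the two length-prefixed byte arrays. */
    public void write(DataOutput out) throws IOException {
        out.writeInt(metadata.size());
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            // writeUTF is limited to 65535 encoded bytes per string.
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        writeBytes(out, original);
        writeBytes(out, thumbnail);
    }

    /** Reads the fields back in exactly the order write() emitted them. */
    public void readFields(DataInput in) throws IOException {
        metadata.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            metadata.put(key, value);
        }
        original = readBytes(in);
        thumbnail = readBytes(in);
    }

    private static void writeBytes(DataOutput out, byte[] data) throws IOException {
        out.writeInt(data.length);
        out.write(data);
    }

    private static byte[] readBytes(DataInput in) throws IOException {
        byte[] data = new byte[in.readInt()];
        in.readFully(data);
        return data;
    }

    /** Convenience helper for the example: serialize to a byte array. */
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience helper for the example: deserialize from a byte array. */
    static ImageRecordSketch fromBytes(byte[] bytes) {
        try {
            ImageRecordSketch record = new ImageRecordSketch();
            record.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ImageRecordSketch record = new ImageRecordSketch();
        record.metadata.put("width", "640");
        record.thumbnail = new byte[] {1, 2, 3};
        ImageRecordSketch copy = fromBytes(record.toBytes());
        System.out.println(copy.metadata + " thumbnail bytes: " + copy.thumbnail.length);
    }
}
```

Keeping the thumbnail and the original as separate length-prefixed fields
means a reader that only wants the thumbnail still has to read past the
original, which is one argument for storing the images outside the segment.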
