Hi
First of all, take care of crawl-urlfilter.txt (if you use the "crawl"
command) or regex-urlfilter.txt. The default rule skips image suffixes,
so remove gif, jpg and ico from it:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
|
V
-\.(css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
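
To see what the change does to an image URL, here is a minimal sketch in plain Java. It only models the regex part of the single "-" rule with java.util.regex and assumes the filter rejects a URL when the rule's pattern is found anywhere in it. The class name UrlFilterSketch, the helper isSkipped and the example.com URL are made up for illustration; this is not Nutch's own filter code.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Nutch.
public class UrlFilterSketch {
    // Regex of the default "-" rule: image suffixes are still listed.
    static final Pattern DEFAULT_SKIP = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$");

    // Regex of the edited "-" rule: gif, jpg and ico removed.
    static final Pattern EDITED_SKIP = Pattern.compile(
        "\\.(css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$");

    // Assumption: a URL is skipped when the "-" rule's pattern occurs in it.
    static boolean isSkipped(Pattern skipRule, String url) {
        return skipRule.matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://example.com/images/logo.gif"; // made-up URL
        System.out.println(isSkipped(DEFAULT_SKIP, url)); // true: filtered out
        System.out.println(isSkipped(EDITED_SKIP, url));  // false: passes this rule
    }
}
```

Note that a URL that gets past this rule still has to be accepted by a later "+" rule in the same file.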
/Jack
On 4/18/05, Marco Pereira <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I was trying to modify the files under src/plugin/parse-html in order to
> make it possible for Nutch to index images (gif, jpeg, bmp, etc.).
> I've injected some URLs pointing to gif files and, after commenting out
> the lines
>
> //if (!"".equals(contentType) && !contentType.startsWith("text/html"))
> //  throw new ParseException("Content-Type not text/html: " + contentType);
>
> I get the files indexed.
>
> Ok, I know it's bad. But it's just a start.
> I'm trying to index only the URLs so Nutch can at least search on the
> image name.
>
> But the problem is how to follow <img src=> URLs.
> I've tried to add a new line here:
>
> public static HashMap linkParams = new HashMap();
>
> static {
> linkParams.put("a", new LinkParams("a", "href", 1));
> linkParams.put("img", new LinkParams("img", "src", 1));
>
> but it didn't work.
>
> My goal was to make Nutch search for images (nearly) the way Google does,
> so that parsing the image file won't be needed: just indexing the image
> name, the page content, alt tags, etc.
>
> Any suggestions? I would appreciate any help.
>
> Thanks!
> Marco
>