Re: new parse-html

Jack Tang Mon, 18 Apr 2005 02:18:34 -0700

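Hi,

First of all, check your URL filter file: crawl-urlfilter.txt (if you use
the "crawl" command) or regex-urlfilter.txt (otherwise). The default skip
rule excludes image suffixes, so remove them from it: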


# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
 |
V

-\.(css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
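
A minimal sketch of the effect of that change, using plain java.util.regex rather than Nutch's own filter class. The class name UrlFilterSketch, the helper isSkipped and the example.com URL are made up for illustration; the two patterns are the rule bodies above without the leading "-". It assumes the filter applies the pattern as a partial match against the URL:

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; this is not Nutch code.
class UrlFilterSketch {
    // Old rule body (leading '-' dropped): image suffixes are skipped.
    public static final Pattern OLD_SKIP = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$");

    // New rule body: gif, jpg and ico suffixes removed from the skip list.
    public static final Pattern NEW_SKIP = Pattern.compile(
        "\\.(css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$");

    // Assumption: a URL is skipped when the rule matches anywhere in it.
    public static boolean isSkipped(Pattern skipRule, String url) {
        return skipRule.matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://example.com/images/logo.gif"; // made-up URL
        System.out.println(isSkipped(OLD_SKIP, url)); // prints "true"
        System.out.println(isSkipped(NEW_SKIP, url)); // prints "false"
    }
}
```

With the old rule the .gif URL is dropped before it ever reaches the parser; with the new one it passes this particular rule, while suffixes such as .zip are still skipped.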


/Jack

On 4/18/05, Marco Pereira <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> I was trying to modify the files under src/plugin/parse-html in order to
> make it possible for nutch to index images (gif, jpeg, bmp, etc.).
> I've injected some urls to some gif files and, after commenting out
> the lines
> 
>    //if (!"".equals(contentType) && !contentType.startsWith("text/html"))
>    //  throw new ParseException("Content-Type not text/html: " + contentType);
> 
> I get the files indexed.
> 
> OK, I know it's bad, but it's just a start.
> I'm trying to index only the urls so nutch can search on the image name at
> least.
> 
> But the problem is how to follow <img src=> urls.
> I've tried adding a new line here:
> 
>  public static HashMap linkParams = new HashMap();
> 
>  static {
>      linkParams.put("a", new LinkParams("a", "href", 1));
>      linkParams.put("img", new LinkParams("img", "src", 1));
> 
> but it didn't work.
> 
> My goal was to make nutch search for images (nearly) the way google does,
> so that parsing the image file won't be needed: just indexing the image
> name, the page content, alt tags, etc.
> 
> Any suggestions or help? I would appreciate it.
> 
> Thanks!
> Marco
> 
