Re: how to parse html files while crawling

Alexander Aristov Sun, 18 Apr 2010 22:09:32 -0700

Hi

Your task is clear, but the solution is not simple. That is why so many
companies compete for users by trying to show them relevant
results.

Nutch is not clever enough to work out whether a page is an opinion or just an
advertisement, so you must teach it yourself.

First of all, you need to find out what distinguishes opinions from other
pages. It might be a special structure, a special tag, or a keyword.

Next, you must rewrite the HTML parser (indexer) and add the necessary logic to
filter opinions from other pages.
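
As a rough illustration of those two steps, here is a minimal standalone sketch
in Python (not Nutch code). It assumes that opinion pages can be recognised by
review-like class/id attributes plus a few keywords in the visible text. The
keyword list, the class hints, the weighting and the threshold are all
hypothetical and would need tuning against your own seed sites. Inside Nutch the
same logic would have to be ported to Java and placed in a parse or indexing
plugin.

```python
from html.parser import HTMLParser

# Hypothetical signals; adjust them for the sites in your seed list.
OPINION_KEYWORDS = ("review", "rating", "opinion", "recommend", "stars")
OPINION_CLASS_HINTS = ("review", "rating", "comment")


class OpinionSignalParser(HTMLParser):
    """Collects visible text and counts tags whose class/id hints at reviews."""

    def __init__(self):
        super().__init__()
        self.structural_hits = 0
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            return
        for name, value in attrs:
            if name in ("class", "id") and value:
                lowered = value.lower()
                if any(hint in lowered for hint in OPINION_CLASS_HINTS):
                    self.structural_hits += 1
                    break  # count each tag at most once

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies; keep only text a reader would see.
        if self._skip_depth == 0:
            self.text_parts.append(data)


def opinion_score(html):
    """Score = 2 * structural hits + keyword occurrences in visible text."""
    parser = OpinionSignalParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts).lower()
    keyword_hits = sum(text.count(word) for word in OPINION_KEYWORDS)
    return 2 * parser.structural_hits + keyword_hits


def is_opinion_page(html, threshold=3):
    """Keep the page only if its score reaches the (assumed) threshold."""
    return opinion_score(html) >= threshold
```

You would call is_opinion_page() on each fetched page and drop the pages for
which it returns False. A simple score like this will misclassify some pages,
so treat it as a starting point and check it against pages you have labelled
by hand.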

So this is basically what you should do.

Best Regards
Alexander Aristov

On 12 April 2010 11:23, NareshG <nareshraj...@gmail.com> wrote:

>
> Hi,
>
> I am also a newbie with Nutch.
> Our requirement is to do opinion crawling,
> i.e. we are looking to crawl only certain HTML pages which contain user
> opinions about products, items or movies; mainly, the page should contain
> opinions. Our seed list contains some review web sites like Amazon,
> MouthShut and some other sites.
> So during fetching of HTML pages, how should I verify whether an HTML page
> is an opinion or not?
> Nutch experts (Nutch researchers), please share your comments and suggestions.
> I hope the requirement is clear.
>
> Thanks and regards,
> Naresh
> --
> View this message in context:
> http://n3.nabble.com/how-to-parse-html-files-while-crawling-tp706816p712846.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
