Hi Viksit,

Maybe you are looking for this thread:
http://www.nabble.com/Re%3A-The-ranking-is-wrong-tf4360656.html#a12436465

Cheers,

Martin


PS: nutch-user is the correct option. nutch-agent is primarily for
site-owners who want to report misbehaving Nutch bots.

On Jan 7, 2008 4:52 AM, Viksit Gaur <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I was trying to figure out the best method to crawl a site without
> picking up any of the irrelevant bits such as flash widgets, javascript,
> and links to ad networks. The objective is to index all relevant
> textual data. (This could be extrapolated to other forms of data, of course.)
>
> My main question is - should this sort of elimination be done during the
> crawl, which would mean modifying the crawler, or should everything be
> crawled and indexed, with a text-parsing step afterwards that uses some
> logic to extract the relevant bits?
>
> Using the crawl-urlfilter seems like the first option, but I believe it
> has its drawbacks. Firstly, it needs regexps that match URLs, and these
> would have to be handwritten (even automated scripts would need human
> intervention at some point). For instance, the scripts or images may be
> hosted at scripts.foo.com or foo.com/bar/foobar/scripts - the two layouts
> are different enough to make automation tough. And any such customization
> would have to be tailor-made for each site crawled - a tall order. Is
> there a way to extend the crawler itself to do this? I remember seeing
> something in the list archives about extending the crawler, but I can't
> find it anymore. Any pointers?
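
Just for reference, crawl-urlfilter rules are plain include/exclude regex
lines applied top to bottom, with the first match winning. A rough sketch
of the kind of exclusions you describe (foo.com stands in for the real
site, the ad-network host is a placeholder, and whether this belongs in
crawl-urlfilter.txt or regex-urlfilter.txt depends on how you run the
crawl):

    # skip common non-content suffixes (javascript, css, flash, images)
    -\.(js|css|swf|gif|jpg|jpeg|png|ico)$
    # skip site-specific script locations (hand-written per site)
    -^http://scripts\.foo\.com/
    -^http://foo\.com/bar/foobar/scripts
    # skip a known ad network (placeholder hostname)
    -^http://([a-z0-9-]*\.)*ad-network\.example\.com/
    # accept everything else under the target site
    +^http://([a-z0-9-]*\.)*foo\.com/

As you say, though, only the suffix-based exclusions generalize well; the
host- and path-specific lines have to be hand-maintained per site.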
>
> The second option would be to write some sort of custom class for the
> indexer (a variant of the plugin example on the wiki, I guess).
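
If you go the second route, the extension point you want is
org.apache.nutch.indexer.IndexingFilter, the same one the
WritingPluginExample on the wiki implements. A minimal sketch of such a
filter follows; the class name, field name and length heuristic are made
up for illustration, and the signatures assume the 0.9-era API (Lucene
Document), which differs in later Nutch versions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.parse.Parse;

    public class RelevantTextFilter implements IndexingFilter {
      private Configuration conf;

      public Document filter(Document doc, Parse parse, Text url,
                             CrawlDatum datum, Inlinks inlinks)
          throws IndexingException {
        String text = parse.getText();
        // Hypothetical relevance heuristic: skip very short pages, which
        // are often widget/ad shells rather than real content.
        if (text == null || text.trim().length() < 200) {
          return null;  // returning null drops the document from the index
        }
        // Store the extracted text in an extra searchable field.
        doc.add(new Field("relevantText", text,
                          Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

You would also need the usual plugin.xml declaring the class against the
IndexingFilter extension point, plus an entry for the plugin id in
plugin.includes in nutch-site.xml so it gets loaded.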
>
> Either way, I'm not sure what the better method is. Any ideas would be
> appreciated!
>
> Cheers,
> Viksit
>
> PS: Cross-posted to nutch-user and nutch-agent, since I wasn't sure
> which one was the better option.
>
