Hey Martin,

Thanks for the link - that's pretty close to what I was looking for, and I'll
give it a shot! The discussion which led to the thread you pointed out was
even better!

Cheers,
Viksit

On Jan 7, 2008 3:28 AM, Martin Kuen <[EMAIL PROTECTED]> wrote:

> Hi Viksit,
>
> maybe you are looking for this thread:
> http://www.nabble.com/Re%3A-The-ranking-is-wrong-tf4360656.html#a12436465
>
> Cheers,
>
> Martin
>
>
> PS: nutch-user is the correct option. nutch-agent is primarily for
> site-owners who want to report misbehaving nutch bots.
>
> On Jan 7, 2008 4:52 AM, Viksit Gaur <[EMAIL PROTECTED]> wrote:
> > Hello all,
> >
> > I was trying to figure out the best method to crawl a site without
> > getting any of the irrelevant bits such as flash widgets, javascript,
> > links to ad networks, and others. The objective is to index all relevant
> > textual data. (This may be extrapolated to other forms of data, of
> > course.)
> >
> > My main question is - should this sort of elimination be done during the
> > crawl, which would mean modifying the crawler; or should everything be
> > crawled, indexed, and then have a text parsing system with some logic to
> > extract the relevant bits?
> >
> > Using the crawl-urlfilter seems like the first option, but I believe it
> > has its drawbacks. Firstly, it needs regexps which match URLs, which
> > would have to be handwritten (even automated scripts would need human
> > manipulation at some point). For instance, the scripts or images may be
> > hosted at scripts.foo.com or foo.com/bar/foobar/scripts - both entries
> > are far enough apart to make automation tough. And any such customizations
> > would need to be tailor made for each site crawled - a tall task. Is
> > there a way to extend the crawler itself to do this? I remember seeing
> > something on the list archives about extending the crawler, but I can't
> > find it anymore. Any pointers?
> >
> > The second option was to write some sort of a custom class for the
> > indexer (a form of the pluginexample on the wiki I guess).
> >
> > Either way, I'm not sure what the better method is. Any ideas would be
> > appreciated!
> >
> > Cheers,
> > Viksit
> >
> > PS: Cross-posted on nutch-user and nutch-agent, since I wasn't sure
> > which one was a better option.
> >
>
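For reference, the URL-filter option from the original question can be sketched roughly as follows. This is only an illustration in Python, not Nutch's own code: it assumes rules written as "+regex" (accept) or "-regex" (reject), that the first matching rule decides, and that a URL matching no rule is rejected. The hosts and paths are the examples from the message above; the extension list and the helper names `parse_rules` and `accepts` are made up for the sketch.

```python
import re

# Hypothetical rule set in a "+regex / -regex" style. Each rule had to be
# written by hand for this one site, which is the drawback raised above.
RULES = r"""
# skip script, style, flash, and image resources by extension
-\.(js|css|swf|gif|jpg|png)(\?.*)?$
# skip a dedicated script host and a script directory
-^https?://scripts\.foo\.com/
-^https?://foo\.com/bar/foobar/scripts/
# accept everything else on foo.com and its subdomains
+^https?://([a-z0-9-]+\.)*foo\.com/
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches nothing is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


rules = parse_rules(RULES)
```

With these rules, `accepts(rules, "http://foo.com/articles/1.html")` is true, while anything under scripts.foo.com or foo.com/bar/foobar/scripts/ is rejected. Both script locations needed their own rule, which shows why this is hard to automate across many sites.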

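The other option from the original question, crawling everything and extracting the relevant text afterwards, might look something like the sketch below. It uses only Python's standard `html.parser` module and is not a Nutch plugin. The set of skipped tags and the names `TextExtractor` and `extract_text` are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Elements whose contents are treated as irrelevant (an assumed list).
SKIPPED_TAGS = {"script", "style", "object", "iframe", "noscript"}


class TextExtractor(HTMLParser):
    """Collects text that is not nested inside any skipped element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML page with skipped elements left out."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

This keeps the crawler untouched and needs no per-site URL rules, at the cost of fetching pages and resources that are later thrown away.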