Hi

Thank you for your feedback! Appreciate it.

Currently, there are no tools apart from the ones you have already
experimented with (topN and generate.min.score) to direct the crawl towards
the top-scoring URLs.

I wonder why generate.min.score did not work. I looked into the
code, and it turns out the property takes a float value, not an integer. This line,
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L162,
parses the score as a float. Could you try using a float value? If
you already have and hit an error, could you share the logs?
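
In case it helps, here is a minimal sketch of how the override might look
in conf/nutch-site.xml. The 0.3 threshold is only a placeholder I picked
for illustration, not a recommended value; you would choose it based on
the range of cosine scores you see in your crawldb.

```xml
<!-- Sketch only: 0.3 is a hypothetical threshold, tune it to your scores -->
<property>
  <name>generate.min.score</name>
  <value>0.3</value>
</property>
```

With a fractional value like this, URLs scoring below the threshold should
be skipped when the next fetch list is generated.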

Cheers,
Sujen




On Sat, Nov 26, 2016 at 8:53 PM, Dr. Brian Leverich <
[email protected]> wrote:

>
> Hi -
>
> I've used your cosine filter to score documents as I do
> narrowly-focused crawls, and it works extremely well.
>
> Thanks for developing it!
>
> I'm still a bit of a Nutch newbie, though, and I'm just
> not getting Nutch configured right to take full advantage
> of the scoring.
>
> I've tried using generate.min.score to drop low-scoring links,
> but it appears to be an integer value and hence not very useful
> for discriminating in conjunction with your cosine scorer.
>
> I've also adjusted TopN for each iteration of the crawl, but
> that requires a lot of manual effort.
>
> If I must, though, I'll script an adaptive TopN based on the
> range of scores and size of the current crawldb.
>
> Before I start rolling my own, though, have you published any
> guides or do you have any advice regarding how to best
> configure Nutch for using your cosine filter to explore a
> domain deeply?
>
> Thanks much for your thoughts!
>
> Cheers, B.
> [email protected]
>
