Problem:

I run Nutch 1.7 with a large domain-urlfilter.txt (4M+ domains). Nutch throws
an OutOfMemoryError no matter how much RAM I allocate to the JVM. This is
because each instance of the inner class Selector in Generator creates new
instances of the filters, normalizers and scoring filters (scfilters) for each
job. When DomainURLFilter holds a set of 4M+ strings and is instantiated once
per job, the memory footprint adds up quickly.

Solution:

A solution seems to be to initialize single shared instances of the filters,
normalizers and scoring filters in the top-level Generator class and reuse
them in each Selector instance. I made this change in my Nutch instance and
was finally able to get through the generate step with a large set of URLs
and domains.
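For illustration, here is a rough sketch of the shape of the change, assuming
the stock URLFilters, URLNormalizers and ScoringFilters constructors. The
helper class name is just a placeholder; in the actual patch these would be
static fields/accessors on Generator itself:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilters;
    import org.apache.nutch.net.URLNormalizers;
    import org.apache.nutch.scoring.ScoringFilters;

    /** Sketch only: shared, lazily-built plugin objects reused by every
     *  Selector instead of being rebuilt in each Selector.configure(). */
    public class SharedSelectorPlugins {

      private static URLFilters filters;
      private static URLNormalizers normalizers;
      private static ScoringFilters scfilters;

      // DomainURLFilter's 4M+ domain set is loaded once per JVM here,
      // instead of once per generate job.
      public static synchronized URLFilters getFilters(Configuration conf) {
        if (filters == null) {
          filters = new URLFilters(conf);
        }
        return filters;
      }

      public static synchronized URLNormalizers getNormalizers(Configuration conf) {
        if (normalizers == null) {
          // The real Selector passes its own normalizer scope;
          // SCOPE_DEFAULT just keeps the sketch simple.
          normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_DEFAULT);
        }
        return normalizers;
      }

      public static synchronized ScoringFilters getScoringFilters(Configuration conf) {
        if (scfilters == null) {
          scfilters = new ScoringFilters(conf);
        }
        return scfilters;
      }
    }

Selector.configure(JobConf job) would then call e.g.
filters = SharedSelectorPlugins.getFilters(job) instead of
filters = new URLFilters(job), and likewise for the normalizers and
scoring filters.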

I am wondering whether it would be possible to integrate this kind of change
into the upstream code base?

Thanks,
Greg
