Hi Greg,

> I am wondering if it would be possible to integrate this kind of change
> into the upstream code base?

Yes, of course. Please open an issue in Jira, ideally with a patch
attached; see:
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_Three:_Using_the_JIRA_and_Developing

> I run Nutch 1.7 with a large domain-urlfilter.txt (4M+ domains).
> ...
> because each instance of the inner class Selector in Generator creates new
> instances of filters, normalizers and csfilters for each job.

Is Nutch running in distributed or local mode (in a Hadoop cluster or not)?

Thanks,
Sebastian


On 03/30/2014 05:19 AM, Yavinty wrote:
> Problem:
> 
> I run Nutch 1.7 with a large domain-urlfilter.txt (4M+ domains). Nutch
> throws OutOfMemoryError no matter how much RAM is allocated to the JVM.
> This is because each instance of the inner class Selector in Generator
> creates new instances of filters, normalizers and csfilters for each
> job. Considering that DomainURLFilter holds a set of 4M+ strings and is
> instantiated once per job, the memory footprint gets quite big.
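> 
> For reference, Selector.configure() builds those objects once per job,
> roughly like this (paraphrased from the 1.7 sources, not verbatim; the
> field is called scfilters there):
> 
>     // inside Generator.Selector; Hadoop calls this once per job/task
>     public void configure(JobConf job) {
>       // every call constructs fresh plugin instances, including a
>       // DomainURLFilter that loads the full 4M+ entry domain set
>       filters = new URLFilters(job);
>       normalizers = new URLNormalizers(job,
>           URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
>       scfilters = new ScoringFilters(job);
>       // ... other per-job settings ...
>     }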
> 
> Solution:
> 
> A solution seems to be to initialize singleton instances of the filters,
> normalizers and csfilters in the top-level Generator class and use them
> in each instance of Selector. I made this change in my Nutch instance
> and could finally pass the generation step with a large set of URLs and
> domains.
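> 
> A minimal sketch of that change (URLFilters, URLNormalizers and
> ScoringFilters are the real Nutch classes; the static fields and the
> initShared() helper are illustrative names, not the actual patch):
> 
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.mapred.JobConf;
>     import org.apache.nutch.net.URLFilters;
>     import org.apache.nutch.net.URLNormalizers;
>     import org.apache.nutch.scoring.ScoringFilters;
> 
>     public class Generator {
> 
>       // shared by every Selector instance in this JVM, so the 4M+
>       // entry DomainURLFilter set is built only once
>       private static URLFilters filters;
>       private static URLNormalizers normalizers;
>       private static ScoringFilters scfilters;
> 
>       // lazily builds the shared instances exactly once
>       private static synchronized void initShared(Configuration conf) {
>         if (filters == null) {
>           filters = new URLFilters(conf);
>           normalizers = new URLNormalizers(conf,
>               URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
>           scfilters = new ScoringFilters(conf);
>         }
>       }
> 
>       public static class Selector /* implements Mapper, Partitioner,
>                                        Reducer as in the original */ {
>         public void configure(JobConf job) {
>           initShared(job);  // reuse instead of new URLFilters(job) etc.
>           // ... rest of the per-job configuration unchanged ...
>         }
>       }
>     }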
> 
> I am wondering if it would be possible to integrate this kind of change
> into the upstream code base?
> 
> Thanks,
> Greg
> 
