Hi Greg,

> I am wondering if it would be possible to integrate this kind of change
> in the upstream code base?

Yes, of course. Please open an issue in Jira, ideally with a patch
attached, see:
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_Three:_Using_the_JIRA_and_Developing

> I run Nutch 1.7 with a large domain-urlfilter.txt (4M+ domains).
> ...
> because each instance of the inner class Selector in Generator creates new
> instances of filters, normalizers and scfilters for each job.

Is Nutch run in distributed or local mode (in a Hadoop cluster or not)?

Thanks,
Sebastian

On 03/30/2014 05:19 AM, Yavinty wrote:
> Problem:
>
> I run Nutch 1.7 with a large domain-urlfilter.txt (4M+ domains). Nutch
> throws an OutOfMemoryError no matter how much RAM is allocated to the JVM.
> This is because each instance of the inner class Selector in Generator
> creates new instances of filters, normalizers and scfilters for each job.
> Considering that DomainURLFilter holds a set of 4M+ strings and is created
> once per job, memory use gets quite big.
>
> Solution:
>
> A solution seems to be to initialize singleton instances of the filters,
> normalizers and scfilters in the top-level Generator class and use them in
> each instance of Selector. I made this change in my Nutch instance and
> could finally pass the generation step with a large set of URLs and
> domains.
>
> I am wondering if it would be possible to integrate this kind of change
> in the upstream code base?
>
> Thanks,
> Greg
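For readers who want to see the shape of the change Greg describes, here is
a minimal sketch, assuming the Nutch 1.7 Generator/Selector layout.
URLFilters, URLNormalizers and ScoringFilters are the actual Nutch classes
that Selector instantiates per job; the lazy static accessors (getFilters
and friends) are hypothetical names introduced for this sketch, not
upstream API, and the exact wiring in Greg's patch may differ.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.scoring.ScoringFilters;

public class Generator {

  // Created once and shared across all Selector instances in this JVM,
  // instead of being rebuilt for every job.
  private static URLFilters filters;
  private static URLNormalizers normalizers;
  private static ScoringFilters scfilters;

  static synchronized URLFilters getFilters(Configuration conf) {
    if (filters == null)
      filters = new URLFilters(conf);
    return filters;
  }

  static synchronized URLNormalizers getNormalizers(Configuration conf) {
    if (normalizers == null)
      normalizers = new URLNormalizers(conf,
          URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
    return normalizers;
  }

  static synchronized ScoringFilters getScoringFilters(Configuration conf) {
    if (scfilters == null)
      scfilters = new ScoringFilters(conf);
    return scfilters;
  }

  public static class Selector {
    private URLFilters filters;
    private URLNormalizers normalizers;
    private ScoringFilters scfilters;

    // In Nutch 1.7 this is the Mapper/Partitioner/Reducer configure(JobConf)
    // hook; the only change is fetching the shared instances rather than
    // calling "new URLFilters(job)" etc. for every job.
    public void configure(Configuration job) {
      filters = Generator.getFilters(job);
      normalizers = Generator.getNormalizers(job);
      scfilters = Generator.getScoringFilters(job);
    }
  }
}

One design caveat: static singletons are shared only within a single JVM,
so this helps in local mode, where all jobs run in one process. In a
distributed Hadoop cluster each task runs in its own JVM, which is
presumably why Sebastian asks whether Nutch is run in distributed or local
mode.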

