This explains everything ;-) Well, I am at least happy if this testing helps you to develop Nutch further, so I will be patient. Looking forward to testing this patch next week.
Greetings

--- On Sat, 12/6/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> To: nutch-user@lucene.apache.org
> Date: Saturday, December 6, 2008, 2:46 PM
>
> ML mail wrote:
> > Dear Dennis
> >
> > I have got a workaround for the memory problem, using a normal Linux
> > server and not a VPS anymore, but I am now encountering a new problem:
> > starting with an initial crawl, no pages get crawled, as you can see here:
> >
> > crawl started in: /mnt/crawl_new
> > rootUrlDir = /tmp/url
> > threads = 70
> > depth = 1
> > topN = 100000
> > Injector: starting
> > Injector: crawlDb: /mnt/crawl_new/crawldb
> > Injector: urlDir: /tmp/url
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: /mnt/crawl_new/segments/20081206232112
> > Generator: filtering: true
> > Generator: topN: 100000
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=0 - no more URLs to fetch.
> > No URLs to fetch - check your seed list and URL filters.
> >
> > This time the urlfilter-domain plugin got loaded successfully, and I have
> > created a file called domain-urlfilter.txt in the conf directory containing
> > just the two characters "be", in order to index only domains ending with
> > the TLD be. The file which I use in <urlDir> contains around 80000 urls,
> > all ending in .be.
>
> The patch you have will only filter either by hostname ("www.apache.org")
> or by domain name ("apache.org"). Hostname will only allow urls with those
> specific hostnames. Domain name will allow any urls with that domain,
> including subdomains, so apache.org would also handle lucene.apache.org.
> The current patch doesn't handle ".com" for all .com domains, but we are
> working on improving it to be able to do so. Should have that patch out
> Monday or Tuesday.
>
> Thanks for testing all of this stuff out. I know the initial problems
> can be slow going.
>
> Dennis
>
> > So I don't really understand my problem here. Could it be that
> > urlfilter-domain doesn't work for a TLD?
> >
> > Thanks
> > Regards
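Based on Dennis's explanation above, the domain-urlfilter.txt for the current patch has to hold hostnames or domain names, one per line, rather than a bare TLD. A sketch of a valid file (the entries below are illustrative, not taken from this thread):

    # conf/domain-urlfilter.txt (illustrative)
    # a hostname entry allows only urls on that exact host:
    www.apache.org
    # a domain entry allows the domain plus all of its subdomains,
    # e.g. lucene.apache.org:
    apache.org
    # a bare TLD entry such as "be" is not yet understood by this
    # version of the patch, which is why the Generator selected
    # 0 records for fetching

Until the promised TLD support lands, a list of concrete .be domains would be needed instead.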
> > --- On Fri, 12/5/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >
> >> From: Dennis Kubes <[EMAIL PROTECTED]>
> >> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >> To: nutch-user@lucene.apache.org
> >> Date: Friday, December 5, 2008, 11:10 AM
> >>
> >> Yeah, that is what I was mentioning before. The SVN now contains the
> >> URLUtil patch. Patch everything except that one file, because it has
> >> already been updated with recent commits.
> >>
> >> Dennis
> >>
> >> ML mail wrote:
> >>> As recommended, I did an SVN checkout of Nutch trunk and patched it
> >>> with your patch. Now I wanted to build Nutch using ant, but it gets
> >>> stuck, as you can see here in the output of ant:
> >>>
> >>> Buildfile: build.xml
> >>>
> >>> init:
> >>>   [unjar] Expanding: /opt/nutch/lib/hadoop-0.19.0-core.jar into /opt/nutch/build/hadoop
> >>>   [untar] Expanding: /opt/nutch/build/hadoop/bin.tgz into /opt/nutch/bin
> >>>   [unjar] Expanding: /opt/nutch/lib/hadoop-0.19.0-core.jar into /opt/nutch/build
> >>>
> >>> compile-core:
> >>>   [javac] Compiling 207 source files to /opt/nutch/build/classes
> >>>   [javac] /opt/nutch/src/java/org/apache/nutch/util/URLUtil.java:343: getHost(java.lang.String) is already defined in org.apache.nutch.util.URLUtil
> >>>   [javac]   public static String getHost(String url) {
> >>>   [javac]                        ^
> >>>   [javac] /opt/nutch/src/java/org/apache/nutch/util/URLUtil.java:360: getPage(java.lang.String) is already defined in org.apache.nutch.util.URLUtil
> >>>   [javac]   public static String getPage(String url) {
> >>>   [javac]                        ^
> >>>   [javac] Note: Some input files use or override a deprecated API.
> >>>   [javac] Note: Recompile with -Xlint:deprecation for details.
> >>>   [javac] Note: Some input files use unchecked or unsafe operations.
> >>>   [javac] Note: Recompile with -Xlint:unchecked for details.
> >>>   [javac] 2 errors
> >>>
> >>> BUILD FAILED
> >>> /opt/nutch/build.xml:107: Compile failed; see the compiler error output for details.
> >>>
> >>> Total time: 6 seconds
> >>>
> >>> I am using jdk1.6.0_10. Do you know what's wrong here? Somehow it
> >>> doesn't like the new URLUtil code or so...
> >>>
> >>> Thanks
> >>> Regards
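The two "already defined" errors are exactly the situation Dennis describes: the patch re-adds getHost() and getPage() to a URLUtil.java that trunk already contains, so javac sees each method twice. One way to follow his advice to patch everything except that one file is to apply the patch and then revert that single file; the patch file name below is hypothetical, and the trunk URL is the one Nutch used at the time:

    svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch-trunk
    cd nutch-trunk
    patch -p0 < NUTCH-668.patch
    # trunk's URLUtil.java already carries these changes, so drop the
    # duplicate hunks by restoring the pristine file:
    svn revert src/java/org/apache/nutch/util/URLUtil.java
    ant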
> >>> --- On Thu, 12/4/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >>>> To: nutch-user@lucene.apache.org
> >>>> Date: Thursday, December 4, 2008, 3:27 PM
> >>>>
> >>>> idk, worked fine for me. Try a fresh svn pull and then do a small
> >>>> test run, see if that works for you, just for comparison.
> >>>>
> >>>> Dennis
> >>>>
> >>>> ML mail wrote:
> >>>>> This is great news about the 1.0 release!!
> >>>>>
> >>>>> So in the meantime I have tried what you said and copied over the
> >>>>> URLUtil.java file from SVN to my 0.9 installation and patched the
> >>>>> whole thing. The patch now went through fine, but still I can't see
> >>>>> any sign of the urlfilter-domain plugin getting loaded. hadoop.log
> >>>>> has no entries for the urlfilter-domain plugin... Do I need to
> >>>>> compile something or so?
> >>>>>
> >>>>> Regards
> >>>>>
> >>>>> --- On Thu, 12/4/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >>>>>> To: nutch-user@lucene.apache.org
> >>>>>> Date: Thursday, December 4, 2008, 1:44 PM
> >>>>>>
> >>>>>> Yup. I guess it would only work with the current SVN. We should
> >>>>>> have a stable 1.0 release out in the next couple of weeks that will
> >>>>>> hopefully include this patch. In the meantime you could hack it a
> >>>>>> little and copy the URLUtil file from svn into your current build,
> >>>>>> then apply the patch. (And actually I have been doing some commits,
> >>>>>> so if you just pull URLUtil.java from svn it will have everything
> >>>>>> you need.)
> >>>>>>
> >>>>>> Dennis
> >>>>>>
> >>>>>> ML mail wrote:
> >>>>>>> Thanks for the quick new patch. I have now tried it out, but patch
> >>>>>>> fails on src/java/org/apache/nutch/util/URLUtil.java, because it
> >>>>>>> doesn't exist, I suppose. I am using the stable Nutch 0.9 release,
> >>>>>>> so I guess your patch only works with an SVN release?
> >>>>>>>
> >>>>>>> We would prefer to use a stable release, but if there are no other
> >>>>>>> choices let me know and I will install an SVN release of Nutch and
> >>>>>>> apply this patch.
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Regards
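Spelled out as commands, the 0.9 hack Dennis suggests is just: fetch the current URLUtil.java out of trunk, then apply the patch on top of the stable release. The trunk URL and patch file name below are assumptions for illustration, not quoted from the thread:

    cd nutch-0.9
    # pull just the one file from trunk, overwriting the 0.9 copy:
    svn export --force \
        http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java \
        src/java/org/apache/nutch/util/URLUtil.java
    # then apply the DomainURLFilter patch from NUTCH-668:
    gpatch -p0 < NUTCH-668.patch   # plain "patch -p0" works as well
    ant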
> >>>>>>> --- On Thu, 12/4/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>>>
> >>>>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>>>> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >>>>>>>> To: nutch-user@lucene.apache.org
> >>>>>>>> Date: Thursday, December 4, 2008, 7:15 AM
> >>>>>>>>
> >>>>>>>> Try it again with the latest patch. You will need to build with
> >>>>>>>> Java 6 if you are not doing so. My plugin.includes looks like
> >>>>>>>> urlfilter-(domain|suffix|prefix)|... I did a clean build from svn,
> >>>>>>>> applied the 2nd patch and did a multistep fetch of apache.org.
> >>>>>>>> Looks like it works well for me. Let me know if you continue to
> >>>>>>>> have problems.
> >>>>>>>>
> >>>>>>>> Dennis
> >>>>>>>>
> >>>>>>>> ML mail wrote:
> >>>>>>>>> Dear Dennis,
> >>>>>>>>>
> >>>>>>>>> I have now applied this patch to my Nutch 0.9 (stable)
> >>>>>>>>> installation using "gpatch -p0 < patchfile" and changed the
> >>>>>>>>> plugin.includes parameter to include "urlfilter-(domain|suffix)"
> >>>>>>>>> in the nutch-default.xml file.
> >>>>>>>>>
> >>>>>>>>> I gave it a try using a fresh new test crawl and index, but it
> >>>>>>>>> somehow still indexes other top level domains. The
> >>>>>>>>> domain-urlfilter.txt file, which I have located in the conf dir
> >>>>>>>>> of Nutch, contains only "be", in order to index only the TLD be.
> >>>>>>>>>
> >>>>>>>>> After checking the hadoop.log file and grepping for
> >>>>>>>>> urlfilter-domain, I noticed that the plugin doesn't get loaded,
> >>>>>>>>> as it never appears in the logfile. So I guess my problem is
> >>>>>>>>> that it doesn't even load the plugin.
> >>>>>>>>>
> >>>>>>>>> Based on my description, did I miss something? Or do I need to
> >>>>>>>>> do something else to get it working?
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> Regards
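For anyone reproducing this: plugin.includes is an XML property, and overriding it in conf/nutch-site.xml is safer than editing nutch-default.xml, which gets overwritten on upgrade. The value below is a sketch assembled from the pieces mentioned in this thread (urlfilter-regex swapped for the domain/suffix/prefix filters, parse-js dropped), not Dennis's complete list:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(domain|suffix|prefix)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
      <!-- urlfilter-regex replaced by urlfilter-(domain|suffix|prefix);
           parse-js removed so javascript pages are no longer parsed -->
    </property>

A quick way to confirm the plugin was picked up is to grep hadoop.log for urlfilter-domain after a test crawl, as done above.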
> >>>>>>>>> --- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>>>>>
> >>>>>>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>>>>>> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >>>>>>>>>> To: nutch-user@lucene.apache.org
> >>>>>>>>>> Date: Tuesday, December 2, 2008, 5:47 PM
> >>>>>>>>>>
> >>>>>>>>>> Patch has been posted to JIRA for the DomainURLFilter plugin.
> >>>>>>>>>>
> >>>>>>>>>> https://issues.apache.org/jira/browse/NUTCH-668
> >>>>>>>>>>
> >>>>>>>>>> Dennis
> >>>>>>>>>>
> >>>>>>>>>> Dennis Kubes wrote:
> >>>>>>>>>>> Trying to get a patch posted tonight. Will probably be in the
> >>>>>>>>>>> 1.0 release, yes.
> >>>>>>>>>>>
> >>>>>>>>>>> Dennis
> >>>>>>>>>>>
> >>>>>>>>>>> John Martyniak wrote:
> >>>>>>>>>>>> That sounds like a good feature. Will this be in the 1.0
> >>>>>>>>>>>> release?
> >>>>>>>>>>>>
> >>>>>>>>>>>> -John
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Dec 2, 2008, at 5:17 PM, Dennis Kubes wrote:
> >>>>>>>>>>>>> John Martyniak wrote:
> >>>>>>>>>>>>>> That will be awesome. Will there be a limit to the number
> >>>>>>>>>>>>>> of domains that can be included?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Only on what can be stored in memory in a set. So technical
> >>>>>>>>>>>>> limit yes; practical limit, guessing a few million domains.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Dennis
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Dec 2, 2008, at 3:27 PM, Dennis Kubes wrote:
> >>>>>>>>>>>>>>> I am in the process of writing a domain-urlfilter. It
> >>>>>>>>>>>>>>> will allow fetching only from a list of top level
> >>>>>>>>>>>>>>> domains. Should have a patch out shortly. Hopefully that
> >>>>>>>>>>>>>>> will help you and others who are wanting to verticalize
> >>>>>>>>>>>>>>> nutch.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Dennis
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> ML mail wrote:
> >>>>>>>>>>>>>>>> Dear Dennis
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Many thanks for your quick response. Now everything is
> >>>>>>>>>>>>>>>> clear and I understand why it didn't work...
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I will still use the urlfilter-regex plugin, as I would
> >>>>>>>>>>>>>>>> like to crawl only domains from a single top level
> >>>>>>>>>>>>>>>> domain, but as suggested I have added the
> >>>>>>>>>>>>>>>> urlfilter-suffix plugin to avoid indexing javascript
> >>>>>>>>>>>>>>>> pages. In the past I had already deactivated the
> >>>>>>>>>>>>>>>> parse-js plugin. So I am now looking forward to the
> >>>>>>>>>>>>>>>> next crawls being freed of stupid file formats like
> >>>>>>>>>>>>>>>> js ;-)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Greetings
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> --- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>>>>>>>>>>>>> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >>>>>>>>>>>>>>>>> To: nutch-user@lucene.apache.org
> >>>>>>>>>>>>>>>>> Date: Tuesday, December 2, 2008, 8:50 AM
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> ML mail wrote:
> >>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I would definitely like not to index any javascript
> >>>>>>>>>>>>>>>>>> pages, meaning any pages ending with ".js". For this
> >>>>>>>>>>>>>>>>>> purpose I simply edited the crawl-urlfilter.txt file
> >>>>>>>>>>>>>>>>>> and added the .js extension to the default list of
> >>>>>>>>>>>>>>>>>> suffixes not to be parsed, so that it looks like
> >>>>>>>>>>>>>>>>>> this now:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> # skip image and other suffixes we can't yet parse
> >>>>>>>>>>>>>>>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> The easiest way IMO is to use the prefix and suffix
> >>>>>>>>>>>>>>>>> urlfilters instead of the regex urlfilter. Change
> >>>>>>>>>>>>>>>>> plugin.includes and replace urlfilter-regex with
> >>>>>>>>>>>>>>>>> urlfilter-(prefix|suffix). Then in the
> >>>>>>>>>>>>>>>>> suffix-urlfilter.txt file add .js under .css in web
> >>>>>>>>>>>>>>>>> formats. Also change plugin.includes from
> >>>>>>>>>>>>>>>>> parse-(text|html|js) to parse-(text|html).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Unfortunately I noticed that javascript pages are
> >>>>>>>>>>>>>>>>>> still getting indexed. So what does this exactly
> >>>>>>>>>>>>>>>>>> mean? Is crawl-urlfilter.txt not working? Did I miss
> >>>>>>>>>>>>>>>>>> something maybe?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I was also wondering what is the difference between
> >>>>>>>>>>>>>>>>>> these two files:
> >>>>>>>>>>>>>>>>>> crawl-urlfilter.txt
> >>>>>>>>>>>>>>>>>> regex-urlfilter.txt
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> The crawl-urlfilter.txt file is used by the crawl
> >>>>>>>>>>>>>>>>> command. The regex, suffix, prefix, and other
> >>>>>>>>>>>>>>>>> urlfilter files and plugins are used when calling
> >>>>>>>>>>>>>>>>> commands manually in various tools.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Dennis
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Many thanks
> >>>>>>>>>>>>>>>>>> Regards
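To close the loop on the suffix-filter route: with urlfilter-(prefix|suffix) in plugin.includes, conf/suffix-urlfilter.txt simply lists one suffix per line, and per Dennis's advice .js goes under .css in the web formats section. The surrounding lines here are illustrative, not the full shipped file:

    # conf/suffix-urlfilter.txt (excerpt, illustrative)
    # web formats
    .css
    .js

And as Dennis notes above, crawl-urlfilter.txt is only consulted by the one-step crawl command, while the regex/prefix/suffix filter files are used by the individual tools, so which file needs the change depends on how the crawl is launched.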