Thanks for the quick new patch. I have now tried it out, but the patch fails on src/java/org/apache/nutch/util/URLUtil.java, presumably because that file doesn't exist in my tree. I am using the stable Nutch 0.9 release, so I guess your patch only works against an SVN checkout?
We would prefer to use a stable release, but if there is no other choice, let me know and I will install an SVN release of Nutch and apply this patch.

Thanks
Regards

--- On Thu, 12/4/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> To: nutch-user@lucene.apache.org
> Date: Thursday, December 4, 2008, 7:15 AM
>
> Try it again with the latest patch. You will need to build with Java 6
> if you are not doing so already. My plugin.includes looks like
> urlfilter-(domain|suffix|prefix)|... I did a clean build from svn,
> applied the 2nd patch and did a multistep fetch of apache.org. Looks
> like it works well for me. Let me know if you continue to have problems.
>
> Dennis
>
> ML mail wrote:
> > Dear Dennis,
> >
> > I have now applied this patch to my Nutch 0.9 (stable) installation
> > using "gpatch -p0 < patchfile" and changed the plugin.includes
> > parameter to include "urlfilter-(domain|suffix)" in the
> > nutch-default.xml file.
> >
> > I gave it a try with a fresh new test crawl and index, but it somehow
> > still indexes other top-level domains. The domain-urlfilter.txt file,
> > which I have located in the conf dir of Nutch, contains only "be" in
> > order to index only the .be TLD.
> >
> > After checking the hadoop.log file and grepping for urlfilter-domain,
> > I noticed that the plugin doesn't get loaded, as it never appears in
> > the log file. So I guess my problem is that it doesn't even load the
> > plugin.
> >
> > Based on my description, did I miss something? Or do I need to do
> > something else to get it working?
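For reference, a plugin is only loaded if its id matches the plugin.includes regex, which would explain urlfilter-domain never appearing in hadoop.log. A minimal sketch of the relevant property (the value below is illustrative, not the stock default; adjust it to the plugins actually installed):

```
<!-- nutch-site.xml (overrides nutch-default.xml); illustrative sketch -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(domain|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regex matched against plugin ids; if urlfilter-domain is
  not matched here, the plugin is never loaded and never logs anything.</description>
</property>
```

Overrides are normally placed in nutch-site.xml rather than edited into nutch-default.xml, so that upgrades don't silently revert them.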
> >
> > Thanks
> > Regards
> >
> > --- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >
> >> From: Dennis Kubes <[EMAIL PROTECTED]>
> >> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >> To: nutch-user@lucene.apache.org
> >> Date: Tuesday, December 2, 2008, 5:47 PM
> >>
> >> Patch has been posted to JIRA for the DomainURLFilter plugin.
> >>
> >> https://issues.apache.org/jira/browse/NUTCH-668
> >>
> >> Dennis
> >>
> >> Dennis Kubes wrote:
> >>> Trying to get a patch posted tonight. Will probably be in the 1.0
> >>> release, yes.
> >>>
> >>> Dennis
> >>>
> >>> John Martyniak wrote:
> >>>> That sounds like a good feature. Will this be in the 1.0 release?
> >>>>
> >>>> -John
> >>>>
> >>>> On Dec 2, 2008, at 5:17 PM, Dennis Kubes wrote:
> >>>>
> >>>>> John Martyniak wrote:
> >>>>>> That will be awesome. Will there be a limit to the number of
> >>>>>> domains that can be included?
> >>>>>
> >>>>> Only on what can be stored in memory in a set. So technical limit
> >>>>> yes; practical limit, guessing a few million domains.
> >>>>>
> >>>>> Dennis
> >>>>>
> >>>>>> -John
> >>>>>> On Dec 2, 2008, at 3:27 PM, Dennis Kubes wrote:
> >>>>>>> I am in the process of writing a domain-urlfilter. It will
> >>>>>>> allow fetching only from a list of top level domains. Should
> >>>>>>> have a patch out shortly. Hopefully that will help you and
> >>>>>>> others who are wanting to verticalize Nutch.
> >>>>>>>
> >>>>>>> Dennis
> >>>>>>>
> >>>>>>> ML mail wrote:
> >>>>>>>> Dear Dennis
> >>>>>>>> Many thanks for your quick response. Now everything is clear
> >>>>>>>> and I understand why it didn't work...
> >>>>>>>> I will still use the urlfilter-regex plugin, as I would like
> >>>>>>>> to crawl only domains from a single top level domain, but as
> >>>>>>>> suggested I have added the urlfilter-suffix plugin to avoid
> >>>>>>>> indexing javascript pages. In the past I had already
> >>>>>>>> deactivated the parse-js plugin.
> >>>>>>>> So I am now looking forward to the next crawls being free of
> >>>>>>>> stupid file formats like js ;-)
> >>>>>>>> Greetings
> >>>>>>>>
> >>>>>>>> --- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>>>>> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >>>>>>>>> To: nutch-user@lucene.apache.org
> >>>>>>>>> Date: Tuesday, December 2, 2008, 8:50 AM
> >>>>>>>>>
> >>>>>>>>> ML mail wrote:
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> I would definitely like not to index any javascript pages,
> >>>>>>>>>> meaning any pages ending with ".js". For this purpose I
> >>>>>>>>>> simply edited the crawl-urlfilter.txt file and changed the
> >>>>>>>>>> default list of suffixes not to be parsed, adding the .js
> >>>>>>>>>> extension so that it now looks like this:
> >>>>>>>>>>
> >>>>>>>>>> # skip image and other suffixes we can't yet parse
> >>>>>>>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
> >>>>>>>>>
> >>>>>>>>> The easiest way IMO is to use the prefix and suffix
> >>>>>>>>> urlfilters instead of the regex urlfilter. Change
> >>>>>>>>> plugin.includes and replace urlfilter-regex with
> >>>>>>>>> urlfilter-(prefix|suffix). Then in the suffix-urlfilter.txt
> >>>>>>>>> file add .js under .css in web formats. Also change
> >>>>>>>>> plugin.includes from parse-(text|html|js) to parse-(text|html).
> >>>>>>>>>
> >>>>>>>>>> Unfortunately I noticed that javascript pages are still
> >>>>>>>>>> getting indexed. So what does this mean exactly? Is
> >>>>>>>>>> crawl-urlfilter.txt not working? Did I miss something?
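Beyond the config-file mix-up Dennis explains below this quote, one common way a `$`-anchored suffix rule like the one above can miss a URL is a query string after the extension. A quick check of that regex behavior (Python here just to illustrate; the pattern is abbreviated to the relevant suffixes):

```python
import re

# Abbreviated form of the suffix rule from crawl-urlfilter.txt.
pattern = re.compile(r"\.(gif|css|js)$")

# Matches a plain .js URL (the rule would exclude it)...
print(bool(pattern.search("http://example.com/script.js")))

# ...but not one with a query string, so such a URL passes the filter.
print(bool(pattern.search("http://example.com/script.js?v=2")))
```

The stock filter files usually pair such suffix rules with a separate rule rejecting URLs containing `?`, which closes this gap.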
> >>>>>>>>>> I was also wondering, what is the difference between these
> >>>>>>>>>> two files:
> >>>>>>>>>> crawl-urlfilter.txt
> >>>>>>>>>> regex-urlfilter.txt
> >>>>>>>>>> ?
> >>>>>>>>>
> >>>>>>>>> The crawl-urlfilter.txt file is used by the crawl command.
> >>>>>>>>> The regex, suffix, prefix, and other urlfilter files and
> >>>>>>>>> plugins are used when calling commands manually in various
> >>>>>>>>> tools.
> >>>>>>>>>
> >>>>>>>>> Dennis
> >>>>>>>>>
> >>>>>>>>>> Many thanks
> >>>>>>>>>> Regards
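Since the crawl command reads crawl-urlfilter.txt, the single-TLD restriction discussed in this thread can also be expressed there with regex rules. A hypothetical fragment (rules are evaluated top to bottom and the first match wins; `+` accepts, `-` rejects):

```
# crawl-urlfilter.txt (hypothetical fragment): crawl only the .be TLD
# skip file suffixes we don't want indexed
-\.(gif|jpg|png|css|js)$
# accept only hosts whose name ends in .be
+^http://([a-z0-9-]+\.)*[a-z0-9-]+\.be/
# reject everything else
-.
```

The final catch-all `-.` is what makes the filter a whitelist; without it, URLs matching no rule would be accepted by default in most urlfilter setups.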