This explains everything ;-) Well, I am at least happy if this testing helps you to develop Nutch further, so I will be patient. Looking forward to testing this patch next week.
Greetings

--- On Sat, 12/6/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> To: nutch-user@lucene.apache.org
> Date: Saturday, December 6, 2008, 2:46 PM
>
> ML mail wrote:
> > Dear Dennis
> >
> > I have got a workaround for the memory problem, using a normal Linux
> > server and not a VPS anymore, but I am now encountering a new problem:
> > starting with an initial crawl, no pages get crawled, as you can see here:
> >
> > crawl started in: /mnt/crawl_new
> > rootUrlDir = /tmp/url
> > threads = 70
> > depth = 1
> > topN = 100000
> > Injector: starting
> > Injector: crawlDb: /mnt/crawl_new/crawldb
> > Injector: urlDir: /tmp/url
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: /mnt/crawl_new/segments/20081206232112
> > Generator: filtering: true
> > Generator: topN: 100000
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=0 - no more URLs to fetch.
> > No URLs to fetch - check your seed list and URL filters.
> >
> > This time the urlfilter-domain plugin got loaded successfully, and I have
> > created a file called domain-urlfilter.txt in the conf directory containing
> > just the two characters "be", in order to index only domains ending with
> > the TLD be. The file which I use in <urlDir> contains around 80000 urls,
> > all ending in .be.
>
> The patch you have will only filter either by hostname ("www.apache.org")
> or by domain name ("apache.org"). Hostname will only allow urls with those
> specific hostnames. Domain name will allow any urls with that domain,
> including subdomains, so apache.org would also handle lucene.apache.org.
> The current patch doesn't handle ".com" for all .com domains, but we are
> working on improving it to be able to do so. Should have that patch out
> Monday or Tuesday.
>
> Thanks for testing all of this stuff out. I know the initial problems
> can be slow going.
>
> Dennis
>
> > So I don't really understand my problem here. Could it be that
> > urlfilter-domain doesn't work for a TLD?
> >
> > Thanks
> > Regards
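Based on Dennis's explanation above, the domain-urlfilter.txt for the current patch has to hold hostnames or domain names, one per line, rather than a bare TLD. A sketch of a valid file (the entries below are illustrative, not taken from this thread):

    # conf/domain-urlfilter.txt (illustrative)
    # a hostname entry allows only urls on that exact host:
    www.apache.org
    # a domain entry allows the domain plus all of its subdomains,
    # e.g. lucene.apache.org:
    apache.org
    # a bare TLD entry such as "be" is not yet understood by this
    # version of the patch, which is why the Generator selected
    # 0 records for fetching

Until the promised TLD support lands, a list of concrete .be domains would be needed instead.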
> > --- On Fri, 12/5/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >
> >> From: Dennis Kubes <[EMAIL PROTECTED]>
> >> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >> To: nutch-user@lucene.apache.org
> >> Date: Friday, December 5, 2008, 11:10 AM
> >>
> >> Yeah, that is what I was mentioning before. The SVN now contains the
> >> URLUtil patch. Patch everything except that one file, because it has
> >> already been updated with recent commits.
> >>
> >> Dennis
> >>
> >> ML mail wrote:
> >>> As recommended, I did an SVN checkout of Nutch trunk and patched it
> >>> with your patch. Now I wanted to build Nutch using ant, but it gets
> >>> stuck, as you can see here in the output of ant:
> >>>
> >>> Buildfile: build.xml
> >>>
> >>> init:
> >>>   [unjar] Expanding: /opt/nutch/lib/hadoop-0.19.0-core.jar into /opt/nutch/build/hadoop
> >>>   [untar] Expanding: /opt/nutch/build/hadoop/bin.tgz into /opt/nutch/bin
> >>>   [unjar] Expanding: /opt/nutch/lib/hadoop-0.19.0-core.jar into /opt/nutch/build
> >>>
> >>> compile-core:
> >>>   [javac] Compiling 207 source files to /opt/nutch/build/classes
> >>>   [javac] /opt/nutch/src/java/org/apache/nutch/util/URLUtil.java:343: getHost(java.lang.String) is already defined in org.apache.nutch.util.URLUtil
> >>>   [javac]   public static String getHost(String url) {
> >>>   [javac]                        ^
> >>>   [javac] /opt/nutch/src/java/org/apache/nutch/util/URLUtil.java:360: getPage(java.lang.String) is already defined in org.apache.nutch.util.URLUtil
> >>>   [javac]   public static String getPage(String url) {
> >>>   [javac]                        ^
> >>>   [javac] Note: Some input files use or override a deprecated API.
> >>>   [javac] Note: Recompile with -Xlint:deprecation for details.
> >>>   [javac] Note: Some input files use unchecked or unsafe operations.
> >>>   [javac] Note: Recompile with -Xlint:unchecked for details.
> >>>   [javac] 2 errors
> >>>
> >>> BUILD FAILED
> >>> /opt/nutch/build.xml:107: Compile failed; see the compiler error output for details.
> >>>
> >>> Total time: 6 seconds
> >>>
> >>> I am using jdk1.6.0_10. Do you know what's wrong here? Somehow it
> >>> doesn't like the new URLUtil code or so...
> >>>
> >>> Thanks
> >>> Regards
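The two "already defined" errors are exactly the situation Dennis describes: the patch re-adds getHost() and getPage() to a URLUtil.java that trunk already contains, so javac sees each method twice. One way to follow his advice to patch everything except that one file is to apply the patch and then revert that single file; the patch file name below is hypothetical, and the trunk URL is the one Nutch used at the time:

    svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch-trunk
    cd nutch-trunk
    patch -p0 < NUTCH-668.patch
    # trunk's URLUtil.java already carries these changes, so drop the
    # duplicate hunks by restoring the pristine file:
    svn revert src/java/org/apache/nutch/util/URLUtil.java
    ant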
> >>> --- On Thu, 12/4/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >>>> To: nutch-user@lucene.apache.org
> >>>> Date: Thursday, December 4, 2008, 3:27 PM
> >>>>
> >>>> idk, worked fine for me. Try a fresh svn pull and then do a small
> >>>> test run, see if that works for you, just for comparison.
> >>>>
> >>>> Dennis
> >>>>
> >>>> ML mail wrote:
> >>>>> This is great news about the 1.0 release!!
> >>>>>
> >>>>> So in the meantime I have tried what you said and copied over the
> >>>>> URLUtil.java file from SVN to my 0.9 installation and patched the
> >>>>> whole thing. The patch now went through fine, but still I can't see
> >>>>> any sign of the urlfilter-domain plugin getting loaded. hadoop.log
> >>>>> has no entries for the urlfilter-domain plugin... Do I need to
> >>>>> compile something or so?
> >>>>>
> >>>>> Regards
> >>>>>
> >>>>> --- On Thu, 12/4/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >>>>>> To: nutch-user@lucene.apache.org
> >>>>>> Date: Thursday, December 4, 2008, 1:44 PM
> >>>>>>
> >>>>>> Yup. I guess it would only work with the current SVN. We should
> >>>>>> have a stable 1.0 release out in the next couple of weeks that will
> >>>>>> hopefully include this patch. In the meantime you could hack it a
> >>>>>> little and copy the URLUtil file from svn into your current build,
> >>>>>> then apply the patch. (And actually I have been doing some commits,
> >>>>>> so if you just pull URLUtil.java from svn it will have everything
> >>>>>> you need.)
> >>>>>>
> >>>>>> Dennis
> >>>>>>
> >>>>>> ML mail wrote:
> >>>>>>> Thanks for the quick new patch. I have now tried it out, but patch
> >>>>>>> fails on src/java/org/apache/nutch/util/URLUtil.java, because it
> >>>>>>> doesn't exist, I suppose. I am using the stable Nutch 0.9 release,
> >>>>>>> so I guess your patch only works with an SVN release?
> >>>>>>>
> >>>>>>> We would prefer to use a stable release, but if there are no other
> >>>>>>> choices let me know and I will install an SVN release of Nutch and
> >>>>>>> apply this patch.
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Regards
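Spelled out as commands, the 0.9 hack Dennis suggests is just: fetch the current URLUtil.java out of trunk, then apply the patch on top of the stable release. The trunk URL and patch file name below are assumptions for illustration, not quoted from the thread:

    cd nutch-0.9
    # pull just the one file from trunk, overwriting the 0.9 copy:
    svn export --force \
        http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java \
        src/java/org/apache/nutch/util/URLUtil.java
    # then apply the DomainURLFilter patch from NUTCH-668:
    gpatch -p0 < NUTCH-668.patch   # plain "patch -p0" works as well
    ant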
> >>>>>>> --- On Thu, 12/4/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>>>
> >>>>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>>>> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >>>>>>>> To: nutch-user@lucene.apache.org
> >>>>>>>> Date: Thursday, December 4, 2008, 7:15 AM
> >>>>>>>>
> >>>>>>>> Try it again with the latest patch. You will need to build with
> >>>>>>>> Java 6 if you are not doing so. My plugin.includes looks like
> >>>>>>>> urlfilter-(domain|suffix|prefix)|... I did a clean build from svn,
> >>>>>>>> applied the 2nd patch and did a multistep fetch of apache.org.
> >>>>>>>> Looks like it works well for me. Let me know if you continue to
> >>>>>>>> have problems.
> >>>>>>>>
> >>>>>>>> Dennis
> >>>>>>>>
> >>>>>>>> ML mail wrote:
> >>>>>>>>> Dear Dennis,
> >>>>>>>>>
> >>>>>>>>> I have now applied this patch to my Nutch 0.9 (stable)
> >>>>>>>>> installation using "gpatch -p0 < patchfile" and changed the
> >>>>>>>>> plugin.includes parameter to include "urlfilter-(domain|suffix)"
> >>>>>>>>> in the nutch-default.xml file.
> >>>>>>>>>
> >>>>>>>>> I gave it a try using a fresh new test crawl and index, but it
> >>>>>>>>> somehow still indexes other top level domains. The
> >>>>>>>>> domain-urlfilter.txt file, which I have located in the conf dir
> >>>>>>>>> of Nutch, contains only "be", in order to index only the TLD be.
> >>>>>>>>>
> >>>>>>>>> After checking the hadoop.log file and grepping for
> >>>>>>>>> urlfilter-domain, I noticed that the plugin doesn't get loaded,
> >>>>>>>>> as it never appears in the logfile. So I guess my problem is
> >>>>>>>>> that it doesn't even load the plugin.
> >>>>>>>>>
> >>>>>>>>> Based on my description, did I miss something? Or do I need to
> >>>>>>>>> do something else to get it working?
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> Regards
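For anyone reproducing this: plugin.includes is an XML property, and overriding it in conf/nutch-site.xml is safer than editing nutch-default.xml, which gets overwritten on upgrade. The value below is a sketch assembled from the pieces mentioned in this thread (urlfilter-regex swapped for the domain/suffix/prefix filters, parse-js dropped), not Dennis's complete list:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(domain|suffix|prefix)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
      <!-- urlfilter-regex replaced by urlfilter-(domain|suffix|prefix);
           parse-js removed so javascript pages are no longer parsed -->
    </property>

A quick way to confirm the plugin was picked up is to grep hadoop.log for urlfilter-domain after a test crawl, as done above.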
> >>>>>>>>> --- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>>>>>
> >>>>>>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>>>>>> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >>>>>>>>>> To: nutch-user@lucene.apache.org
> >>>>>>>>>> Date: Tuesday, December 2, 2008, 5:47 PM
> >>>>>>>>>>
> >>>>>>>>>> Patch has been posted to JIRA for the DomainURLFilter plugin.
> >>>>>>>>>>
> >>>>>>>>>> https://issues.apache.org/jira/browse/NUTCH-668
> >>>>>>>>>>
> >>>>>>>>>> Dennis
> >>>>>>>>>>
> >>>>>>>>>> Dennis Kubes wrote:
> >>>>>>>>>>> Trying to get a patch posted tonight. Will probably be in the
> >>>>>>>>>>> 1.0 release, yes.
> >>>>>>>>>>>
> >>>>>>>>>>> Dennis
> >>>>>>>>>>>
> >>>>>>>>>>> John Martyniak wrote:
> >>>>>>>>>>>> That sounds like a good feature. Will this be in the 1.0
> >>>>>>>>>>>> release?
> >>>>>>>>>>>>
> >>>>>>>>>>>> -John
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Dec 2, 2008, at 5:17 PM, Dennis Kubes wrote:
> >>>>>>>>>>>>> John Martyniak wrote:
> >>>>>>>>>>>>>> That will be awesome. Will there be a limit to the number
> >>>>>>>>>>>>>> of domains that can be included?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Only on what can be stored in memory in a set. So technical
> >>>>>>>>>>>>> limit yes; practical limit, guessing a few million domains.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Dennis
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Dec 2, 2008, at 3:27 PM, Dennis Kubes wrote:
> >>>>>>>>>>>>>>> I am in the process of writing a domain-urlfilter. It
> >>>>>>>>>>>>>>> will allow fetching only from a list of top level
> >>>>>>>>>>>>>>> domains. Should have a patch out shortly. Hopefully that
> >>>>>>>>>>>>>>> will help you and others who are wanting to verticalize
> >>>>>>>>>>>>>>> nutch.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Dennis
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> ML mail wrote:
> >>>>>>>>>>>>>>>> Dear Dennis
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Many thanks for your quick response. Now everything is
> >>>>>>>>>>>>>>>> clear and I understand why it didn't work...
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I will still use the urlfilter-regex plugin, as I would
> >>>>>>>>>>>>>>>> like to crawl only domains from a single top level
> >>>>>>>>>>>>>>>> domain, but as suggested I have added the
> >>>>>>>>>>>>>>>> urlfilter-suffix plugin to avoid indexing javascript
> >>>>>>>>>>>>>>>> pages. In the past I had already deactivated the
> >>>>>>>>>>>>>>>> parse-js plugin. So I am now looking forward to the
> >>>>>>>>>>>>>>>> next crawls being freed of stupid file formats like
> >>>>>>>>>>>>>>>> js ;-)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Greetings
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> --- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>>>>>>>>>>>>> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >>>>>>>>>>>>>>>>> To: nutch-user@lucene.apache.org
> >>>>>>>>>>>>>>>>> Date: Tuesday, December 2, 2008, 8:50 AM
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> ML mail wrote:
> >>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I would definitely like not to index any javascript
> >>>>>>>>>>>>>>>>>> pages, meaning any pages ending with ".js". For this
> >>>>>>>>>>>>>>>>>> purpose I simply edited the crawl-urlfilter.txt file
> >>>>>>>>>>>>>>>>>> and added the .js extension to the default list of
> >>>>>>>>>>>>>>>>>> suffixes not to be parsed, so that it looks like
> >>>>>>>>>>>>>>>>>> this now:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> # skip image and other suffixes we can't yet parse
> >>>>>>>>>>>>>>>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> The easiest way IMO is to use the prefix and suffix
> >>>>>>>>>>>>>>>>> urlfilters instead of the regex urlfilter. Change
> >>>>>>>>>>>>>>>>> plugin.includes and replace urlfilter-regex with
> >>>>>>>>>>>>>>>>> urlfilter-(prefix|suffix). Then in the
> >>>>>>>>>>>>>>>>> suffix-urlfilter.txt file add .js under .css in web
> >>>>>>>>>>>>>>>>> formats. Also change plugin.includes from
> >>>>>>>>>>>>>>>>> parse-(text|html|js) to parse-(text|html).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Unfortunately I noticed that javascript pages are
> >>>>>>>>>>>>>>>>>> still getting indexed. So what does this exactly
> >>>>>>>>>>>>>>>>>> mean? Is crawl-urlfilter.txt not working? Did I miss
> >>>>>>>>>>>>>>>>>> something maybe?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I was also wondering what is the difference between
> >>>>>>>>>>>>>>>>>> these two files:
> >>>>>>>>>>>>>>>>>> crawl-urlfilter.txt
> >>>>>>>>>>>>>>>>>> regex-urlfilter.txt
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> The crawl-urlfilter.txt file is used by the crawl
> >>>>>>>>>>>>>>>>> command. The regex, suffix, prefix, and other
> >>>>>>>>>>>>>>>>> urlfilter files and plugins are used when calling
> >>>>>>>>>>>>>>>>> commands manually in various tools.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Dennis
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Many thanks
> >>>>>>>>>>>>>>>>>> Regards
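To close the loop on the suffix-filter route: with urlfilter-(prefix|suffix) in plugin.includes, conf/suffix-urlfilter.txt simply lists one suffix per line, and per Dennis's advice .js goes under .css in the web formats section. The surrounding lines here are illustrative, not the full shipped file:

    # conf/suffix-urlfilter.txt (excerpt, illustrative)
    # web formats
    .css
    .js

And as Dennis notes above, crawl-urlfilter.txt is only consulted by the one-step crawl command, while the regex/prefix/suffix filter files are used by the individual tools, so which file needs the change depends on how the crawl is launched.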