I'm thinking I do not want any adult content at all in my system.  There's
more than enough of that out there on other engines.  Preferably I would
like to use regex to filter the content while crawling and then use an
additional filter combination within the search server itself to ensure
that no adult content gets through.

Now, will adding these 1000+ terms to the regex slow down the parsing of
the urls excessively?
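
What I have in mind for regex-urlfilter.txt is something like the sketch
below (untested; the terms are just placeholders and the exact behaviour
should be checked against the urlfilter-regex plugin).  One combined,
case-insensitive alternation in a single rule should be far cheaper than
1000 separate rules, since every rule is its own compiled pattern that
gets tested against each url:

  # reject urls containing adult trigger terms -- extend the alternation
  # with the rest of the trigger-term list
  -(?i)(porn|xxx|hardcore|escort|fetish)
  # accept everything that was not rejected above
  +.

Of course this only catches terms that show up in the url itself; anything
based on page content would have to happen at parse/index time or in the
search server.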

I think when this next crawl is finished I'll see about trying the
filtering..

Thanks..

Axel..

-----Original Message-----
From: Höchstötter Nadine [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, October 21, 2008 2:20 AM
To: [email protected]
Subject: RE: Extensive web crawl

Hi, sorry, I think my other mail was too large. I have a regexp for such
trigger terms; I used it to filter "very" clean queries. But I think it is
better to crawl http://www.sex-lexis.com/ and extract the categories as
trigger terms.
Cheers, Nadine.

-----Original Message-----
From: Webmaster [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 21, 2008 12:09 AM
To: [email protected]
Subject: RE: Extensive web crawl

Hi Otis,

So far so good..

I have things debugged and have 2TB of disk space to use now for building
the index.  Currently nutch is fetching about 12 million urls a day and the
index is at about 250 million urls.  Each fetch round takes about 2 hours
for 1-1.5 million urls.  It will be a few more weeks yet before I hit the 1
billion mark.  So far I have one searchable index built locally for testing
that contains 100m urls and it seems to work quite well and reasonably fast
on a single-processor P4 2.7GHz with 1.5GB of RAM.

In 2 weeks I'll be halfway there and have a 500m url index.  The next round
of fetching should be done tomorrow with another 100m links.

I must say I am very impressed with Hadoop and the ease of use factor.  I
never thought it would be so easy to add and remove nodes at will.  It makes
things pretty simple for upgrading and changing things around.  I can add
nodes on the fly while processes are running and it just keeps ticking
along.
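
For the record, bringing a new node in is basically just this (from memory,
so double-check against the docs for your Hadoop version):

  # on the new box, with the same conf/ copied over from the master:
  bin/hadoop-daemon.sh start datanode
  bin/hadoop-daemon.sh start tasktracker

  # then add its hostname to conf/slaves on the master so the normal
  # start/stop scripts pick it up next time

Removing one cleanly is the reverse: list the host in the file pointed to
by dfs.hosts.exclude and run bin/hadoop dfsadmin -refreshNodes to
decommission it.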

I gave up on using the DMOZ for injections though.  Too many bad urls and
dead sites in the index cause all kinds of errors.  I have since just
written a PHP script that I can point at a few sites here and there; it
scrapes all the urls into a nice list that I can cut and paste into a
bigger list.  By modifying the crawl-urlfilter.txt I have been able to get
most crawls to return 10m+ urls from a list of 100 injected sites crawling
to a depth of 40 (sites with lots of outlinks).
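
The open filter is basically the stock crawl-urlfilter.txt with the
MY.DOMAIN.NAME line dropped and the final reject-all flipped to an accept
(sketch from memory, suffix list trimmed):

  # skip file:, ftp: and mailto: urls
  -^(file|ftp|mailto):
  # skip image and other binary suffixes
  -\.(gif|jpg|png|css|js|exe|zip|mpg|mp3|pdf)$
  # skip urls containing certain characters as probable queries etc.
  -[?*!@=]
  # accept everything else (the stock file limits the crawl with
  # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ and then rejects the rest with -.)
  +.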

When this next round of fetching is done I'm going to inject 10m valid urls
from my fresh fetch lists and crawl to a depth of 10 to see what happens.
My guess is it will return about 200m urls, which should be an adequate
stress test of my sad cluster of outdated machines :)
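
The plan is just the usual inject / generate / fetch / updatedb cycle, with
"depth 10" meaning ten rounds of it (directory names below are only
placeholders):

  bin/nutch inject crawl/crawldb fresh_urls
  # one round, repeated 10 times:
  bin/nutch generate crawl/crawldb crawl/segments
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s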

I am however still looking into filtering the results for adult content
before I move them off the Hadoop cluster and put them on the distributed
search nodes' local file systems.

If all goes well I might put the sandbox up live for beta testing..

Axel..

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Monday, October 20, 2008 8:38 AM
To: [email protected]
Subject: Re: Extensive web crawl

Axel, how did this go?  I'd love to know if you got to 1B.



Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Webmaster <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Tuesday, October 7, 2008 1:13:29 AM
> Subject: Extensive web crawl
>
> Ok..
>
> So I want to index the web..  All of it..
>
> Any thoughts on how to automate this so I can just point the spider off on
> its merry way and have it return 20 billion pages?
>
> So far I've been injecting random portions of the DMOZ mixed with other
> urls like directory.yahoo.com and wiki.org.  I was hoping this would give
> me a good return with an unrestricted URL-filter where MY.DOMAIN.COM was
> replaced with *.* -- perhaps this is my error and that should be left as
> is and the last line should be +. instead of -. ?
>
> Anyhow after injecting 2000 urls and a few of my own I still only get back
> minimal results in the range of 500 to 600k urls.
>
> Right now I have a new crawl going with 1 million injected urls from the
> DMOZ; I'm thinking that this should return a 20 million page index at
> least..  No?
>
> Anyhow..  I have more HD space on the way and would like to get the
> indexing up to 1 billion by the end of the week..
>
> Any examples on how to set up the url-filter.txt and regex-filter.txt
> would be helpful..
>
> Thanks..
>
> Axel..


