Hi Fellow Crawler Creators,

 

I am a PhD student at George Mason University, studying Machine Learning and Data Mining under Dr. R. Michalski.  My long-range goal is to transform text into useful knowledge (a semantic net).  I am also interested in inference (how to derive facts that are not stated overtly but can be implied through various inference processes).  Dr. Michalski's "Theory of Inferential Learning" defines 56 knowledge transmutations: operations in a brain model that derive new information from old information using deduction, induction, abduction, analogy, generalization, specialization, abstraction, concretion, agglomeration, dissection, etc.  (If you are interested in the details, see http://www.mli.gmu.edu/papers/2003-2004/mli03-4.pdf)

 

I have created a really neat crawler.  It is written in C++ COM and uses WinInet calls.  Because it is written as COM components it can run on a single computer or be spread over several of them (a rough DCOM sketch follows the list below).  Most of the crawler's components can be instantiated multiple times or run concurrently for additional performance.  Basically the system has two major components:

1.)  The Crawler itself (which transforms a few starter URLs into a database of millions of URLs to be crawled and millions of stored web pages)

2.)  The Analyzer, which analyzes the stored web pages for content using Textual Data Mining concepts
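
To give a feel for the distribution side, here is a rough sketch of the kind of DCOM activation involved.  The CLSID/IID names are just placeholders (not my real ones) and error handling is stripped down:

// Rough sketch of local vs. remote (DCOM) activation of one crawler component.
// CLSID_CrawlerFetcher / IID_ICrawlerFetcher are placeholders for whatever the
// real components register; assumes CoInitializeEx() has already been called.
#include <windows.h>
#include <objbase.h>

extern const CLSID CLSID_CrawlerFetcher;   // placeholder
extern const IID   IID_ICrawlerFetcher;    // placeholder

IUnknown* CreateFetcher(const wchar_t* remoteHost)   // NULL = this machine
{
    if (remoteHost == NULL) {
        // Local activation: in-proc or local server, whatever is registered.
        IUnknown* p = NULL;
        if (SUCCEEDED(CoCreateInstance(CLSID_CrawlerFetcher, NULL,
                                       CLSCTX_SERVER, IID_ICrawlerFetcher,
                                       (void**)&p)))
            return p;
        return NULL;
    }

    // Remote activation on another machine in the crawl cluster.
    COSERVERINFO server = { 0, const_cast<wchar_t*>(remoteHost), NULL, 0 };
    MULTI_QI     mqi    = { &IID_ICrawlerFetcher, NULL, S_OK };
    if (SUCCEEDED(CoCreateInstanceEx(CLSID_CrawlerFetcher, NULL,
                                     CLSCTX_REMOTE_SERVER, &server, 1, &mqi)) &&
        SUCCEEDED(mqi.hr))
        return mqi.pItf;
    return NULL;
}

Whether a component ends up in-process, in a local server, or on another box is then just a matter of what gets passed in.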

 

The major components of the crawler are:

1)  Module to pull pages down from the web (it also extracts header info, i.e., MIME type, size, error codes, etc.; a WinInet sketch of this follows the list)

2)  Module to pull pages in from the cache (if a page is already on disk it is not pulled again unless it has changed)

3)  Module to extract links from each page

4)  Module to filter links by MIME type or user-defined criteria

5)  Module to strip out all HTML before passing the text on to the analysis system
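
For the download module, the core of it looks something along these lines (simplified; the agent string and flags are placeholders and most error handling is omitted):

// Rough sketch of the page-download step (agent string and flags are
// placeholders; most error handling omitted).  Links against wininet.lib.
#include <windows.h>
#include <wininet.h>
#include <string>

bool FetchPage(const std::string& url, std::string& body,
               DWORD& statusCode, std::string& mimeType)
{
    HINTERNET hNet = InternetOpenA("ExampleCrawler/0.1",        // placeholder agent
                                   INTERNET_OPEN_TYPE_PRECONFIG,
                                   NULL, NULL, 0);
    if (!hNet) return false;

    HINTERNET hUrl = InternetOpenUrlA(hNet, url.c_str(), NULL, 0,
                                      INTERNET_FLAG_NO_CACHE_WRITE, 0);
    if (!hUrl) { InternetCloseHandle(hNet); return false; }

    // Header info: numeric HTTP status and the Content-Type (MIME) header.
    DWORD len = sizeof(statusCode);
    HttpQueryInfoA(hUrl, HTTP_QUERY_STATUS_CODE | HTTP_QUERY_FLAG_NUMBER,
                   &statusCode, &len, NULL);

    char contentType[128] = "";
    len = sizeof(contentType);
    if (HttpQueryInfoA(hUrl, HTTP_QUERY_CONTENT_TYPE, contentType, &len, NULL))
        mimeType = contentType;

    // Body: read until InternetReadFile reports zero bytes.
    char buf[8192];
    DWORD bytesRead = 0;
    body.clear();
    while (InternetReadFile(hUrl, buf, sizeof(buf), &bytesRead) && bytesRead > 0)
        body.append(buf, bytesRead);

    InternetCloseHandle(hUrl);
    InternetCloseHandle(hNet);
    return true;
}

The status code and MIME type can be queried before the body is read, so unwanted types can be rejected early.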

 

I have some logic in my crawler that tries to spread the crawl across sites by choosing the next links to visit at random, so that I avoid hitting any particular site too often.
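
Roughly, the selection works along these lines.  This is a simplified stand-in rather than my actual code; the frontier structure, the retry cap, and the politeness interval are all placeholders:

// Simplified stand-in for the random "spread out the sites" selection: pick a
// pending URL uniformly at random, but skip hosts hit within the last
// minGapSeconds.
#include <cstdlib>
#include <ctime>
#include <map>
#include <string>
#include <vector>

struct PendingUrl { std::string host; std::string url; };

struct Frontier {
    std::vector<PendingUrl> pending;
    std::map<std::string, time_t> lastHit;   // host -> time of last request
    int minGapSeconds;                       // politeness interval

    // Returns an index into pending, or -1 if every candidate host is too recent.
    int PickNext() {
        time_t now = time(NULL);
        for (int tries = 0; tries < 50 && !pending.empty(); ++tries) {
            int i = rand() % (int)pending.size();       // uniform random spread
            const std::string& host = pending[i].host;
            std::map<std::string, time_t>::iterator it = lastHit.find(host);
            if (it == lastHit.end() || now - it->second >= minGapSeconds) {
                lastHit[host] = now;
                return i;
            }
        }
        return -1;   // caller can wait a bit and try again
    }
};

The point is just that a uniform random pick plus a per-host time check keeps any single site from being hammered.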

 

My biggest unsolved problem is that I don't know how to test my system.

 

For example, how do I know if I am pulling in all the pages that I should?

 

How do I know if I am correctly extracting all the links from each page?  (Besides links in HTML pages, there are links in MS Word documents and other document types, some in somewhat different formats.)
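
The only thing I can picture here is a regression test built on the kind of test site I mention below: feed the extractor pages whose links I know in advance and diff the output against the expected set.  Something like the following, where ExtractLinks() stands in for whatever my extraction module really exposes:

// Regression-test sketch for the link extractor: pages whose links are known
// in advance, diffed against what the extractor returns.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
#include <vector>

std::vector<std::string> ExtractLinks(const std::string& pageText);  // module under test

bool CheckExtraction(const std::string& pageText,
                     const std::set<std::string>& expected)
{
    std::vector<std::string> got = ExtractLinks(pageText);
    std::set<std::string> gotSet(got.begin(), got.end());

    std::vector<std::string> missed;   // links the extractor failed to find
    std::set_difference(expected.begin(), expected.end(),
                        gotSet.begin(), gotSet.end(),
                        std::back_inserter(missed));

    std::vector<std::string> extra;    // links it "found" that are not really there
    std::set_difference(gotSet.begin(), gotSet.end(),
                        expected.begin(), expected.end(),
                        std::back_inserter(extra));

    for (size_t i = 0; i < missed.size(); ++i)
        std::cerr << "MISSED: " << missed[i] << "\n";
    for (size_t i = 0; i < extra.size(); ++i)
        std::cerr << "EXTRA:  " << extra[i] << "\n";

    return missed.empty() && extra.empty();
}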

 

How do I know if my random site-selection algorithm is working correctly?
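
About the only check I can picture is a statistical one: run the selector many times over a synthetic frontier and verify that no host gets hit far more often than its share of the frontier would predict.  A crude sketch (PickNextHost() is a stand-in for my real selection routine, and the 2x threshold is arbitrary):

// Crude statistical check of the random selection: flag any host hit far more
// often than its share of the synthetic frontier predicts.
#include <iostream>
#include <map>
#include <string>

std::string PickNextHost();   // the routine under test

bool CheckSpread(int trials, const std::map<std::string, int>& frontierCounts)
{
    std::map<std::string, int> hits;
    for (int i = 0; i < trials; ++i)
        ++hits[PickNextHost()];

    int frontierTotal = 0;
    std::map<std::string, int>::const_iterator it;
    for (it = frontierCounts.begin(); it != frontierCounts.end(); ++it)
        frontierTotal += it->second;

    bool ok = true;
    for (it = frontierCounts.begin(); it != frontierCounts.end(); ++it) {
        double expected = trials * (double)it->second / frontierTotal;
        double observed = hits[it->first];
        if (expected > 0 && observed > 2.0 * expected) {
            std::cerr << it->first << " hit " << observed
                      << " times, expected about " << expected << "\n";
            ok = false;
        }
    }
    return ok;
}

A proper chi-square test would be better, but even this should catch gross bias.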

 

Can any of you point me in the right direction?  Are any of you willing to share how you tested your own system?

 

The only partial solution I have thought of is to create a web site with all varieties of pages on it and see whether I can correctly read and parse them.  What worries me, though, is the things I do not think of.  New formats are coming out all the time.  How can I keep on top of this complex business?
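
For that test site, the harness I have in mind would ship a manifest listing every URL the site contains, run the crawler against it, and report anything that never got stored.  Roughly (StoredUrls() is a stand-in for however my system records what it has pulled down):

// Harness sketch for the controlled test site: the site ships a manifest of
// every URL it contains, and anything the crawler never stored is reported.
#include <fstream>
#include <iostream>
#include <set>
#include <string>

std::set<std::string> StoredUrls();   // URLs the crawler actually stored

int CheckCoverage(const char* manifestPath)   // one known URL per line
{
    std::set<std::string> crawled = StoredUrls();
    std::ifstream manifest(manifestPath);
    std::string url;
    int missing = 0;

    while (std::getline(manifest, url)) {
        if (!url.empty() && crawled.find(url) == crawled.end()) {
            std::cerr << "NOT CRAWLED: " << url << "\n";
            ++missing;
        }
    }
    return missing;   // zero means the whole test site was covered
}

That at least answers the coverage question for pages I know about; it does nothing for formats I have not anticipated, which is really what I am asking.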

 

Sincerely,

 

Norman Eugene White

 
