I am in the process of building a crawler. I built one before, but I am not 
satisfied with its performance. My crawler uses the following algorithm 
to index pages.

1 - Get the URL to crawl.
2 - If the URL to crawl is already in the index, look up the stored last 
modified date for the URL.
  2a - Extract the last modified date from the HTTP header for the URL and 
compare it with the stored last modified date.
  2b - If the two dates match, stop the crawl.
3 - Extract the host name from the URL to crawl and request its 
'/robots.txt'.
4 - Go through the entries and check whether the crawler is allowed to crawl 
the given URL; if not, stop the crawl.
5 - Make an HTTP request to the host for the given URL.
6 - Extract the title, meta-description, meta-keywords and meta-refresh.
7 - If the meta-refresh delay is less than or equal to 10 seconds, crawl the 
URL given in the meta-refresh.
8 - Extract all links from the HTML file.
9 - Remove all HTML code from the HTML file received.
10 - Add the URL, title, meta-description, meta-keywords, modified date and 
non-HTML part of the HTML file to the index.
11 - Crawl the links found on the page.
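
The extraction steps above (title, meta tags, links and the non-HTML text) 
could be sketched roughly as follows. This is only a minimal sketch in Python 
using the standard html.parser module, not my actual crawler; the names 
PageExtractor, extract_page and parse_refresh are made up for illustration, 
and it assumes the page has already been downloaded into a string.

```python
import re
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects title, meta tags, links and visible text from one HTML page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}         # lower-cased meta name or http-equiv -> content
        self.links = []        # raw href values, in document order
        self.text_parts = []   # visible text with the markup removed
        self._in_title = False
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            key = (attrs.get("name") or attrs.get("http-equiv") or "").lower()
            if key:
                self.meta[key] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_refresh(content):
    """Split a meta-refresh value such as '5; url=/new.html' into (seconds, url)."""
    match = re.match(r"\s*(\d+)\s*(?:[;,]\s*url\s*=\s*['\"]?([^'\"]+))?",
                     content, re.I)
    if not match:
        return None, None
    target = match.group(2)
    return int(match.group(1)), target.strip() if target else None


def extract_page(html):
    """Return the fields that would go into the index for one page."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "meta": parser.meta,
        "links": parser.links,
        "text": " ".join(parser.text_parts),
    }
```

The links come back exactly as written in the page, so relative ones would 
still need to be resolved against the page URL (for example with 
urllib.parse.urljoin) before they are crawled.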

Hope the above was simple enough to understand. Now I have a few questions.

1] Some crawler HTTP requests that I have seen use the 
If-Modified-Since header to ask whether or not the URL has been modified 
since a given date. Should I drop parts 2a and 2b and add this header to the 
HTTP request of part 5 instead? If so, would the data received after the 
request be blank if the modified dates match, or what exactly would I get?
2] Suppose the URL to be indexed is 
'http://bill.microsoft.com/index.html'. Is the robots.txt that I should be 
requesting 'http://bill.microsoft.com/robots.txt', or is it at 
'http://microsoft.com/robots.txt'? And if it is at 
'http://microsoft.com/robots.txt', should I simply remove whatever 
is before the first period in the host name, or is there a Perl function that 
can do this for me?
3] Some crawler HTTP requests that I have seen use the From header to 
give the email address of the crawler's creators for contact purposes. 
Do you think that it is important to use this header?
4] The guidelines for building a crawler that I read state that the crawler 
should put the page it came from in the HTTP Referer header. This would help 
webmasters find broken links on their website when the crawler hits a 404 
page. None of the crawler HTTP requests that I have seen use this field, so 
do you think it is important to do so or not?
5] In part 7, I stated that if the meta-refresh is less than or equal to 10, 
then I index the page given in the meta-refresh tag. Should I increase the 
number to 20 or 30, or is 10 enough, or should I ignore the 
meta-refresh altogether and index the page anyway?
6] I would like to make a good index, so I want to know which parts of an 
HTML page I should index. As seen above, I index the URL, title, 
meta-description, meta-keywords, modified date and non-HTML part of the HTML 
file. Is this what I should index, or is there more that I should index?
7] With regard to the index, should I store the URL, title, 
meta-description, meta-keywords and modified date in the database and have 
the non-HTML part of the HTML file stored in a file in a directory which is 
referenced from the database, or should I store the non-HTML part of the HTML 
file in the database as well?
8] Google, AltaVista and all the other search engines store each file they 
index in their index. Of course, I don't have gigabytes of space to store all 
of these files, so I am only storing the non-HTML portion of the file. Do you 
think that I could skip storing the non-HTML portion of the HTML 
file and still have an acceptable searchable index?
9] I have heard that some crawlers index the alt attribute of the img tag. 
Is it important to do so?
10] Google doesn't show the meta-description underneath the page title on 
the search page; does it index the meta-description? I heard that Lycos or 
some other search engine ignores the meta-description, so is it important to 
index it?
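
To make questions 1 to 4 concrete, here is a rough sketch in Python (the 
same ideas should carry over to Perl). It rests on things I believe to be 
true rather than on anything confirmed in this thread: that robots.txt is 
looked up at the root of the exact host being crawled, and that a server 
which honours If-Modified-Since answers with "304 Not Modified" and no body 
when the page is unchanged. The function names, the "ExampleCrawler/0.1" 
agent string and the contact address are all made up for illustration, and 
nothing here touches the network.

```python
from email.utils import formatdate
from urllib.parse import urlsplit
from urllib.request import Request
from urllib.robotparser import RobotFileParser


def robots_url(page_url):
    """Return the robots.txt URL for the exact host that serves page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(robots_lines, user_agent, page_url):
    """Check page_url against robots.txt lines that were already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, page_url)


def build_request(page_url, stored_mtime=None, contact=None, referer=None):
    """Build (but do not send) a request carrying the optional crawler headers.

    stored_mtime is the last modified time from the index, in seconds since
    the epoch; it is sent as If-Modified-Since so the server can decide
    whether the page changed.
    """
    headers = {"User-Agent": "ExampleCrawler/0.1"}  # hypothetical crawler name
    if stored_mtime is not None:
        headers["If-Modified-Since"] = formatdate(stored_mtime, usegmt=True)
    if contact:
        headers["From"] = contact        # question 3: contact address
    if referer:
        headers["Referer"] = referer     # question 4: page the link came from
    return Request(page_url, headers=headers)
```

For the URL in question 2, robots_url() gives 
'http://bill.microsoft.com/robots.txt'. When the request is actually sent 
with urllib.request.urlopen, I would expect an unchanged page to show up as 
an HTTPError with code 304, which the crawler can treat as "stop the crawl".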

After finishing the crawler I am going to be working on the search engine 
part. Are there any algorithms available that could help me sort 
the search results according to relevance? My first search engine simply 
went sequentially through the index and listed the records which had the 
keywords in the title first, followed by the ones which had the keywords in 
the description. I don't really understand how to determine record relevance, 
but I know I can't use the link popularity feature because I am not indexing 
the links found on pages, and I don't think that it would make much of a 
difference for my small index.
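
As a starting point for discussion, one approach I have seen described is to 
count how often each keyword occurs in each field, weight the fields 
differently, and give rarer keywords more weight (a TF-IDF style score). 
The sketch below, in Python, is only my guess at how that could look: the 
field names, the weights 3/2/1 and the record layout (a dict with 'url', 
'title', 'description' and 'body') are assumptions, not something taken from 
an existing search engine.

```python
import math
import re
from collections import Counter

# Assumed weights: a match in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "description": 2.0, "body": 1.0}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(records, query):
    """Return (score, url) pairs for records matching the query, best first."""
    terms = tokenize(query)
    total = len(records)
    # Per-record, per-field word counts.
    counts = [
        {field: Counter(tokenize(rec.get(field, ""))) for field in FIELD_WEIGHTS}
        for rec in records
    ]
    # Document frequency: in how many records does each term appear at all?
    doc_freq = {
        term: sum(1 for c in counts if any(c[f][term] for f in FIELD_WEIGHTS))
        for term in terms
    }
    results = []
    for rec, c in zip(records, counts):
        score = 0.0
        for term in terms:
            if doc_freq[term] == 0:
                continue  # term occurs nowhere in the index
            idf = math.log(1 + total / doc_freq[term])  # rarer term, larger idf
            score += idf * sum(w * c[f][term] for f, w in FIELD_WEIGHTS.items())
        if score > 0:
            results.append((score, rec["url"]))
    results.sort(key=lambda pair: -pair[0])
    return results
```

With this scheme a page that has the keyword in both its title and its body 
ranks above a page that only mentions it once in the body, which is roughly 
what my first search engine did by hand, but with a number attached.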

Sorry for the lengthy email, but, just as you all once were, we are 
beginners and need to learn from our elders.


Yousuf Philips
