I am in the process of building a crawler. I built one before, but I am not satisfied with its performance. My crawler follows this algorithm to index pages:

1 - Get a URL to crawl.
2 - If the URL is already in the index, extract the stored last-modified date for it.
2a - Extract the last-modified date from the URL's response header and compare it with the stored date.
2b - If the two dates match, stop the crawl.
3 - Extract the host name from the URL and request its '/robots.txt'.
4 - Go through the entries and check whether the crawler is allowed to crawl the given URL; if not, stop the crawl.
5 - Make an HTTP request to the host for the given URL.
6 - Extract the title, meta-description, meta-keywords and meta-refresh.
7 - If the meta-refresh delay is less than or equal to 10, crawl the URL given in the meta-refresh.
8 - Extract all links from the HTML file.
9 - Remove all HTML code from the HTML file received.
10 - Add the URL, title, meta-description, meta-keywords, modified date and non-HTML part of the HTML file to the index.
11 - Crawl the links found on the page.

Hope the above was simple enough to understand. Now I have a few questions.

1] Some crawler HTTP requests that I have seen use the If-Modified-Since header to ask whether the URL has been modified since that date. Should I drop parts 2a and 2b and add that header to the HTTP request of part 5 instead? If so, will the response body be blank when the modified dates match, or what exactly happens?

2] If, for example, the URL to be indexed is 'http://bill.microsoft.com/index.html', is the robots.txt that I should be requesting at 'http://bill.microsoft.com/robots.txt' or at 'http://microsoft.com/robots.txt'? And if it is at 'http://microsoft.com/robots.txt', should I simply remove whatever comes before the first period in the host name, or is there a Perl function that can do this for me?
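On question 2, the robots exclusion standard places robots.txt at the root of the exact host named in the URL, subdomain included, so nothing needs to be stripped. A minimal sketch (Python here purely for illustration, since the crawler itself may be written in Perl):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # robots.txt lives at the root of the exact host in the URL,
    # subdomain and all -- nothing before the first dot is removed.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

So `robots_url("http://bill.microsoft.com/index.html")` yields "http://bill.microsoft.com/robots.txt". On question 1: a server that honours If-Modified-Since replies with status 304 Not Modified and an empty body when the page is unchanged, so checking the response status code can replace steps 2a and 2b.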
3] Some crawler HTTP requests that I have seen use the From header to give the e-mail address of the crawler's creators as a point of contact. Do you think it is important to use this header?

4] While reading through guidelines on building a crawler, they stated that the crawler should include in the HTTP Referer header the page it came from; this would help webmasters find broken links on their website when the crawler hits a 404 page. None of the crawler HTTP requests that I have seen use this field, so do you think it is important to do so or not?

5] In part 7, I stated that if the meta-refresh delay is less than or equal to 10, the page given in the meta-refresh tag is indexed. Should I increase the number to 20 or 30, is 10 enough, or should I not listen to the meta-refresh at all and index the page anyway?

6] I would like to build a good index, so I want to know which parts of an HTML page I should index. As seen above, I index the URL, title, meta-description, meta-keywords, modified date and non-HTML part of the HTML file. Is this what I should index, or is there more?

7] With regards to the index, should I store the URL, title, meta-description, meta-keywords and modified date in the database and keep the non-HTML part of the HTML file in a file in a directory referenced from the database, or should I store the non-HTML part in the database as well?

8] Google, AltaVista and all the other search engines store each file they index. Of course, I don't have gigabytes of space to store all of these files, so I am only storing the non-HTML portion of the file. Do you think I could skip storing the non-HTML portion of the HTML file and still have an acceptable searchable index?

9] I have heard that some crawlers index the alt attribute of the img tag. Is it important to do so?
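Step 6 and question 9 can be handled in a single parsing pass with a stock HTML parser. An illustrative Python sketch (the class name and stored fields are my own choices, not a standard) that collects the title, any named meta tags, and img alt text:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """One-pass extractor for title, named <meta> tags, and img alt text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}          # meta name (lowercased) -> content
        self.alt_texts = []     # alt attributes found on <img> tags
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
```

The same pass could also collect href attributes from anchor tags for step 8, so the file only needs to be parsed once.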
10] Google doesn't show the meta-description underneath the page title on its results page; does it index the meta-description at all? I heard that Lycos, or some other search engine, ignores the meta-description, so is it important to index it?

After finishing the crawler I am going to work on the search-engine part. Are there any algorithms available that could help me sort the search results according to relevance? My first search engine simply went through the index sequentially and listed the records that had the keywords in the title first, followed by the ones that had the keywords in the description. I don't really understand how to judge record relevance, but I know I can't use the link-popularity approach, because I am not indexing the links found on pages, and I don't think it would make much of a difference for my small index.

Sorry for the lengthy e-mail, but just like you, we all were beginners and need to learn from our elders.

Yousuf Philips

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
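On the ranking question above: the classic textbook approach is the vector-space model with tf-idf weighting, but even a hand-rolled weighted term-frequency score improves on "title matches first, description matches second". A toy sketch in Python (the field names and weights are arbitrary illustrative choices):

```python
# Toy relevance score: weighted term frequency across indexed fields.
# The weights are illustrative guesses, not a standard; tune to taste.
FIELD_WEIGHTS = {"title": 3.0, "meta_description": 2.0, "body": 1.0}

def score(record, keywords):
    """Sum keyword occurrences per field, scaled by that field's weight."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = record.get(field, "").lower().split()
        for kw in keywords:
            total += weight * words.count(kw.lower())
    return total
```

Sorting records by this score in descending order gives a simple relevance ordering; tf-idf refines it further by down-weighting terms that appear in nearly every document.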
