Hi Harvey, I thought I'd share my solution with you. I took the suggestions of this group and put together a small Drupal module which I think is quite cool.
This is how it flows: I present an HTML page for each article, which is free to view (in some cases it contains an abstract, in others it doesn't). The page also holds the marketing blurb and information about subscribing to or purchasing the article. I had previously extracted the text from each PDF into a database field (it's used for searching), so I altered the article pages to add the noarchive meta tag (so the page won't be cached) and then detect the visitor's IP address. If the IP address matches that of a known bot (see http://iplists.com), the full text is appended to the HTML page.

I've also installed the XML sitemap module, which tracks changes on the website and lets the spiders know which pages have changed and need indexing.

I think this solution is robust: it gives Google the best-quality content to index, while keeping the subscription access control in place and guarding against the 403 problem.

Aaron

Harvey Kane wrote:
> Here's some reading material on how to detect Googlebot from when I
> tinkered with cloaking about a year ago (things may well have changed
> since then) - this method uses reverse DNS lookups, so gives better
> accuracy than user-agent sniffing, or comparing to a whitelist of Google
> IPs which are constantly changing.
>
> http://www.seofaststart.com/blog/google-proxy-hacking
> http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
> http://www.seoegghead.com/blog/seo/how-to-guide-prevent-google-proxy-hacking-p210.html
> http://www.seoegghead.com/blog/simplecloak-v2-php-implementation/
>
> Our experience is that while this is probably the best way of detecting
> Googlebot, it's not perfect. And here's the major drawback - you only
> have to get it wrong once, and everything goes pear-shaped.
>
> Say Googlebot turns up, your cloaking script gets it wrong that 1 time
> out of 100, and your website serves a 403 header instead of a 200.
> Googlebot crawls 300 pages, and is now going to remove all that content
> from the index like you told it to.
>
> Do give some consideration to the other alternative, which is to offer
> the first page or section of the content in plain HTML for free, and
> allow that to get indexed without any cloaking or trickery. Consumer
> magazine used to do something like this, and I thought it a very
> reasoned approach. Besides, PDFs in search results are just plain
> annoying - I don't have any statistics, but I'd wager you get a way
> better clickthrough rate, better visitor satisfaction and a lower bounce
> rate by giving them HTML instead of PDF content.
>
> Hope that helps a little. I'd be interested to know if you do come up
> with a better cloaking solution than what I have linked to above.
>
> Harvey.
>
> [EMAIL PROTECTED] wrote:
>> I'm working on a project at the moment (written in Drupal) that contains a
>> lot of content in the form of PDFs, which are only available to
>> subscribers. I would like to expose the PDF content for Google to index
>> (but not cache). I've seen Google do this with some other websites. Has
>> anyone done this before, or can anyone recommend a way of going about this?
>> Could IP-based authentication work?
>>
>> Aaron

--
NZ PHP Users Group: http://groups.google.com/group/nzphpug
To post, send email to [email protected]
To unsubscribe, send email to [EMAIL PROTECTED]
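[Editor's note] Aaron's flow above can be sketched in a few lines of plain PHP. This is a minimal illustration, not his actual Drupal module; the whitelist entries and the `$abstract`/`$fulltext` variables are placeholders, with the crawler IPs assumed to come from the lists at iplists.com.

```php
<?php
// Sketch of the flow: always serve the free HTML with a noarchive meta
// tag, and append the full text extracted from the PDF only when the
// request comes from a whitelisted crawler IP address.

$crawler_ips = array('66.249.66.1', '66.249.66.2'); // hypothetical entries

$abstract = '<p>Abstract and marketing blurb go here.</p>';       // placeholder
$fulltext = '<p>Full text previously extracted from the PDF.</p>'; // placeholder

$visitor_ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';

// noarchive keeps the page out of Google's cache even though it is indexed.
echo "<meta name=\"robots\" content=\"noarchive\">\n";
echo $abstract, "\n";

if (in_array($visitor_ip, $crawler_ips, true)) {
  echo $fulltext, "\n"; // crawlers see (and index) the subscriber-only text
}
```

Note the trade-off Harvey points out: a plain IP whitelist goes stale as Google's address ranges change, which is exactly the failure mode that leads to serving the wrong response to Googlebot.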

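[Editor's note] The reverse-DNS method from Harvey's links can be sketched as follows: resolve the visitor's IP to a hostname, require a googlebot.com or google.com domain, then resolve that hostname forward again and confirm it maps back to the same IP. The function name is illustrative, and the check is network-dependent (it performs live DNS lookups).

```php
<?php
// Verify Googlebot by double DNS lookup rather than by User-Agent
// sniffing or an ever-changing IP whitelist.

function is_verified_googlebot($ip) {
  $host = gethostbyaddr($ip);                         // reverse DNS lookup
  if ($host === false || $host === $ip) {
    return false;                                     // no PTR record
  }
  if (!preg_match('/\.(googlebot|google)\.com$/', $host)) {
    return false;                                     // wrong domain
  }
  return gethostbyname($host) === $ip;                // forward-confirm
}

// Example call; result depends on live DNS.
is_verified_googlebot('66.249.66.1');
```

As Harvey says, even this isn't perfect: a transient DNS failure makes the check return false, so the response served on a failed check should still be a 200 (the free page) rather than a 403.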