glenn mcdonald wrote:
> I don't think this really even counts as irony. The difference in
> scalability between indexing raw HTTP response-strings, as they do
> now, and running virtual browsers to execute arbitrary client-side
> code on any page, as they'd have to do to get the Exhibit-rendered
> contents, is at this stage of web development pretty much
> prohibitive. In the long term, it's a great idea to push for a
> standard approach to providing a structured dataset alongside the
> human-readable page and the change-trackable RSS/Atom feed (or maybe
> eventually in place of the latter...).
>
> In the short term, though, you're going to have to feed Google data
> it can understand. And I think any approach that involves trying to
> somehow get the Exhibit-rendered HTML saved into a static form and
> manually retrofitted back into the source page is going to be too
> hard for most of the users you *want*, and probably too annoying even
> where it's not too hard. You'd have to *redo* it every time you
> change the data! Yuck.

Agreed. Yuck indeed.
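For reference, the kind of side-by-side structured dataset glenn describes might look like this in Exhibit's JSON item format. This is only a sketch: the records are invented, and the file name is arbitrary.

import json

# A sketch of the dataset an exhibit page could publish alongside its
# human-readable HTML. The "items"/"type"/"label" keys follow Exhibit's
# JSON conventions; the records themselves are made up.
dataset = {
    "items": [
        {"type": "Book", "label": "Weaving the Web", "year": "1999"},
        {"type": "Book", "label": "Spinning the Semantic Web", "year": "2003"},
    ]
}

with open("exhibit-data.json", "w") as f:
    json.dump(dataset, f, indent=2)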
> So I think you have to go in the other direction.
>
> 1. Let people keep making their "default" pages exactly as they're making
> them now.
> 2. Get 'em to use Piggy Bank to scrape their *existing* pages into RDF/JSON.
> 3. Have Exhibit run off of the scraper results, either by giving JSON files
> back to the user or, even better, hosting them on an Exhibit server that can
> also monitor the source pages (or their feeds) and automatically rescrape and
> update the JSON files when the user changes the source data.
>
> What about *that* idea?

That's not entirely beyond our reach (a rough sketch of the monitor-and-rescrape piece of your step 3 is at the end of this message). However, from my informal observations of SemWeb researchers at ISWC 2005 playing with Piggy Bank + Semantic Bank, I think that if there is one too many steps in the workflow, people can get confused--even people who do SemWeb for a living! We still have not lived and breathed in a medium where structured data bits and pieces flow smoothly. Furthermore, scraping handwritten HTML is not for everyone. A goal of Exhibit is to avoid scraping, especially scraping handwritten HTML.

Another option, close to what you're suggesting, is to get Babel to act as a proxy between Google and exhibits. As an author of an exhibit, I tell Babel to scrape and cache my exhibit as fully rendered HTML content that Google can crawl. But when a user is directed from a Google search to Babel, Babel will put up a big sign pointing over to my exhibit (also sketched at the end of this message). A nice side effect of this option is that we'll be hoarding exhibit data, hehe, for good not evil :-)

David
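P.S. For concreteness, here is a rough sketch of the "monitor and rescrape" server in glenn's step 3. It polls each source page and regenerates the Exhibit JSON file only when the page content actually changes. Everything here is hypothetical: the URLs are invented, and scrape_to_exhibit_json() stands in for whatever Piggy Bank scraper logic applies to a given page.

import hashlib
import json
import time
import urllib.request

SOURCES = {
    # source page URL -> JSON file that Exhibit reads
    "http://example.org/my-books.html": "my-books.json",
}

def scrape_to_exhibit_json(html):
    """Hypothetical placeholder for the real scraper."""
    return {"items": []}

def poll_forever(interval=3600):
    seen = {}  # URL -> hash of the last content we scraped
    while True:
        for url, json_path in SOURCES.items():
            html = urllib.request.urlopen(url).read()
            digest = hashlib.sha1(html).hexdigest()
            if seen.get(url) != digest:
                # Page changed: rescrape and rewrite the JSON file.
                data = scrape_to_exhibit_json(html.decode("utf-8", "replace"))
                with open(json_path, "w") as f:
                    json.dump(data, f, indent=2)
                seen[url] = digest
        time.sleep(interval)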
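P.P.S. And a minimal sketch of the Babel-as-proxy idea: serve the cached, fully rendered snapshot to crawlers, and put up the "big sign" pointing human visitors back to the live exhibit. The Googlebot user-agent check, the port, the cache file, and the URLs are all assumptions for illustration.

from http.server import BaseHTTPRequestHandler, HTTPServer

# Pre-rendered snapshot of the exhibit, produced ahead of time.
CACHED_HTML = open("rendered-exhibit.html", "rb").read()
LIVE_EXHIBIT_URL = "http://example.org/my-exhibit.html"  # hypothetical

class BabelProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        if "Googlebot" in agent:
            # Crawler: serve the static, fully rendered snapshot.
            self.wfile.write(CACHED_HTML)
        else:
            # Human visitor: the "big sign" pointing to the live exhibit.
            self.wfile.write(
                ('<p>This is a cached copy. The live exhibit is at '
                 '<a href="%s">%s</a>.</p>'
                 % (LIVE_EXHIBIT_URL, LIVE_EXHIBIT_URL)).encode("utf-8"))

HTTPServer(("", 8080), BabelProxy).serve_forever()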
