glenn mcdonald wrote:
> I don't think this really even counts as irony. The difference in
> scalability between indexing raw HTTP response-strings, as they do
> now, and running virtual browsers to execute arbitrary client-side
> code on any page, as they'd have to do to get the Exhibit-rendered
> contents, is at this stage of web development pretty much
> prohibitive. In the long term, it's a great idea to push for a
> standard approach to providing a structured dataset alongside the
> human-readable page and the change-trackable RSS/Atom feed (or maybe
> eventually in place of the latter...).
>
> In the short term, though, you're going to have to feed Google data
> it can understand. And I think any approach that involves trying to
> somehow get the Exhibit-rendered HTML saved into a static form and
> manually retrofitted back into the source page is going to be too
> hard for most of the users you *want*, and probably too annoying even
> where it's not too hard. You'd have to *redo* it every time you
> change the data! Yuck.

Agreed. Yuck indeed.
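For reference, the kind of side-by-side structured dataset glenn describes might look like this in Exhibit's JSON item format. This is only a sketch: the records are invented, and the file name is arbitrary.

import json

# A sketch of the dataset an exhibit page could publish alongside its
# human-readable HTML. The "items"/"type"/"label" keys follow Exhibit's
# JSON conventions; the records themselves are made up.
dataset = {
    "items": [
        {"type": "Book", "label": "Weaving the Web", "year": "1999"},
        {"type": "Book", "label": "Spinning the Semantic Web", "year": "2003"},
    ]
}

with open("exhibit-data.json", "w") as f:
    json.dump(dataset, f, indent=2)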
> So I think you have to go in the other direction.
>
> 1. Let people keep making their "default" pages exactly as they're making
> them now.
> 2. Get 'em to use Piggy Bank to scrape their *existing* pages into RDF/JSON.
> 3. Have Exhibit run off of the scraper results, either by giving JSON files
> back to the user or, even better, hosting them on an Exhibit server that can
> also monitor the source pages (or their feeds) and automatically rescrape and
> update the JSON files when the user changes the source data.
>
> What about *that* idea?

That's not entirely beyond our reach (a rough sketch of the monitor-and-rescrape piece of your step 3 is at the end of this message). However, from my informal observations of SemWeb researchers at ISWC 2005 playing with Piggy Bank + Semantic Bank, I think that if there is one too many steps in the workflow, people can get confused--even people who do SemWeb for a living! We still have not lived and breathed in a medium where structured data bits and pieces flow smoothly. Furthermore, scraping handwritten HTML is not for everyone. A goal of Exhibit is to avoid scraping, especially scraping handwritten HTML.

Another option, close to what you're suggesting, is to get Babel to act as a proxy between Google and exhibits. As an author of an exhibit, I tell Babel to scrape and cache my exhibit as fully rendered HTML content that Google can crawl. But when a user is directed from a Google search to Babel, Babel will put up a big sign pointing over to my exhibit (also sketched at the end of this message). A nice side effect of this option is that we'll be hoarding exhibit data, hehe, for good not evil :-)

David
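P.S. For concreteness, here is a rough sketch of the "monitor and rescrape" server in glenn's step 3. It polls each source page and regenerates the Exhibit JSON file only when the page content actually changes. Everything here is hypothetical: the URLs are invented, and scrape_to_exhibit_json() stands in for whatever Piggy Bank scraper logic applies to a given page.

import hashlib
import json
import time
import urllib.request

SOURCES = {
    # source page URL -> JSON file that Exhibit reads
    "http://example.org/my-books.html": "my-books.json",
}

def scrape_to_exhibit_json(html):
    """Hypothetical placeholder for the real scraper."""
    return {"items": []}

def poll_forever(interval=3600):
    seen = {}  # URL -> hash of the last content we scraped
    while True:
        for url, json_path in SOURCES.items():
            html = urllib.request.urlopen(url).read()
            digest = hashlib.sha1(html).hexdigest()
            if seen.get(url) != digest:
                # Page changed: rescrape and rewrite the JSON file.
                data = scrape_to_exhibit_json(html.decode("utf-8", "replace"))
                with open(json_path, "w") as f:
                    json.dump(data, f, indent=2)
                seen[url] = digest
        time.sleep(interval)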
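P.P.S. And a minimal sketch of the Babel-as-proxy idea: serve the cached, fully rendered snapshot to crawlers, and put up the "big sign" pointing human visitors back to the live exhibit. The Googlebot user-agent check, the port, the cache file, and the URLs are all assumptions for illustration.

from http.server import BaseHTTPRequestHandler, HTTPServer

# Pre-rendered snapshot of the exhibit, produced ahead of time.
CACHED_HTML = open("rendered-exhibit.html", "rb").read()
LIVE_EXHIBIT_URL = "http://example.org/my-exhibit.html"  # hypothetical

class BabelProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        if "Googlebot" in agent:
            # Crawler: serve the static, fully rendered snapshot.
            self.wfile.write(CACHED_HTML)
        else:
            # Human visitor: the "big sign" pointing to the live exhibit.
            self.wfile.write(
                ('<p>This is a cached copy. The live exhibit is at '
                 '<a href="%s">%s</a>.</p>'
                 % (LIVE_EXHIBIT_URL, LIVE_EXHIBIT_URL)).encode("utf-8"))

HTTPServer(("", 8080), BabelProxy).serve_forever()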
