I think what Glenn proposes is that my exhibit page should contain data 
in human-readable/authored html, from which json is constructed 
automatically, but then when someone visits the exhibit page, all that 
html I authored is replaced by the html generated by the exhibit 
javascript.   Speaking as someone who tried it, I personally find it 
much easier to manage my data in a json file than I would in a typical 
html page (this is why my html publications page is 3 years out of date 
but my json pubs page is up to date).   So, putting my data in html and 
having it autoscraped is backwards---it is sacrificing one of the major 
benefits of exhibit. 

On the other hand, since I don't plan for that data to actually be read by 
most human beings, it doesn't actually have to be pretty.   And, as 
David H. points out, it is risky to rely on scraping of arbitrary html.  
So, it makes sense to put the data in the page as "rigidly structured" 
html---something that can be parsed rather than scraped.  There are 
several options here: an html table, a microformat, or just a blob of 
json.  I like the json option best because it is less verbose and not full of 
fiddly brackets and tags.  But if I want my document to be legal html, 
then I will have to start using html escapes for some of the symbols 
that are perfectly normal in json; this will be annoying, and it means the 
embedded json isn't legal json anymore. Of course, if all I care about is google 
reading the page, I may not need to worry about the html being legal.
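
To make the escaping problem concrete, here is a tiny sketch (python, purely 
for illustration; the item fields are made up) of what happens when a json 
blob is escaped so that it can legally sit inside an html element:

    import html
    import json

    # A fragment of exhibit-style data; the label and type values are hypothetical.
    data = {"items": [{"label": "Scaling & Sharding <draft>", "type": "Publication"}]}

    raw = json.dumps(data)
    # raw is legal json:
    #   {"items": [{"label": "Scaling & Sharding <draft>", "type": "Publication"}]}

    escaped = html.escape(raw, quote=False)
    # escaped is legal as html element text, but it is no longer legal json:
    #   {"items": [{"label": "Scaling &amp; Sharding &lt;draft&gt;", "type": "Publication"}]}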

As I've argued before, there's no reason to pick just one here: with the 
previously mentioned army of programmers, we could support parsing of 
_all_ the formats I mention above.

However, there is one flaw in the above approach that leaves me still 
interested in the idea of just letting xibit render the page and 
snapshotting the result.  As an exhibit gets big, one natural way to 
control its complexity is to have _multiple_ data files---eg one for a 
schema that you reuse on multiple exhibits, another for data of type 1, 
another for data of type 2, another with lat/long data collected from 
the googlemaps API.  Under the above scheme I will have to take all 
those files and mush them into the single exhibit html page.  Since this 
loses all my organization, obviously this mushed-up data won't be my 
primary copy. Rather, I'll have separate files, I'll add new data to 
them, and I'll have some manual or automated process for reconstructing 
the all-in-one-page exhibit.
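
For what it's worth, that reconstruction step needn't be much code. Here is a 
rough sketch (python; the file names, the placeholder comment, and the 
assumption that each file is an exhibit-style json object with "items", 
"types", and "properties" are all made up for illustration):

    import json
    from pathlib import Path

    # Hypothetical separate files that I actually maintain by hand.
    SOURCES = ["schema.json", "type1.json", "type2.json", "latlong.json"]
    TEMPLATE = "exhibit-template.html"   # authored page containing the placeholder
    OUTPUT = "exhibit.html"              # the all-in-one page that google gets to see
    PLACEHOLDER = "<!--EXHIBIT-DATA-->"

    merged = {"items": [], "types": {}, "properties": {}}
    for name in SOURCES:
        part = json.loads(Path(name).read_text())
        merged["items"].extend(part.get("items", []))
        merged["types"].update(part.get("types", {}))
        merged["properties"].update(part.get("properties", {}))

    page = Path(TEMPLATE).read_text()
    Path(OUTPUT).write_text(page.replace(PLACEHOLDER, json.dumps(merged, indent=2)))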

But if I am indeed going to be repeating a process for squeezing all my 
data files onto one page, what process could be easier than visiting the 
exhibit page and saying "file: save"---i.e., the snapshot approach that 
Glenn doesn't like?  (I should also clarify that I don't see a need to 
get the exhibit snapshot "manually retrofitted back into the source 
page" as Glenn suggests; rather, I envision that the exhibit snapshot would 
_be_ the _entire_ source page, and that all I would have to do, each 
time I want to update it, is visit the source page (which contains the 
old data snapshot that would be instantly erased by the exhibit script 
and replaced with the up-to-date exhibit) and then click "save".) This 
doesn't seem outlandish to me; it's no more than anyone does when they 
have a separate "authoring/proofreading" and "publishing" copy of a 
file---obviously, you want to look at your edits before they go public, 
so you keep a separate copy of the file, edit it, then copy it to the 
visibly published location when you are happy with it. It would be the 
same with xibit.

David Huynh wrote:
> glenn mcdonald wrote:
>   
>> I don't think this really even counts as irony. The difference in  
>> scalability between indexing raw HTTP response-strings, as they do  
>> now, and running virtual browsers to execute arbitrary client-side  
>> code on any page, as they'd have to do to get the Exhibit-rendered  
>> contents, is at this stage of web development pretty much  
>> prohibitive. In the long term, it's a great idea to push for a  
>> standard approach to providing a structured dataset alongside the  
>> human-readable page and the change-trackable RSS/Atom feed (or maybe  
>> eventually in place of the latter...).
>>
>> In the short term, though, you're going to have to feed Google data  
>> it can understand. And I think any approach that involves trying to  
>> somehow get the Exhibit-rendered HTML saved into a static form and  
>> manually retrofitted back into the source page is going to be too  
>> hard for most of the users you *want*, and probably too annoying even  
>> where it's not too hard. You'd have to *redo* it every time you  
>> change the data! Yuck.
>>   
>>     
> Agreed. Yuck indeed.
>
>   
>> So I think you have to go in the other direction.
>>
>> 1. Let people keep making their "default" pages exactly as they're making 
>> them now.
>> 2. Get 'em to use Piggy Bank to scrape their *existing* pages into RDF/JSON.
>> 3. Have Exhibit run off of the scraper results, either by giving JSON files 
>> back to the user or, even better, hosting them on an Exhibit server that can 
>> also monitor the source pages (or their feeds) and automatically rescrape 
>> and update the JSON files when the user changes the source data.
>>
>> What about *that* idea?
>>   
>>     
> That's not entirely beyond our reach. However, from my informal 
> observations of SemWeb researchers at ISWC 2005 playing with Piggy Bank 
> + Semantic Bank, I think that if there is one too many steps in the work 
> flow, people can get confused--even people who do SemWeb for a living! 
> We still have not lived and breathed in a medium where structured data 
> bits and pieces flow smoothly.
>
> Furthermore, scraping handwritten HTML is not for everyone. A goal of 
> Exhibit is to avoid scraping, especially scraping handwritten HTML.
>
> Another option close to what you're suggesting is to get Babel to act as 
> a proxy between Google and exhibits. As an author of an exhibit, I tell 
> Babel to scrape and cache my exhibit as fully rendered HTML content that 
> Google can crawl. But when a user is directed from a Google search to 
> Babel, Babel will put up a big sign pointing over to my exhibit.
>
> A nice side effect of this option is that we'll be hoarding exhibit data, 
> hehe, for good not evil :-)
>
> David
>
