Some additional observations:

With regard to Google searches, having the data "out of page" is pretty 
much a non-starter.  Even if we can convince Google to index the data 
page, the result will be searchers directed to the data page, which is 
clearly not the goal.  The only way around that would be to serve 
different content to spiders than to humans, which is exactly what 
spammers do, and doing so would likely get exhibits banned from Google.

So, the only real option I see is for the data to be in the page.  There 
are three obvious options here.  Two are in-page embeddings by the 
editor, as JSON or HTML tables.  These have several downsides.  First, 
for this to be truly legitimate HTML, certain characters would have to 
be escaped, which would make the data harder to edit by hand.  Second, 
it prevents someone from staying organized by linking to multiple 
distinct JSON files.  Finally, it doesn't work for data in a Google 
spreadsheet or other external sources.  It does have the distinct 
advantage of making small exhibits even more portable---now there are 
only half as many files to copy and save :)
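To make the escaping problem concrete, here is a minimal sketch (the 
helper name is hypothetical, not part of Exhibit):

```javascript
// Sketch of the escaping problem with in-page JSON.  For the page to
// be legitimate HTML, characters like '&', '<', and '>' inside the
// embedded data must be turned into character references---which is
// exactly what makes hand-editing the data awkward.
function escapeForHtml(jsonText) {
    return jsonText
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

var data = JSON.stringify({ label: "a < b & c" });
// What the author would actually have to put into the page:
var embedded = "<pre id=\"exhibit-data\">" + escapeForHtml(data) + "</pre>";
```

Reading the data back out would then require un-escaping before the 
JSON can be parsed, adding another step on the Exhibit side.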

So, the alternative I prefer is getting a "save as" feature to work 
properly.  After Exhibit renders the page, you have a document that 
contains all the data nicely formatted as HTML.  Thus, it is completely 
aboveboard as a page for Google to index.  Also, it doesn't matter where 
the data comes from---internal, files, Google spreadsheets---it all ends 
up in the page ready to be indexed.  Of course, if someone visits the 
page, the pre-embedded data on the page is immediately discarded---it 
plays no role in the construction of the exhibit.  However, what Exhibit 
constructs will be almost the same page, since it is filling in the same 
data.  So, I have no qualms about Google blacklisting us for misbehavior.
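A rough sketch of what such a "save as" step might do (the function 
name is hypothetical, and it assumes a browser DOM is available):

```javascript
// Hypothetical "save as" step: once Exhibit has finished rendering,
// serialize the live DOM back to static HTML.  The saved snapshot
// then carries all the data as ordinary markup, ready for Google to
// index, and is nearly identical to what Exhibit rebuilds for a live
// visitor.
function snapshotRenderedPage(doc) {
    // outerHTML of the root element captures everything Exhibit has
    // filled into the page; the doctype must be re-added by hand
    // since serialization of the root element omits it.
    return "<!DOCTYPE html>\n" + doc.documentElement.outerHTML;
}

// In a browser one would call snapshotRenderedPage(document) after
// Exhibit signals that rendering is complete.
```

The hard part, as noted below, is that browsers' built-in Save Page 
As... does not serialize the generated DOM this faithfully, so the 
feature would have to be implemented in script.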

All of these options do have another advantage: they make the exhibit 
accessible to someone with Javascript disabled.  Again, the last option 
(saving the post-Exhibit HTML) makes the nicest presentation.

-David

David Huynh wrote:
> Hi all,
>
> Exhibit suffers from the same Achilles heel as other Ajax applications: 
> the dynamic content that gets inserted on-the-fly is totally invisible 
> to Google. My whole web site is now invisible to Google :-) Perhaps this 
> is the biggest impediment to adoption.
>
> Johan has added some code that allows Exhibit to load data from HTML 
> tables. This lets your data be shown even if Javascript is disabled and 
> lets your data be visible to Google. However, HTML tables are clunky 
> for storing data.
>
> There is another alternative: inserting your data encoded as JSON 
> between <pre>...</pre> and then getting Exhibit to grab that text out 
> and eval(...) it. If Javascript is disabled, the data is displayed as 
> JSON--not so pretty.
>
> However, if the data is fed from another source, such as Google 
> Spreadsheets, then neither of these approaches can be used.
>
> We've also entertained the idea of using the browser's Save Page As... 
> feature to snapshot a rendered exhibit and then using that as the public 
> page. Exhibit still gets loaded into that page, but it would initially 
> not change the DOM until some user action requires it to. However, the 
> browser's Save Page As... feature doesn't do a very good job of saving 
> the generated DOM.
>
> So, I think anything we do would look pretty much like a hack and work 
> for only some cases. We also risk getting blacklisted by Google's 
> crawler. So, what do we do? Is it possible to ask Google to scrape those 
> exhibit-data links in the heads of the pages? And how do we do that?
>
> David
>
> _______________________________________________
> General mailing list
> [email protected]
> http://simile.mit.edu/mailman/listinfo/general
>   