On Sep 17, 2007, at 12:44 PM, Paul Kinlan wrote:
I have created a C#/.Net Stream-based Microformat parser
(http://www.codeplex.com/microformat) and I am trying to create some
reference applications to show it off.

I am in the process of creating an "Operator"-like plugin for IE (it
currently parses and displays the microformats that have been found on
a page).

One of the other ideas that I am toying with is a Microformat spider,
that crawls the web looking for microformats, storing them and then
allowing them to be searched. My question is: How are people storing
the data present in microformats so that they can be searched and
maintained and consumed in applications etc?

In short, I use MySQL tables: one for each microformat and one for each elemental type that can be shared many-to-many (images, photos, tags, etc.). The elemental tables then have polymorphic many-to-many relationships with the tables for the formats themselves.
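
To make the shape of that concrete, here is a rough sketch, not our actual schema. It uses Python's sqlite3 in place of MySQL so it runs standalone, and every table, column, and function name in it (hcards, hreviews, tags, taggings, add_tag, and so on) is made up for illustration. The idea is that the join table stores a type name next to the id, so one "tags" table can be shared by every format table.

```python
import sqlite3

def build_schema(conn):
    # Hypothetical tables: one per microformat, one per shared elemental
    # type, plus a polymorphic join table between them.
    conn.executescript("""
        CREATE TABLE hcards   (id INTEGER PRIMARY KEY, fn TEXT NOT NULL, source_url TEXT);
        CREATE TABLE hreviews (id INTEGER PRIMARY KEY, summary TEXT NOT NULL, source_url TEXT);
        CREATE TABLE tags     (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
        -- taggable_type names the format table; taggable_id is a row id in it.
        -- The database cannot enforce a foreign key on a polymorphic reference.
        CREATE TABLE taggings (
            tag_id        INTEGER NOT NULL REFERENCES tags(id),
            taggable_type TEXT    NOT NULL,
            taggable_id   INTEGER NOT NULL,
            PRIMARY KEY (tag_id, taggable_type, taggable_id)
        );
    """)

def add_hcard(conn, fn, source_url):
    return conn.execute(
        "INSERT INTO hcards (fn, source_url) VALUES (?, ?)", (fn, source_url)
    ).lastrowid

def add_hreview(conn, summary, source_url):
    return conn.execute(
        "INSERT INTO hreviews (summary, source_url) VALUES (?, ?)", (summary, source_url)
    ).lastrowid

def add_tag(conn, taggable_type, taggable_id, name):
    # Create the tag once, then link it to the given item.
    conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
    (tag_id,) = conn.execute("SELECT id FROM tags WHERE name = ?", (name,)).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO taggings VALUES (?, ?, ?)",
        (tag_id, taggable_type, taggable_id),
    )

def tags_for(conn, taggable_type, taggable_id):
    rows = conn.execute(
        """SELECT t.name FROM tags t
           JOIN taggings g ON g.tag_id = t.id
           WHERE g.taggable_type = ? AND g.taggable_id = ?
           ORDER BY t.name""",
        (taggable_type, taggable_id),
    )
    return [name for (name,) in rows]

def items_tagged(conn, name):
    # Returns (type, id) pairs across all format tables.
    rows = conn.execute(
        """SELECT g.taggable_type, g.taggable_id FROM taggings g
           JOIN tags t ON t.id = g.tag_id
           WHERE t.name = ?
           ORDER BY g.taggable_type, g.taggable_id""",
        (name,),
    )
    return list(rows)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_schema(conn)
    card = add_hcard(conn, "Example Person", "http://example.org/about")
    add_tag(conn, "hcard", card, "people")
    print(tags_for(conn, "hcard", card))
```

The trade-off is that the database cannot check that taggable_id points at a real row, so the application has to keep those links consistent itself.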

We also build search indexes, currently using Ferret [http://ferret.davebalmain.com/trac/], but hopefully soon switching to our standard Lucene infrastructure at Technorati.

We cache all objects in memcache with indefinite timeouts (all cache clearing is done proactively). This includes all related items in one cache entry.
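
A minimal sketch of that caching pattern, under some loud assumptions: a plain dict stands in for memcache, another dict stands in for the database, and the class and method names (CachedStore, get, update) are invented for the example. Entries never expire on their own; the write path deletes the affected key, and each entry carries the item together with its related items.

```python
class CachedStore:
    """Read-through cache with no expiry and proactive invalidation."""

    def __init__(self, db):
        self._db = db        # stand-in for the relational store
        self._cache = {}     # stand-in for memcache; entries never time out
        self.db_reads = 0    # counts how often we fall through to the db

    @staticmethod
    def _key(kind, item_id):
        return f"{kind}:{item_id}"

    def _load(self, kind, item_id):
        # Build one entry holding the item and all of its related items.
        self.db_reads += 1
        row = self._db[(kind, item_id)]
        return {"item": dict(row["item"]), "tags": list(row["tags"])}

    def get(self, kind, item_id):
        key = self._key(kind, item_id)
        if key not in self._cache:
            self._cache[key] = self._load(kind, item_id)
        return self._cache[key]

    def update(self, kind, item_id, item=None, tags=None):
        # Write to the database first, then clear the cached entry so the
        # next read rebuilds it. Nothing ever expires by timeout.
        row = self._db[(kind, item_id)]
        if item is not None:
            row["item"] = dict(item)
        if tags is not None:
            row["tags"] = list(tags)
        self._cache.pop(self._key(kind, item_id), None)


if __name__ == "__main__":
    db = {("hcard", 1): {"item": {"fn": "Example Person"}, "tags": ["people"]}}
    store = CachedStore(db)
    store.get("hcard", 1)   # first read loads from the db
    store.get("hcard", 1)   # second read is served from the cache
    store.update("hcard", 1, tags=["people", "austin"])
    print(store.get("hcard", 1), store.db_reads)
```

Keeping related items in the same entry means one cache fetch per object, at the cost of having to clear that entry whenever any of the related items change.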

When it comes down to it, it's all a matter of scale. When we were indexing 10^5 and 10^6 items, we would actually parse some of the markup on the fly when someone did a search. Sounds crazy, but it worked all right for a while (I blame Tantek). Now we parse it all out into a relatively normalized model. We're at 10^8 or so items now. If we hit another order of magnitude we'll have to rethink things and probably take some stuff (like BLOBs) out of the relational database and put it somewhere else.

-ryan
_______________________________________________
microformats-discuss mailing list
microformats-discuss@microformats.org
http://microformats.org/mailman/listinfo/microformats-discuss
