Sent to you by Sean McBride via Google Reader: Anticipating the Next Generation of Search via Alt Search Engines by Guest Author on 9/30/08

Hank Williams

Most of the world’s most important information has structure. To be clear, by structure I mean that the data has separate fields for its component parts. A contact, for example, has separate fields for first name, last name, address, city, country, etc. This structured information is where most of the value of the Internet resides. For example, all e-commerce is centered around structured data such as product information records, which have fields like part numbers, prices, descriptions, etc. I would suggest that structured data, on the whole, has one or more orders of magnitude greater economic value than unstructured data.

And yet we currently have no centralized way to find structured information. Today the process is very ad hoc. We must know where to look in order to find what we are looking for. When we want to buy a new TV, perhaps we check Amazon, Best Buy, and Circuit City. If we want to find people to potentially hire, perhaps we go to LinkedIn. If we want to buy Beanie Babies, we go to eBay.

But shouldn’t it be possible to find structured information from a centralized source like Google? Yes, we have vertical search engines, but that really is entirely the wrong concept. There is nothing at all vertical (i.e. narrow) about structured search, and creating a separate engine for each type of data is really the wrong thing to do. Perhaps we think of this problem as “vertical” just because it happens to be hard. But this is indeed one of the broadest remaining problems on the Internet.

While the technical challenges are significant, the opportunity here is huge. This is because the direct economic value of structured data is, as I suggested above, *much* greater than that of raw text. And so the company that brings us structured search might have the potential to be *at least* as valuable as Google.

Understanding the Problem

The current form of the search engine makes lots of sense when the data you are searching is just a river of text, and all you are looking for is whether a set of words is present or, even with semantic search, whether a concept is present. But as the Internet becomes a web of structured data and you want to find records of a particular type with, for example, fields within a particular range, how will that work?

The first thing to consider is that the Web, despite what people would like to think, is not really a collection of millions of independent servers operated by different people and companies. The Web is a collection of servers that are all hooked into the major search engines, forming a kind of singular hive. These search engines operate as the brain of the Internet. They mirror a copy of almost every bit of data on the Internet and index it inside their own servers. This mirroring is doable because the task is, at the most basic level, relatively simple: store text and build an inverted index of it. I don’t mean to minimize the implementation complexities of modern search, but the basic concept is very simple.

Building a web-scale structured search engine is not nearly so conceptually simple. We have many years of experience storing structured information in SQL databases. But there are no web-scale SQL databases comparable to the hundreds of thousands of servers Google has under the hood for storing and indexing text. And even if such a beast did exist, the whole concept of the SQL/relational database doesn’t work at web scale, because you have to know up front what types of records you are going to store. You cannot just have every new user adding new record types. And yet this ability to search through and understand any kind of structure is *exactly* what you want a structured search engine to do. In the next generation of search, new structures must be as easy to add to the index as new web pages are to add to Google.

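To make the contrast concrete, here is a minimal sketch of the two ideas discussed above: an inverted index over plain text, and a store that accepts records of any type without a schema declared up front and can answer a range query on a field. The names `build_inverted_index`, `text_search`, and `RecordStore`, along with the sample data, are hypothetical and chosen only for illustration. This is a toy that runs in memory and says nothing about how such a system would work at web scale.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def text_search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


class RecordStore:
    """Schema-free store: any record type with any fields can be added."""

    def __init__(self):
        self.records = []

    def add(self, record_type, **fields):
        # The record type is kept under "_type" so it cannot clash with a field.
        self.records.append({"_type": record_type, **fields})

    def find(self, record_type, field, low, high):
        """Return records of one type whose numeric field lies in [low, high]."""
        return [
            r for r in self.records
            if r["_type"] == record_type
            and isinstance(r.get(field), (int, float))
            and low <= r[field] <= high
        ]


# Hypothetical sample data, for illustration only.
docs = {1: "cheap flat screen tv", 2: "tv stand", 3: "flat tire repair"}
index = build_inverted_index(docs)

store = RecordStore()
store.add("tv", brand="Acme", price=450)
store.add("tv", brand="Zenith", price=900)
store.add("person", name="John Doe", age=40)  # a new record type, no schema change

print(text_search(index, "flat tv"))
print(store.find("tv", "price", 400, 500))
```

The text search can only say whether words are present. The record query depends on knowing that `price` is a numeric field of a `tv` record, and that is the kind of understanding a structured search engine would need for every record type it encounters.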
Just as today’s search engines store any kind of text, tomorrow’s search engines must be able to grab structured information, understand it, and understand the relationships between structures. For example, you need to be able to ask your search engine, “Who are John Doe’s friends?” You need to be able to tell it that John Doe is a person and to find all the people that are connected to John as friends. This will require a fundamental rethinking of what a search engine does.

The solution to this is really a database problem. You need an infinitely horizontally scalable database that understands structure but is not limited by it. And then you need some new kind of crawler to extract data. Ideally you also have some sort of notification system that allows this new search engine to be notified when individual records change.

Getting There

A broad-based solution to this problem is what I would call Web 3.0 search, and it will be necessary for Web 3.0 to reach its full potential in the same way search engines are critical to the current Web universe. But despite its importance and obviousness, I think this problem will be solved not by one of the major players, but by a startup. The majors have too much work on their plates already, and it just makes more sense for them to let a focused startup figure all of this stuff out and to acquire it later.

But once this nut is cracked, it will be possible to explore the world of information in a way that makes the current incarnation of Google seem almost silly. And I believe that creating such a search engine would provide the motivation for almost every holder of actionable, relevant data to make that data available in a form that is searchable by such an engine. I do believe this is an “if you build it, they will come” situation, because of the scale of the problems that such a search engine solves, and because long before something like this got to be Google scale it would be invaluable.

As an example, imagine being able to search for all of the flights from New York to anywhere, available on American Airlines, that cost less than $200. From each of the flights you could click to see the cities associated with that flight. From each city you could see the top-rated restaurants with prices under $20 apiece. From there you could explore their neighborhoods, etc.

In many respects an interesting presage of this is Metaweb’s Freebase. Freebase allows you to explore data in much the way that I describe, but it is a database, and not a search engine. They present themselves as the structured version of Wikipedia and not the structured version of Google. In its present form, I think Freebase really needs to either become the next generation “structured data” search engine, or they need to hope that someone else invents it. Because without a good central search system, users will just never think of Freebase.

There is no doubt that such a structured search engine will come to pass. The need is too obvious and important. The most interesting question is what the smallest possible implementation of this that actually does something useful will look like. Because while the need is clear, the most capital-efficient way to get there never is.
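
The flight example described earlier (flights from New York on American Airlines under $200, then top-rated restaurants under $20 in each destination city) can be sketched as a filter on one record type followed by a hop to related records of another type. Everything in this sketch is an assumption made for illustration: the record fields, the function names `find_flights`, `top_restaurants`, and `explore`, and all of the sample data are invented, and a real engine would be querying crawled data rather than in-memory lists.

```python
# Hypothetical records; every name, price, and rating here is made up.
flights = [
    {"airline": "American Airlines", "origin": "New York",
     "destination": "Chicago", "price": 150},
    {"airline": "American Airlines", "origin": "New York",
     "destination": "Miami", "price": 240},
    {"airline": "Other Air", "origin": "New York",
     "destination": "Boston", "price": 90},
]
restaurants = [
    {"name": "Diner A", "city": "Chicago", "rating": 4.6, "price_per_person": 15},
    {"name": "Bistro B", "city": "Chicago", "rating": 4.8, "price_per_person": 45},
    {"name": "Cafe C", "city": "Chicago", "rating": 4.1, "price_per_person": 12},
]


def find_flights(flights, origin, airline, max_price):
    """Flights from one origin on one airline, priced strictly below max_price."""
    return [
        f for f in flights
        if f["origin"] == origin
        and f["airline"] == airline
        and f["price"] < max_price
    ]


def top_restaurants(restaurants, city, max_price, limit=3):
    """Highest-rated restaurants in a city priced strictly below max_price."""
    matches = [
        r for r in restaurants
        if r["city"] == city and r["price_per_person"] < max_price
    ]
    return sorted(matches, key=lambda r: r["rating"], reverse=True)[:limit]


def explore(flights, restaurants, origin, airline, max_fare, max_meal):
    """Follow the link flight -> destination city -> restaurants in that city."""
    return {
        f["destination"]: [
            r["name"] for r in top_restaurants(restaurants, f["destination"], max_meal)
        ]
        for f in find_flights(flights, origin, airline, max_fare)
    }


print(explore(flights, restaurants, "New York", "American Airlines", 200, 20))
```

The hop from flight to city to restaurant works here only because both record types happen to share a city name. Discovering and maintaining links like that across independently published data is the hard part that the sketch leaves out.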