Regarding this: 

What's there to invent after Google? 

Quite a lot, actually.  Google has built a magnificent search portal 
for the Internet, but there's still room in the market for companies 
like Inktomi, Verity, DTSearch, AltaVista, and dozens of others big 
and small.  The reason is that search is an extremely rich problem 
domain, and different users have different search needs.  Searching 
source code, tagged documents, databases, log files, archives, LDAP 
servers, Usenet, and the Internet is a lot to ask of any single product. 
Google, AllTheWeb, and other free search engines are optimized for 
one aspect of the IR problem domain: returning relevancy-scored results 
for queries against a massive index of web content.  Their business model 
is largely based on selling advertisements keyed to the terms entered 
into a search page, and on providing a compelling portal from which end 
users link out to other sites; the choices they've made in their indexing 
approach reflect that model.  However, many of those choices are not 
necessarily suitable for other aspects of the IR problem. 

For instance, most of the indexing algorithms used for Internet search 
are lossy, and the index administrators (or programmers) have fixed the 
depth of the index in advance.  The index relies on stop words to keep 
it a manageable size, and result sets return only a fraction of the 
matches from a corpus of billions of documents, for good reason.  But 
these kinds of constraints are not suitable for source code, log file, 
or legal document analysis.  Further, the weightings used in relevancy 
scoring are not necessarily the same across different document 
repositories.  For instance, popularity-based relevance has little 
bearing on corporate LANs full of ordinary business documents, and 
whereas keyword and meta-tag scoring have fallen out of favor with free 
public search engines, they can be very effective parameters when 
scoring a query against a more controlled document repository.  Creating 
the most effective index possible requires the index administrator, or 
an automated query optimizer, to adjust the weightings of a wide range 
of variables that affect the size, depth, and effectiveness of the index. 
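
As a rough sketch of what I mean by adjustable weightings, here's a toy 
scoring function in Python.  The signal names, weight values, and the two 
profiles are my own illustration, not taken from any real product: 

    # A toy relevancy score built as a tunable weighted sum of signals.
    # Everything here is illustrative; swap in whatever signals your
    # repository actually supports.
    def relevancy(doc, query_terms, weights):
        """Score one document against a query with adjustable weightings."""
        body = doc["body"].lower().split()
        tf = sum(body.count(t) for t in query_terms)   # raw term frequency
        meta = sum(1 for t in query_terms if t in doc.get("keywords", []))
        pop = doc.get("popularity", 0.0)  # link-based; near-useless on a LAN
        return (weights["term_freq"] * tf
                + weights["meta_tags"] * meta
                + weights["popularity"] * pop)

    # Public-web profile: popularity dominates, meta tags are distrusted.
    web_weights = {"term_freq": 1.0, "meta_tags": 0.1, "popularity": 5.0}

    # Controlled-repository profile: meta tags trusted, popularity ignored.
    lan_weights = {"term_freq": 1.0, "meta_tags": 3.0, "popularity": 0.0}

The same document can score very differently under the two profiles, and 
that's exactly the point: the weightings belong to the repository, not to 
the engine. 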

Consider also vertical searches: indexes optimized for a specific domain. 
A researcher in a particular discipline may benefit from a clean index 
with a finely honed affinity to that discipline.  Such indexes allow for 
a tremendous signal-to-noise ratio.  Imagine, for example, an index 
specific to Genetic Programming that contains daily traffic from message 
boards, Usenet messages, and other online content, intersected with 
information from your LAN, your inbox, your source code, and other 
proprietary sources.  You can achieve an effective depth and breadth of 
content in such an index with far fewer resources than a less 
discriminating database would require. 
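
To make that concrete, here's a toy sketch in Python of how such a 
discriminating index might admit documents.  The seed terms and the 
threshold are invented for illustration; a real system would use a far 
better affinity test: 

    # A toy vertical-index builder: documents from any source, public or
    # proprietary, are admitted only if they show enough affinity to the
    # target discipline.  The seed-term overlap test is deliberately crude.
    SEED_TERMS = {"genetic", "programming", "crossover", "mutation", "fitness"}

    def affinity(text, seeds=SEED_TERMS):
        """Fraction of the seed terms that appear in the text."""
        words = set(text.lower().split())
        return len(words & seeds) / float(len(seeds))

    def build_vertical_index(sources, threshold=0.4):
        """Merge Usenet feeds, your inbox, your source tree, etc.
        into one small, clean index."""
        index = []
        for source in sources:             # each source yields documents
            for doc in source:
                if affinity(doc["body"]) >= threshold:
                    index.append(doc)
        return index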

Finally, don't forget about cost.  Last time I checked, the enterprise 
versions of Google, AltaVista, and Inktomi all charged an escalating fee 
that corresponds to the number of documents indexed, a licensing model 
that can drastically increase the TCO of these solutions as the end 
user's business grows. 

I have built a discriminating filter that has most of these capabilities, 
and many more that I can't describe here.  That's why I never post: I've 
been busy working on the project on the side for over three years.  I can 
reveal more about it in the next couple of months, after my management 
decides its level of interest in owning the code. 

It's good to see the activity on the mailing list today.  I suspect that 
a lot of people who would normally post are busy working on their own 
robots, or just flat-out lucky enough to be working. 


-----Original Message----- 
From: Paul Maddox [mailto:paulmdx@hotpop.com] 
Sent: Friday, November 08, 2002 3:42 AM 
To: [EMAIL PROTECTED] 
Subject: Re: [Robots] Post 


Hi, 

I'm sure even Google themselves would admit that there's scope for 
improvement.  With Answers, Catalogs, Image Search, News, etc, etc, 
they seem to be quite busy! :-) 

As an AI programmer specialising in NLP, personally I'd like to see 
web bots actually 'understanding' the content they review, rather 
than indexing by brute force.  How about the equivalent of Dmoz or 
Yahoo Directory, but generated by a web spider? 

Paul. 


On Fri, 08 Nov 2002 10:22:48 +0100, Harry Behrens wrote: 
>Haven't seen traffic in ages. 
>I guess the theme's pretty much dead. 
> 
>What's there to invent after Google? 
> 
>    -h 
> 


_______________________________________________ 
Robots mailing list 
[EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots 
