Hi Michel,

>tractable way of generating useful web indexes. So far NLP has shown
>to be too time consuming and error-prone for a task this size
>(correct me if I'm wrong! NLP is not really my area). Ontology use

Certainly a lot of existing NLP systems were not made for highly 
symmetric web spider type scenarios..  I think it's pretty hard work 
getting NLP to be truly effective on a single document, let alone the 
internet!  Conventional NLP methods of building syntactic structures 
etc. nearly always result in pretty high memory consumption as well 
as long processing times.  Of course, this is because language is 
inherently ambiguous, so many candidate analyses have to be kept 
around.

>Remember that the web is highly dynamic and HUGE. There are no
>standard protocols to receive messages when a new page is created or
>when the contents or address of a page have changed. So you have

I've just been thinking about this..  One method that could help cut 
down the cost of refreshing an index is to check the file size.  I'm 
guessing it would save a lot of bandwidth to first get the file size 
(e.g. with an HTTP HEAD request and its Content-Length header), and 
skip the page if it's the same size as recorded last time.  This 
wouldn't work perfectly, since a page can change without its size 
changing, but then nothing dealing with (as Google now reports) 3 
billion pages probably will. :-)
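
A rough sketch of the size check described above, in Python.  This is 
only an illustration under some assumptions: the server answers HEAD 
requests and sends a Content-Length header (many dynamic pages don't), 
and the function names (fetch_content_length, parse_length, 
should_refetch) are made up for this example.

```python
import urllib.request
from typing import Optional


def parse_length(value: Optional[str]) -> Optional[int]:
    """Turn a Content-Length header value into an int, or None if unusable."""
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        return None


def fetch_content_length(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send an HTTP HEAD request (no body is transferred) and return the size.

    Returns None when the server does not report a usable Content-Length.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_length(response.headers.get("Content-Length"))


def should_refetch(recorded_size: Optional[int],
                   current_size: Optional[int]) -> bool:
    """Skip the page only when both sizes are known and identical.

    If either size is unknown we refetch, to stay on the safe side.
    A page edited without changing its length is still missed.
    """
    if recorded_size is None or current_size is None:
        return True
    return recorded_size != current_size
```

A spider would call fetch_content_length(url) for each indexed page 
and only download the full page when should_refetch(...) is true.  
Falling back to a refetch when the size is unknown trades some 
bandwidth for not silently going stale.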

>always to keep "browsing" and updating knowledge. My opinion is that
>maybe Google is the best you can get (well, the ranking scheme can
>always get a little better with some minor changes) when you want to
>treat all web pages. NLP and other processing methods can be used on
>top of this to generate something better, but the domain has to be

Yes, I totally agree.  Clearly Google does a pretty good job as it 
is.  I've not tried it myself, but Google do offer an API for their 
search (for Java and .NET).  I wonder if extra functionality could be 
bolted on top of that..

>Link Discovery system. The objective of such system is to find
>patterns (spatial and time, hopefully seamlessly) in data that has
>previously been pre-processed and stored using ontologies
>(DAML+OIL, mainly, with some extra things to enable pattern
>definition) as a data structure framework.

Ah right, interesting stuff..  What sort of data are you using?  Or I 
should say, what is the original source?

Paul.


_______________________________________________
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
