Hi Michel,

>tractable way of generating useful web indexes. So far NLP has shown
>to be too time consuming and error-prone for a task this size
>(correct me if I'm wrong! NLP is not really my area). Ontology use

Certainly a lot of the NLP systems that currently exist were not made for highly symmetric web spider type scenarios. I think it's pretty hard work getting NLP to be truly effective on a single document, let alone the internet! Conventional NLP methods of building syntactic structures etc. nearly always result in pretty high memory consumption as well as processing time. Of course, this is because language is inherently ambiguous.

>Remember that the web is highly dynamic and HUGE. There are no
>standard protocols to receive messages when a new page is created or
>when the contents or address of a page have changed. So you have

I've just been thinking about this. One method I thought could help cut down the cost of refreshing an index is to check the file size. I'm guessing it would save a lot of bandwidth to first get the file size, and skip the page if it's the same size as recorded last time. This wouldn't work perfectly, but then probably nothing dealing with (as Google now reports) 3 billion pages will. :-)

>always to keep "browsing" and updating knowledge. My opinion is that
>maybe Google is the best you can get (well, the ranking scheme can
>always get a little better with some minor changes) when you want to
>treat all web pages. NLP and other processing methods can be used on
>top of this to generate something better, but the domain has to be

Yes, I totally agree. Clearly Google does a pretty good job as it is. I've not tried it myself, but Google do offer an API for their search (for Java and .NET). I wonder if extra functionality can be bolted on top of that.

>Link Discovery system. The objective of such system is to find
>patterns (spacial and time, hopefully seamlessly) in data that has
>been previously been pre-processed and stored using ontologies
>(DAML+OIL, mainly, with some extra things to enable pattern
>definition) as a data structure framework.

Ah right, interesting stuff. What sort of data are you using? Or I should say, what is the original source?

Paul.
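P.S. The file-size check mentioned above could be sketched roughly like this. It is only an illustration under a few assumptions: the server answers HEAD requests and reports a Content-Length header, and the crawler has stored the size it saw last time. The names head_content_length and needs_refetch are made up for this sketch and do not come from any existing spider.

```python
import urllib.request
from typing import Callable, Optional


def head_content_length(url: str, timeout: float = 10.0) -> Optional[int]:
    """Ask the server for headers only and return Content-Length, if reported."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    # Some servers omit the header (e.g. for dynamic pages); treat that as unknown.
    return int(value) if value is not None and value.isdigit() else None


def needs_refetch(
    url: str,
    recorded_size: Optional[int],
    fetch_size: Callable[[str], Optional[int]] = head_content_length,
) -> bool:
    """Return True when the page should be downloaded and re-indexed.

    Assumption: an unchanged size means unchanged content. That is a
    heuristic only; an edit that keeps the byte count the same is missed.
    """
    current_size = fetch_size(url)
    if current_size is None or recorded_size is None:
        # No basis for comparison, so fall back to a full fetch.
        return True
    return current_size != recorded_size
```

The size lookup is passed in as a parameter so the comparison can be tried out without touching the network, for example needs_refetch(url, 1200, fetch_size=lambda u: 1200). Pages whose size is unknown are always re-fetched, which keeps the index safe at the cost of some bandwidth.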
_______________________________________________
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots