On Friday 14 November 2003 11:50, Chong, Herb wrote:
> if you are handling inter correlation properly, then terms can't cross
> sentence boundaries. if you are not paying attention to sentence
> boundaries, then you are not following rules of linguistics.

Isn't that quite a strict interpretation, however? There are many cases where linguistically separate sentences have strong dependencies; on the web, simple things like list items may be very closely related.

Put another way: it may not be trivially easy to detect sentence boundaries, nor is it certain that what is a boundary from a language viewpoint is really a hard boundary from a semantic perspective. And are there not varying levels of separation between sentences, not just one? Sentences close to each other are often related, back references being common.

As for storing boundaries in the index: am I naive to suggest just using marker tokens that could easily mark boundaries (sentence, paragraph, section)? Code that uses that information would obviously need to know the details of the marking used, but would it be infeasible to use such in-band information?

-+ Tatu +-
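
P.S. To make the marker-token idea a bit more concrete, here is a rough sketch in plain Python, not tied to any particular indexing library. Everything in it is an assumption for illustration: the marker strings `</s>` and `</p>`, the function names, and the deliberately naive sentence splitter (which has exactly the detection problem mentioned above). Markers are stored in the same positional index as ordinary terms, and counting the markers between two term positions gives a graded measure of separation instead of a single hard boundary.

```python
import re
from bisect import bisect_left

# Hypothetical in-band markers. The word tokenizer below only emits \w+ runs,
# so these strings can never collide with a real term.
SENT = "</s>"
PARA = "</p>"


def tokenize_with_markers(text):
    """Return a flat token list with a marker after each sentence and paragraph.

    Sentence detection is naive on purpose: split after '.', '!' or '?'.
    """
    tokens = []
    for para in re.split(r"\n\s*\n", text.strip()):
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            words = re.findall(r"\w+", sent.lower())
            if words:
                tokens.extend(words)
                tokens.append(SENT)
        tokens.append(PARA)
    return tokens


def build_index(tokens):
    """Map each token (markers included) to its sorted list of positions."""
    index = {}
    for pos, tok in enumerate(tokens):
        index.setdefault(tok, []).append(pos)
    return index


def boundaries_between(index, marker, pos_a, pos_b):
    """Count how many `marker` tokens lie between two token positions."""
    lo, hi = sorted((pos_a, pos_b))
    marks = index.get(marker, [])
    return bisect_left(marks, hi) - bisect_left(marks, lo)


text = "Lucene indexes terms. Markers split sentences.\n\nNew paragraph here."
tokens = tokenize_with_markers(text)
index = build_index(tokens)

a = index["terms"][0]
b = index["markers"][0]
print(boundaries_between(index, SENT, a, b))  # sentence markers between the two terms
print(boundaries_between(index, PARA, a, b))  # paragraph markers between the two terms
```

A scorer could then treat a count of 0 as "same sentence", weight small counts only slightly lower, and ignore matches that cross a paragraph marker. Whether that is feasible inside a real index depends on how it stores positions, so take this only as a sketch of the idea.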
