Akmal Sarhan wrote:
nice and fast ;-)

would be interesting though to know how you implemented the "summarizer".

Basically, the algorithm is statistics-based. It selects a configurable number of sentences from the text (three in our case) and, if a sentence is too long, truncates it after a configurable number of characters (350 in our case).
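
To give you an idea, the selection step looks roughly like this in Java. This is only a minimal sketch; the class and method names are made up and it is not the actual code:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of the selection step: keep the top-scoring
// sentences and truncate any that exceed the length limit.
public class SummarySelector {

    private final int maxSentences; // e.g. 3
    private final int maxChars;     // e.g. 350

    public SummarySelector(int maxSentences, int maxChars) {
        this.maxSentences = maxSentences;
        this.maxChars = maxChars;
    }

    // Takes sentences already scored for relevance to the query.
    public List<String> select(List<ScoredSentence> scored) {
        scored.sort(Comparator.comparingDouble(
                (ScoredSentence s) -> s.score).reversed());
        List<String> summary = new ArrayList<>();
        for (int i = 0; i < Math.min(maxSentences, scored.size()); i++) {
            String text = scored.get(i).text;
            if (text.length() > maxChars) {
                text = text.substring(0, maxChars) + "...";
            }
            summary.add(text);
        }
        return summary;
    }

    public static class ScoredSentence {
        final String text;
        final double score;
        public ScoredSentence(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }
}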


So the sentences you read are real sentences from the text, which means I don't have to worry about syntax and can concentrate on semantics. The trick is to find the sentences that are especially relevant to the query. There are a number of heuristics that can be employed, such as the number of times the search terms appear in a sentence or their position within it.
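
A very rough sketch of such a heuristic (again invented names, not our actual scoring code) could look like this:

import java.util.Set;

// Hypothetical scoring heuristic: one point per query-term occurrence,
// plus a small bonus when the term appears early in the sentence.
public class SentenceScorer {
    public static double score(String sentence, Set<String> queryTerms) {
        String[] words = sentence.toLowerCase().split("\\W+");
        double score = 0.0;
        for (int i = 0; i < words.length; i++) {
            if (queryTerms.contains(words[i])) {
                score += 1.0;           // term frequency
                score += 1.0 / (i + 1); // position bonus, earlier is better
            }
        }
        return score;
    }
}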

Apart from that, I've found that quality increases significantly when the text is "cleaned up" before summarising. That involves identifying "real" sentences, so your summary doesn't start with "Home News Contact Sitemap" or some such. There's a lot of text on a web page that is not content and is therefore detrimental to the summariser (and I'm not even speaking about tags).
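
As an illustration of the kind of filter I mean (the rules and thresholds here are guesses for illustration, not what we actually use):

import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical clean-up filter: accept only candidates that look like
// real prose sentences, so the summary doesn't open with navigation
// text like "Home News Contact Sitemap".
public class SentenceFilter {

    private static final Pattern ENDS_LIKE_SENTENCE =
            Pattern.compile(".*[.!?]\\s*$");

    public static boolean looksLikeSentence(String candidate) {
        String s = candidate.trim();
        String[] words = s.split("\\s+");
        if (words.length < 4) {
            return false; // too short to be prose
        }
        if (!ENDS_LIKE_SENTENCE.matcher(s).matches()) {
            return false; // no sentence-final punctuation
        }
        // Navigation menus tend to capitalise every word.
        long capitalised = Arrays.stream(words)
                .filter(w -> Character.isUpperCase(w.charAt(0)))
                .count();
        return capitalised < words.length * 0.8;
    }
}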

Also, highly structured documents are a problem (for example, search for "dissolution" on our website). Headlines are usually very important and should score higher than regular sentences, but they are hard to identify as sentences, because they don't end with a full stop.
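
One conceivable work-around, purely as a sketch and not what we do: treat a short line without sentence-final punctuation that precedes a longer block of text as a headline and give it a score boost:

// Hypothetical heuristic for the headline problem: a short line without
// sentence-final punctuation, followed by a longer block of text, is
// probably a headline and could be given a score multiplier.
public class HeadlineDetector {
    public static boolean looksLikeHeadline(String line, String nextLine) {
        String s = line.trim();
        boolean isShort = !s.isEmpty() && s.length() < 80;
        boolean noFullStop = !s.matches(".*[.!?]$");
        boolean followedByProse = nextLine != null
                && nextLine.trim().length() > s.length();
        return isShort && noFullStop && followedByProse;
    }
}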

This "clean-up work" is actually trickier than the summarising itself and it is usually very domain-specific. That's the reason why I haven't proposed to contribute the summariser to Lucene, because the clean-up code is not generic. The summariser itself is just one class with 300 lines, but without prior clean-up the quality of its summaries is insufficient.

cheers,

Ulrich


