Bruno Sant'Anna wrote:
I know splitting the paragraph into sentences is not trivial but I sincerely think that this way is better than sending the full paragraph when we are dealing with more than one language.

Why not use the language attribute to decide which grammar checker should receive the text, and which span of the text? As I mentioned before, single words are not really important here, because they are not really in another language; the only reason to mark them up as being in another language is spell checking, which is not the same as grammar checking.

So what you could simply do is to implement the following behavior:

1. If the paragraph is in one language, send it to the grammar checker.
2. If the paragraph contains foreign chunks, send them to the appropriate grammar checker, if any, possibly setting the API flag "this_is_interspersed_with_another_language".
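
To make the two rules concrete, here is a rough sketch in Python. Everything in it is hypothetical: the Span type, the plan_dispatch name, and the idea that a paragraph arrives as a list of spans carrying their language attribute are assumptions for illustration, not an existing API. Only the flag meaning comes from the proposal above. Single-word spans are ignored when deciding whether the paragraph is multilingual, since they are only marked up for spell checking.

```python
from typing import List, NamedTuple, Set, Tuple


class Span(NamedTuple):
    """A run of paragraph text with its language attribute (hypothetical type)."""
    text: str
    lang: str


def plan_dispatch(spans: List[Span], available: Set[str]) -> List[Tuple[str, str, bool]]:
    """Return (lang, text, interspersed_flag) for each grammar checker call.

    `available` is the set of languages that have a grammar checker installed.
    """
    # Single words do not count as foreign chunks for grammar checking.
    chunks = [s for s in spans if len(s.text.split()) > 1]
    languages = {s.lang for s in chunks}

    if len(languages) == 1:
        # Rule 1: one language, so the whole paragraph goes to one checker.
        lang = next(iter(languages))
        whole = "".join(s.text for s in spans)
        return [(lang, whole, False)] if lang in available else []

    # Rule 2: send each chunk to its own checker, if any, with the
    # "this_is_interspersed_with_another_language" flag set.
    return [(s.lang, s.text, True) for s in chunks if s.lang in available]
```

With only a Polish checker installed, a Polish paragraph containing an English sentence would produce calls for the Polish chunks only, each with the flag set.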

This should also be quite fast.

An additional reason is that a grammar checker could *really* need information about paragraph length (in many languages, overly long paragraphs are considered bad writing style) and paragraph content (in many languages, rhymes in consecutive sentences should be avoided unless the text is poetry; in Polish, repeating the same word in several sentences in a row is considered very bad writing style). Grammatik for WordPerfect already detects paragraphs that are too short. I'm currently thinking about implementing a detector for the "do not repeat the same word" rule in Polish, and your proposed approach would make that impossible. So this is not theory; this is how real-world grammar checkers work.
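
As an illustration of why such a rule needs the whole paragraph, here is a minimal sketch of a repeated-word detector. It is not the planned implementation: the function names, the naive regex sentence splitter, and the thresholds (three sentences in a row, words of at least four letters) are all assumptions, and a real Polish rule would have to compare lemmas rather than surface forms because of inflection.

```python
import re
from typing import List, Tuple


def split_sentences(paragraph: str) -> List[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def repeated_words(paragraph: str, run: int = 3, min_len: int = 4) -> List[Tuple[str, int]]:
    """Return (word, index of first sentence) for each word that occurs in
    `run` consecutive sentences. Short words are skipped as a crude stand-in
    for a stop-word list."""
    word_sets = [
        {w.lower() for w in re.findall(r"\w+", sentence) if len(w) >= min_len}
        for sentence in split_sentences(paragraph)
    ]
    hits = []
    for i in range(len(word_sets) - run + 1):
        # Words shared by every sentence in the window i .. i + run - 1.
        common = set.intersection(*word_sets[i:i + run])
        hits.extend((word, i) for word in sorted(common))
    return hits
```

Given a full paragraph, the detector can see that one word recurs in three sentences in a row. Given one sentence at a time, it always returns an empty list, because there is never a window of consecutive sentences to compare.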

BTW, multilingual documents are really less common, believe me, try Google ;)

Regards,
Marcin
