Bruno Sant'Anna wrote:
I know splitting the
paragraph into sentences is not trivial but I sincerely think that this
way is better than sending the full paragraph when we are dealing with
more than one language.
Why not use the language attribute to decide which grammar checker
should receive the text, and which span of the text? As I mentioned before,
single words are not really important here, because they are not really
in another language; the only reason to mark them up as being in
another language is spell checking, which is not the same as grammar
checking.
So you could simply implement the following behavior:
1. If the paragraph is in one language, send it to the grammar checker.
2. If the paragraph contains foreign chunks, send them to the
appropriate grammar checker, if any, possibly setting the API flag
"this_is_interspersed_with_another_language".
This should also be quite fast.
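
To make the two steps above concrete, here is a rough Python sketch. Everything in it is hypothetical: Span, merge_spans, check_paragraph and the checker signature are names made up for illustration, not an existing API. It assumes the paragraph arrives as text runs carrying a language attribute, treats runs shorter than two words as part of the paragraph's main language (taken here to be the language with the most words), and passes the "interspersed" flag as a plain boolean.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Span:
    """A run of text with its language attribute (hypothetical type)."""
    text: str
    lang: str


# Hypothetical checker signature: (text, interspersed_flag) -> list of messages.
Checker = Callable[[str, bool], List[str]]


def merge_spans(spans: List[Span], min_words: int = 2) -> List[Span]:
    """Fold single-word foreign runs into the main language, then join neighbours."""
    if not spans:
        return []
    counts: Counter = Counter()
    for span in spans:
        counts[span.lang] += len(span.text.split())
    main_lang = counts.most_common(1)[0][0]

    merged: List[Span] = []
    for span in spans:
        # Runs below min_words are marked up for spell checking only.
        lang = span.lang if len(span.text.split()) >= min_words else main_lang
        if merged and merged[-1].lang == lang:
            merged[-1] = Span(merged[-1].text + span.text, lang)
        else:
            merged.append(Span(span.text, lang))
    return merged


def check_paragraph(spans: List[Span], checkers: Dict[str, Checker]) -> List[str]:
    """Step 1: one language, send the whole paragraph.
    Step 2: several languages, send each chunk to its checker, if any."""
    chunks = merge_spans(spans)
    interspersed = len(chunks) > 1
    messages: List[str] = []
    for chunk in chunks:
        checker = checkers.get(chunk.lang)
        if checker is not None:  # no checker for this language: skip the chunk
            messages.extend(checker(chunk.text, interspersed))
    return messages
```

Note that in the single-language case the checker still receives the complete paragraph, which is the whole point.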
An additional reason is that a grammar checker could *really* need
information about paragraph length (in many languages, overly long
paragraphs are considered bad writing style) and paragraph content (in
many languages, rhymes across consecutive sentences should be avoided
unless it's poetry; in Polish, repeating the same word in several
sentences in a row is considered very bad style). Grammatik for
WordPerfect already detects paragraphs which are too short. I'm
currently thinking about implementing a detector for the "do not repeat
the same word" rule in Polish; your proposed approach would make this
impossible. So this is not theory, this is how real-world grammar
checkers work.
BTW, multilingual documents are really not that common; believe me, or
try Google ;)
Regards,
Marcin