Dear Corpora List members

I am trying to build a small parallel corpus of English and (simplified)
Chinese, both for my research and for a conference presentation. I have a
technical problem, however: the two languages are mixed in one document, and
mixed in more than one way. I have tried the tips and tricks I found on the
internet, but none of them works properly. I suspect that a script in Perl or
a similar language would solve the problem, but unfortunately I do not have
the programming skills to write one. I would therefore be very grateful if
someone could either point me to an open-source application (if one exists)
or write me a script that separates the two languages neatly, so that the
parallel texts can easily be passed on for alignment. I have tried the
delimiter function in Excel, but it does not solve the problem, especially
when the languages are mixed in more than one way.


The English and the Chinese are mixed in three different ways within a
file:


1. The English is followed by the Chinese translation immediately without a
hard return as follows:

English English English English English English. 英文英文英文英文英文英文。


2. The English is followed by the Chinese translation with a hard return as
follows:

English English English English English English.

英文英文英文英文英文英文。


3. Sometimes the English is followed by the Chinese translation immediately
without a hard return (mainly short sentences) and other times the English
is followed by the Chinese translation with a hard return (mainly long
sentences).
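
To illustrate the kind of script being asked for, here is a minimal sketch
in Python (chosen for illustration; the message itself only mentions Perl).
It assumes that every line holds either English only, Chinese only, or
English followed by Chinese, and that Chinese can be recognised by the first
character in the CJK Unified Ideographs range (U+4E00 to U+9FFF). The names
CJK, split_line and separate are hypothetical. Under those assumptions all
three layouts listed above reduce to the same per-line split.

```python
import re

# Assumption: a "Chinese" segment starts at the first character in the
# CJK Unified Ideographs block (U+4E00-U+9FFF).
CJK = re.compile(r"[\u4e00-\u9fff]")


def split_line(line):
    """Split one line into (english, chinese) at the first CJK character."""
    match = CJK.search(line)
    if match is None:
        # No Chinese on this line: treat the whole line as English.
        return line.strip(), ""
    start = match.start()
    return line[:start].strip(), line[start:].strip()


def separate(text):
    """Return two lists: English segments and Chinese segments, in order."""
    english, chinese = [], []
    for line in text.splitlines():
        en, zh = split_line(line)
        if en:
            english.append(en)
        if zh:
            chinese.append(zh)
    return english, chinese


# Small sample covering layout 1 (same line) and layout 2 (hard return).
sample = (
    "English English. 英文英文。\n"
    "Long English sentence.\n"
    "\n"
    "长的英文句子。\n"
)
english, chinese = separate(sample)
```

The two lists could then be written to separate files, one segment per line,
before being passed to an aligner. The split is deliberately naive: Latin
words inside a Chinese sentence are kept with the Chinese, but a Chinese
sentence that opens with a Latin word, a digit or a quotation mark would be
split in the wrong place, so the output should be checked by hand.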


Please accept my thanks in advance.

Warm regards

(Fred) Xiaotian Guo
