Dear Xiaotian Guo,

Here is a small Python program that does what you need (file separate.py
attached to this email).

This program requires Python 3 and the regex module. After installing
Python 3, you can install the regex module with pip as follows:

    pip3 install regex

Then execute the program as follows (assuming your corpus is in a file
named corpus.txt):

    python3 separate.py corpus.txt out.txt


I have included corpus.txt and out.txt with the examples that you gave.

The output file (out.txt) is a tab-separated file, which you can open with
Excel or any other spreadsheet program.
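
If you would rather check the result from Python than from a spreadsheet,
a minimal sketch along these lines should work. The function name
read_pairs is made up for illustration, and it assumes that out.txt is
UTF-8 and has exactly two tab-separated fields per line, as separate.py
writes them:

```python
def read_pairs(path):
    """Return a list of (english, chinese) tuples from a tab-separated file."""
    pairs = []
    # Assumption: the file is UTF-8; change the encoding if yours differs.
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Skip anything that is not a clean English/Chinese pair.
            if len(fields) == 2:
                pairs.append((fields[0], fields[1]))
    return pairs
```

Calling read_pairs("out.txt") and printing the first few tuples is a quick
way to see whether the two languages ended up in the right columns.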

Cheers,

Luís Gomes





On Thu, Sep 22, 2016 at 10:52 AM, Xiaotian Guo <garlickf...@gmail.com>
wrote:

> Dear Corpora List members
>
>
> I am trying to build a small parallel corpus of English and (simplified)
> Chinese both for my research and a presentation for a conference. But I
> have a technical problem to solve now when I have the two languages mixed
> in one document and even mixed in different ways. I tried the tips and
> tricks on the internet but found that none of them worked properly. I feel a
> script of some programme like Perl or A language would solve the problem,
> but unfortunately I am not equipped with that advantage. So I would be very
> grateful if someone could do me a favour either by pointing me to an open
> source application programme (if there happens to be one somewhere) or
> writing me a script to separate the two languages neatly so that the
> parallel texts can be passed for alignment easily. I have tried the
> delimiter function of Excel but it won't solve the problem especially when
> the languages are mixed in more than one way.
>
>
> The English and the Chinese are mixed in three different ways in
> a file:
>
>
> 1. The English is followed by the Chinese translation immediately without
> a hard return as follows:
>
> English English English English English English. 英文英文英文英文英文英文。
>
>
> 2. The English is followed by the Chinese translation with a hard return
> as follows:
>
> English English English English English English.
>
> 英文英文英文英文英文英文。
>
>
> 3. Sometimes the English is followed by the Chinese translation
> immediately without a hard return (mainly short sentences) and other times
> the English is followed by the Chinese translation with a hard return
> (mainly long sentences).
>
>
> Please accept my thanks in advance.
>
> Warm regards
>
> (Fred) Xiaotian Guo
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora@uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
English English English English English English. 英文英文英文英文英文英文。

English English English English English English.
英文英文英文英文英文英文。
English English English English English English.	英文英文英文英文英文英文。
English English English English English English.	英文英文英文英文英文英文。
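import sys

import regex

# Expect exactly two arguments: the input corpus and the output file.
if len(sys.argv) != 3:
    sys.exit("usage: python3 separate.py corpus.txt out.txt")
in_fname, out_fname = sys.argv[1:]

# "en" is everything before the first Han character on the line;
# "zh" is the rest of the line, starting at that Han character.
reg = regex.compile(
    r'''^(?P<en>[^\p{Script=Hani}]*)(?P<zh>(?:\p{Script=Hani}.*)?)$'''
)
en, zh = [], []
# Assumption: the corpus is UTF-8. Change the encoding if it is not
# (for example "gb18030" for a file saved in a Chinese legacy encoding).
with open(in_fname, encoding="utf-8") as lines, \
        open(out_fname, "wt", encoding="utf-8") as f:
    for m in map(reg.match, map(str.strip, lines)):
        # Strip the space left between the English and the Chinese.
        matched_en = m.group("en").strip()
        if matched_en:
            en.append(matched_en)
        matched_zh = m.group("zh")
        if matched_zh:
            zh.append(matched_zh)
        # Once both sides are present, write one tab-separated pair.
        if en and zh:
            print(" ".join(en), " ".join(zh), sep="\t", file=f)
            en, zh = [], []
# Text left over here had no counterpart in the other language.
if en or zh:
    print("warning: unpaired text at end of input", file=sys.stderr)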
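
In case installing the regex module is a problem, the same idea can be
approximated with the standard re module. This is only a sketch under
assumptions: the two explicit character ranges stand in for
\p{Script=Hani} and miss rarer Han characters, and the name pair_lines and
the sample sentences are made up for illustration.

```python
import re

# Assumption: CJK Unified Ideographs (U+4E00-U+9FFF) plus Extension A
# (U+3400-U+4DBF) approximate \p{Script=Hani}; rarer Han characters
# fall outside these ranges.
HAN = r"\u3400-\u4dbf\u4e00-\u9fff"
LINE = re.compile(rf"^(?P<en>[^{HAN}]*)(?P<zh>(?:[{HAN}].*)?)$")


def pair_lines(lines):
    """Yield (english, chinese) pairs from lines that mix both languages."""
    en, zh = [], []
    for line in lines:
        m = LINE.match(line.strip())
        matched_en = m.group("en").strip()
        if matched_en:
            en.append(matched_en)
        if m.group("zh"):
            zh.append(m.group("zh"))
        # Emit a pair as soon as both languages have been seen.
        if en and zh:
            yield " ".join(en), " ".join(zh)
            en, zh = [], []


sample = [
    "English one. 英文一。",  # case 1: both languages on the same line
    "",
    "English two.",          # case 2: hard return between the languages
    "英文二。",
]
pairs = list(pair_lines(sample))
for en_text, zh_text in pairs:
    print(en_text, zh_text, sep="\t")
```

Like separate.py, this treats a line as Chinese from its first Han
character onwards, so any punctuation that comes before the first Han
character ends up on the English side.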