Hi Mathias, thank you for getting back to me. Let me give you an example
from a monolingual EN corpus:

*Acoustic* measurement precision and uncertainty.
Each press of the *Acoustic Output* – key decreases the transmission power
setting (TX) displayed in the monitor display.

In the first sentence, the word "Acoustic" should not be exported. In the
second sentence, "Acoustic Output" should be.
I have written a Java program that exports every term or group of terms
that starts with a capital letter, but this also picks up ordinary
sentence-initial words such as "Acoustic" in the first example, which it
should not.

The goal is to export only the proper names to a separate file.
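
To make the problem concrete, here is a minimal sketch of one possible
dictionary-free heuristic. It is an assumption for discussion, not the
program mentioned above: it collects runs of capitalized words only when
they occur in a non-initial position of a segment, so a sentence-initial
word such as "Acoustic" is ignored unless it appears capitalized
mid-sentence somewhere. The names CapitalizedTermSketch and
extractCandidates are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapitalizedTermSketch {

    // A run of one or more capitalized words separated by single spaces,
    // e.g. "Acoustic Output". The lookbehind keeps a match from starting
    // in the middle of a word.
    private static final Pattern CAP_RUN = Pattern.compile(
            "(?<![\\p{L}\\p{Nd}])\\p{Lu}[\\p{L}\\p{Nd}]*(?: \\p{Lu}[\\p{L}\\p{Nd}]*)*");

    // Assumed heuristic: keep only capitalized runs that do NOT start at
    // the first letter of the segment, because capitalization there says
    // nothing about whether the word is a proper name.
    static Set<String> extractCandidates(List<String> segments) {
        Set<String> candidates = new LinkedHashSet<>();
        for (String segment : segments) {
            int firstLetter = indexOfFirstLetter(segment);
            Matcher m = CAP_RUN.matcher(segment);
            while (m.find()) {
                if (m.start() != firstLetter) {
                    candidates.add(m.group());
                }
            }
        }
        return candidates;
    }

    // Index of the first letter in the segment, or -1 if there is none.
    private static int indexOfFirstLetter(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "Acoustic measurement precision and uncertainty.",
                "Each press of the Acoustic Output \u2013 key decreases the "
                        + "transmission power setting (TX) displayed in the monitor display.");
        // Prints: [Acoustic Output, TX]
        System.out.println(extractCandidates(corpus));
    }
}
```

The obvious limitation is that a product name occurring only at the
start of segments would be missed, and abbreviations such as "TX" are
exported as well, so the resulting set would still need review before it
is written to the separate file.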

Best regards
Mariusz



2017-07-04 10:02 GMT+02:00 Mathias Müller <[email protected]>:

> Hi Mariusz
>
> What do you mean by “extracting” this content? What do you need the list
> of proper names for? What are the languages involved?
>
> Regards,
> Mathias
>
> —
>
> Mathias Müller
> AND-2-20
> Institute of Computational Linguistics
> University of Zurich
> Switzerland
> +41 44 635 75 81
> [email protected]
>
> On 4 Jul 2017, at 09:39, Mariusz Hawryłkiewicz <
> [email protected]> wrote:
>
> Dear all,
>
> I have been searching for the most efficient way to extract untranslatable
> content (product names etc.) from corpora. Such content always begins with
> a capital letter. The problem is that every segment also begins with a
> capital letter and, of course, a sentence may itself begin with
> untranslatable content (a product name) :-).
>
> I want to avoid using common dictionaries to eliminate common words.
>
> Would you have any other suggestions?
>
> Thank you very much!
> Mariusz
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
>
