Hi Mathias, thank you for getting back to me. Let me give you an example from a monolingual EN corpus:

*Acoustic* measurement precision and uncertainty.

Each press of the *Acoustic Output* – key decreases the transmission power setting (TX) displayed in the monitor display.

In the first sentence, the word "Acoustic" should not be exported. In the second sentence, "Acoustic Output" should be.

I have written a Java program that exports every term or group of terms that starts with a capital letter, but this also picks up words like "Acoustic" in the first example, which it should not. The goal is to export only the proper names to a separate file.

Best regards
Mariusz

2017-07-04 10:02 GMT+02:00 Mathias Müller <[email protected]>:

> Hi Mariusz
>
> What do you mean by “extracting” this content? What do you need the list of proper names for? What are the languages involved?
>
> Regards,
> Mathias
>
> --
>
> Mathias Müller
> AND-2-20
> Institute of Computational Linguistics
> University of Zurich
> Switzerland
> +41 44 635 75 81
> [email protected]
>
> On 4 Jul 2017, at 09:39, Mariusz Hawryłkiewicz <[email protected]> wrote:
>
> Dear all,
>
> I have been searching for the most efficient way to extract untranslatable content from the corpora that always begin from the capital letter (product names etc.), the problem is that all the segments begin with the capital letter and what's obvious, the sentence may also begin with the untranslatable content (product name) :-).
>
> I want to avoid using common dictionaries to eliminate common words.
>
> Would you have any other suggestions?
>
> Thank you very much!
> Mariusz
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
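
For illustration, the capitalization heuristic described above can be sketched roughly as follows. This is a sketch under assumptions, not the actual program: the class name `CapitalizedTermSketch`, both method names, and the regular expression are hypothetical. It shows why a plain "starts with a capital letter" rule also exports the segment-initial word, and what a naive "skip the first position" filter would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and regex are illustrative assumptions.
class CapitalizedTermSketch {

    // One or more consecutive words, each starting with an uppercase letter.
    private static final Pattern CAPITALIZED_RUN = Pattern.compile(
            "\\b\\p{Lu}[\\p{L}\\p{Nd}-]*(?:\\s+\\p{Lu}[\\p{L}\\p{Nd}-]*)*");

    /** Returns every capitalized run in the segment, in order of appearance. */
    static List<String> extractAll(String segment) {
        List<String> terms = new ArrayList<>();
        Matcher m = CAPITALIZED_RUN.matcher(segment);
        while (m.find()) {
            terms.add(m.group());
        }
        return terms;
    }

    /** Same as extractAll, but skips a run that starts at the very beginning of the segment. */
    static List<String> extractNonInitial(String segment) {
        List<String> terms = new ArrayList<>();
        Matcher m = CAPITALIZED_RUN.matcher(segment);
        while (m.find()) {
            if (m.start() > 0) { // position 0 is capitalized in every segment
                terms.add(m.group());
            }
        }
        return terms;
    }
}
```

On the two example sentences, `extractAll` returns `Acoustic` for the first and `Each`, `Acoustic Output`, `TX` for the second. `extractNonInitial` returns nothing for the first and `Acoustic Output`, `TX` for the second. Dropping segment-initial runs would also drop a product name that opens a segment, which is exactly the open problem.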
