On Nov 12, 2005, at 6:10 AM, José Castro wrote:
Hi, all.
You know the drill O:-)
Here's what the module will do:
- you give it a set of texts (usually two, but can be more)
- without knowing the language of each text, the module tells you how
likely it is that those texts are translations of each other
This is achieved by looking at a number of things, from the length of the
text to the punctuation used. I already have a bunch of code, but only in
the form of a script, so I'd like a name under which to rewrite it and
release it as a module (it's also part of a bigger project).
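To make the idea above concrete, here is a minimal sketch of a language-independent score built from the only two cues the message names, text length and punctuation. It is written in Python purely for illustration and is not the author's script; every function name, the equal weighting of the two cues, and the averaging over pairs are assumptions.

```python
import string
from itertools import combinations


def punctuation_profile(text):
    """Relative frequency of each ASCII punctuation mark (hypothetical helper)."""
    counts = {ch: 0 for ch in string.punctuation}
    for ch in text:
        if ch in counts:
            counts[ch] += 1
    total = sum(counts.values())
    if total == 0:
        return counts  # no punctuation at all: every frequency stays 0
    return {ch: n / total for ch, n in counts.items()}


def length_score(a, b):
    """Ratio of the shorter length to the longer one, in [0, 1]."""
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))


def punctuation_score(a, b):
    """1 minus half the L1 distance between the two profiles, in [0, 1]."""
    pa, pb = punctuation_profile(a), punctuation_profile(b)
    distance = sum(abs(pa[ch] - pb[ch]) for ch in pa)
    return 1.0 - distance / 2.0


def parallel_score(a, b):
    """Assumed combination: plain average of the two cues."""
    return (length_score(a, b) + punctuation_score(a, b)) / 2.0


def parallel_score_many(texts):
    """For more than two texts, average the score over all pairs (an assumption)."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        raise ValueError("need at least two texts")
    return sum(parallel_score(a, b) for a, b in pairs) / len(pairs)
```

A score near 1 means the texts look alike on these two surface cues only; a real implementation would presumably weigh more features than this sketch does.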
As you probably know, texts like this are generally called "parallel
texts" or "parallel corpora". So I would definitely put "parallel"
somewhere in the name. "Translation" in the name would be informative,
but misleading, since the code won't actually do any translating.
As David pointed out, Lingua:: is generally the proper top-level home
for stuff like this.
So I'd think something like:
Lingua::ParallelDetection
or similar.
Another option, which sometimes happens in the Computational
Linguistics world, is to publish something about your algorithm,
describe it academically in sort of vaguely pseudocode-esque terms, and
give it a "clever" name like TRANS or ParaTexT. If enough people become
familiar with your algorithm (as with the Brill tagger, the Collins
parser, or the Porter stemmer), then you can just release it under that
name in Lingua::.
-Ken