[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805027#action_12805027 ]
Karl Wettin commented on LUCENE-826:
------------------------------------

Hi Ken,

it's hard for me to compare. I'll rant a bit about my experience with language detection, though. I still haven't found a single strategy that works well on any text: a user query, a sentence, a paragraph, or a complete document. 1-5 grams with an SVM or NB works pretty well on all of them, but you really need to train it with the same sort of data you want to classify. Even when training on a mix of text lengths, it tends to perform a lot worse than if you had one classifier for each data type. And you still probably want to twiddle the classifier knobs to make it work well with the data you are training on and classifying. In some cases I've used 1-10 grams, other times 2-4 grams. Sometimes I've used an SVM, other times a simple decision tree. To sum up: to achieve good quality I've always had to build a classifier for the specific use case. Weka has a great test suite for figuring out what to use. Set it up, press play, and return a week later to find out what to use.

> Language detector
> -----------------
>
>                 Key: LUCENE-826
>                 URL: https://issues.apache.org/jira/browse/LUCENE-826
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>         Attachments: ld.tar.gz, ld.tar.gz
>
>
> A formula 1A token/ngram-based language detector. Requires a paragraph of
> text to avoid false positive classifications.
> Depends on contrib/analyzers/ngrams for tokenization, Weka for classification
> (logistic support vector models), feature selection, and normalization of token
> frequencies. Optionally Wikipedia and NekoHTML for training data harvesting.
> Initialized like this:
> {code}
> LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
> root.addBranch("uralic");
> root.addBranch("fino-ugric", "uralic");
> root.addBranch("ugric", "uralic");
> root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
> root.addBranch("proto-indo european");
> root.addBranch("germanic", "proto-indo european");
> root.addBranch("northern germanic", "germanic");
> root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
> root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
> root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
> root.addBranch("west germanic", "germanic");
> root.addLanguage("west germanic", "eng", "english", "en", "UK");
> root.mkdirs();
>
> LanguageClassifier classifier = new LanguageClassifier(root);
> if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>   classifier.compileTrainingData(); // from wikipedia
> }
> classifier.buildClassifier();
> {code}
>
> The training set built from Wikipedia consists of the pages describing the home country of
> each registered language, written in the language to train. The example above passes this
> test (testEquals is the same as assertEquals, just not required; only one of them
> fails, see comment):
> {code}
> assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
> testEquals("swe", classifier.classify(norway_in_swedish).getISO());
> testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
> testEquals("swe", classifier.classify(finland_in_swedish).getISO());
> testEquals("swe", classifier.classify(uk_in_swedish).getISO());
>
> testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
> assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
> testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
> testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
> testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
>
> testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
> testEquals("fin", classifier.classify(norway_in_finnish).getISO());
> testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
> assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
> testEquals("fin", classifier.classify(uk_in_finnish).getISO());
>
> testEquals("dan", classifier.classify(sweden_in_danish).getISO());
> // it is ok that this fails: dan and nor are very similar, and the
> // document about norway in danish is very small.
> testEquals("dan", classifier.classify(norway_in_danish).getISO());
> assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
> testEquals("dan", classifier.classify(finland_in_danish).getISO());
> testEquals("dan", classifier.classify(uk_in_danish).getISO());
>
> testEquals("eng", classifier.classify(sweden_in_english).getISO());
> testEquals("eng", classifier.classify(norway_in_english).getISO());
> testEquals("eng", classifier.classify(denmark_in_english).getISO());
> testEquals("eng", classifier.classify(finland_in_english).getISO());
> assertEquals("eng", classifier.classify(uk_in_english).getISO());
> {code}
>
> I don't know how well it works on lots of languages, but this fits my needs
> for now.
> I'll try to do more work on considering the language trees when
> classifying.
> It takes a bit of time and RAM to build the training data, so the patch
> contains a pre-compiled arff-file.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
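[For readers unfamiliar with the character n-gram approach discussed in the comment above: the core idea can be sketched as a minimal self-contained Naive Bayes over character trigrams. This is a hypothetical illustration only, not the attached Weka-based implementation; the class and method names (NGramLanguageGuesser, train, classify) are made up for the sketch.]

{code}
import java.util.*;

/** Minimal character-trigram Naive Bayes language guesser.
 *  A toy sketch of the n-gram + NB strategy described above. */
public class NGramLanguageGuesser {
    private static final int N = 3;
    // language -> (ngram -> count)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // language -> total ngram count
    private final Map<String, Integer> totals = new HashMap<>();

    // Extract overlapping character n-grams, padded so word edges count too.
    private List<String> ngrams(String text) {
        String s = " " + text.toLowerCase() + " ";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + N <= s.length(); i++) out.add(s.substring(i, i + N));
        return out;
    }

    public void train(String language, String text) {
        Map<String, Integer> m = counts.computeIfAbsent(language, k -> new HashMap<>());
        for (String g : ngrams(text)) {
            m.merge(g, 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
        }
    }

    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            Map<String, Integer> m = e.getValue();
            double total = totals.get(e.getKey());
            double score = 0;
            for (String g : ngrams(text)) {
                // add-one smoothing so unseen n-grams don't zero out a language
                double p = (m.getOrDefault(g, 0) + 1.0) / (total + m.size() + 1.0);
                score += Math.log(p);
            }
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramLanguageGuesser g = new NGramLanguageGuesser();
        // toy training data; real use needs much more text per language
        g.train("swe", "Sverige är ett vackert land i norra Europa med många sjöar och djupa skogar");
        g.train("eng", "England is a green and pleasant country with many old towns and long rivers");
        System.out.println(g.classify("ett land i norra Europa"));
    }
}
{code}

As the comment notes, real quality depends heavily on matching the training text to the input type (query vs. paragraph vs. document) and on tuning the n-gram range per use case; this sketch fixes N=3 purely for brevity.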