> Would you tell me where i can get help document on How to use NGramProfile
> to train the language identifier and how to detect it.

Unfortunately, there is no help document. Here is how to use NGramProfile:

  java org.apache.nutch.analysis.lang.NGramProfile -create <profile-name> <filename> <encoding>

Where:
* profile-name is the ISO-639 language code (en, fr, de, ...) of the
  language profile you want to create (mr for Marathi).
* filename is the name of the file you want to use to create the profile.
* encoding is the encoding of the file named filename.

Once your profile is created, the detection part is done. Just add the
languageidentifier plugin to your Nutch conf. Perform a crawl, and if all
is working fine you should see a trace with something like:

  Analysis .... with analyzer ..... (language-code)

Since you don't provide a specific analyzer associated with your new
language code (mr), the default NutchAnalyzer will be used.

Then create an Analyzer for Marathi by creating a new plugin (see for
instance the analysis-de or analysis-fr plugins provided in the Nutch
source). Here is what your plugin must provide:
* An analyzer extension that implements the
  org.apache.nutch.analysis.NutchAnalyzer interface.
* A plugin.xml descriptor that declares the association between your
  analyzer and the language it should be used for. Something like:

  <implementation id="org.apache.nutch.analysis.mr.MarathiAnalyzer"
                  class="org.apache.nutch.analysis.mr.MarathiAnalyzer"
                  lang="mr"/>

Once this plugin is finished, just add it to the list of activated plugins
in your configuration. The next time you perform a crawl, this analyzer
will be used for documents identified as Marathi.

> Will it be OK if i use Stop Analyzer instead of NutchDocumentAnalyzer with
> my custom stopwords?

It's a first step toward a language-specific analyzer.

> where i have to make changes in Nutch code?

As you can see, there are no changes to make in the Nutch code itself. You
just provide some additional code that plugs into Nutch.

If you can give us feedback on integrating Marathi in Nutch, it would be
much appreciated.

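For what it's worth, here is a rough, self-contained sketch of the general idea behind a character n-gram language profile, in case it helps to see what "creating a profile" and "detecting" mean. This is not the Nutch NGramProfile code: the class name NGramSketch, its methods, the constants MAX_LEN and TOP_K, and the sample strings are all made up for illustration, and the ranking and "out-of-place" distance used here are a generic scheme that may well differ from what Nutch actually computes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a generic n-gram profile, NOT Nutch's NGramProfile.
public class NGramSketch {

    static final int MAX_LEN = 3;   // assumed longest n-gram length
    static final int TOP_K = 300;   // assumed number of n-grams kept per profile

    // Build a profile: character n-grams of length 1..maxLen,
    // ranked by descending frequency (ties broken alphabetically).
    static List<String> profile(String text, int maxLen, int topK) {
        Map<String, Integer> counts = new HashMap<>();
        // Mark word boundaries with '_' so they show up in the n-grams.
        String s = "_" + text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", "_") + "_";
        for (int n = 1; n <= maxLen; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int c = Integer.compare(counts.get(b), counts.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(topK, ranked.size())));
    }

    // Out-of-place distance: sum of rank differences between the two profiles.
    // An n-gram absent from the language profile costs the full profile size.
    static int distance(List<String> doc, List<String> lang) {
        int d = 0;
        for (int i = 0; i < doc.size(); i++) {
            int j = lang.indexOf(doc.get(i));
            d += (j < 0) ? lang.size() : Math.abs(i - j);
        }
        return d;
    }

    // Return the language code whose profile is closest to the text.
    static String detect(String text, Map<String, List<String>> langProfiles) {
        List<String> doc = profile(text, MAX_LEN, TOP_K);
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : langProfiles.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDist) {
                bestDist = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Tiny made-up training samples; a real profile needs a large corpus.
        Map<String, List<String>> profiles = new HashMap<>();
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog", MAX_LEN, TOP_K));
        profiles.put("fr", profile("portez ce vieux whisky au juge blond qui fume", MAX_LEN, TOP_K));
        System.out.println(detect("the dog jumps", profiles));
    }
}
```

With samples this small the result is not reliable; the point is only to show the two steps, building a ranked profile per language code and then picking the closest profile for a document.
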
Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
