Hi Animesh,

My wild guess is that the N-gram profile for Chinese wasn't trained very well. Try recreating the Chinese language profile.
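
For context, an N-gram profile is essentially a table of how often each short character sequence occurs in some training text. Below is a minimal Java sketch of that idea only. The class and method names are made up for illustration, and this is not Tika's own profile builder or profile file format.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical helper, not part of Apache Tika.
public class NGramProfileSketch {

    // Counts overlapping character n-grams in the text.
    // Works on code points so characters outside the BMP are not split.
    static Map<String, Integer> countNGrams(String text, int n) {
        int[] codePoints = text.codePoints().toArray();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= codePoints.length; i++) {
            String gram = new String(codePoints, i, n);
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Bigram counts for a short Chinese sample string.
        Map<String, Integer> profile = countNGrams("中文文本中文", 2);
        System.out.println(profile);
    }
}
```

A profile built from only a little training text will have sparse or skewed counts, which is one way detection for a language could end up unreliable.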
Have a look here:
http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/section6.html

Hope it helps.

On Sat, Oct 26, 2013 at 8:48 PM, Chris Mattmann <[email protected]> wrote:
> Hi Animesh,
>
> Please detail your issue here on [email protected] and I'm sure
> someone can help.
>
> Cheers,
> Chris
>
> -----Original Message-----
> From: Animesh Kumar <[email protected]>
> Date: Wednesday, October 23, 2013 9:15 PM
> To: "[email protected]" <[email protected]>
> Subject: Fwd: Having Problem in Word Count and Language Detaction
>
> >Sir/Mam,
> >I am developing a web based software which use Apache Tika for getting
> >Language and words Count of Uploaded file. Its working fine for English,
> >Japanese , Hindi etc but giving wrong words count for Chinese. I am using
> >tika-app-1.4.jar .
> >and there is an another problem in word counting of file format different
> >from doc and docx
> >
> >--
> >With Thanks & Regards
> >Animesh Kumar
> >+918927992397
