tika-user  

Parsing Romanian texts

Stefan-Alexandru Mirica
Wed, 17 Mar 2010 12:18:16 -0700

Hello Group,

I'm using Tika 0.6 together with Lucene 3.0.1 to develop an application
for school. Its purpose is basically to parse some MS Word and PDF
documents written in Romanian, then index and search them. For the
indexing and searching part of the project, I use the RomanianAnalyzer
from LUCENE-424 (http://issues.apache.org/jira/browse/LUCENE-424).
However, parsing the text with Tika is where the problem appears. For
example, the text

"Mai târziu în cursul după-amiezei, Susan stătea tristă în cada de baie."

is read by the parser as

"Mai târziu în cursul dup?-amiezei, Susan st?tea trist? în cada de baie."

The content appears this way both when it is written to standard output
and when it is written to a file. The "problem letters" are ă, ş and ţ;
â and î come through correctly.
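
The pattern (â and î survive, ă is replaced by "?") is what one would
see if the text were encoded at some point with a charset that lacks the
Romanian letters. The following is only a minimal sketch of that effect,
assuming ISO-8859-1 as a stand-in for whatever charset is really
involved; I have not confirmed that this is what happens in my setup.
EncodingCheck and roundTrip are hypothetical names and are not part of
Tika or Lucene.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of Tika or Lucene.
class EncodingCheck {

    // Encodes the text to bytes with the given charset and decodes it back.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for every character the charset cannot represent.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String original = "Mai t\u00e2rziu \u00een cursul dup\u0103-amiezei, "
                + "Susan st\u0103tea trist\u0103 \u00een cada de baie.";

        // Assumption: ISO-8859-1 stands in for a charset without the letter ă.
        String latin1 = roundTrip(original, Charset.forName("ISO-8859-1"));
        String utf8 = roundTrip(original, Charset.forName("UTF-8"));

        System.out.println("ISO-8859-1 keeps the text intact: " + original.equals(latin1));
        System.out.println("UTF-8 keeps the text intact: " + original.equals(utf8));
    }
}
```

If a round trip like this reproduces the question marks, the characters
are probably lost when the text is encoded for output rather than when
the document is parsed.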

From what I understand (please correct me if I'm wrong), I can work
around this problem by using the ParseContext parameter of the parse()
method. However, I have no idea how to tell it to handle the special
Romanian characters.

I should mention that I'm using Windows 7 32-bit and that my system
locale (Control Panel -> Region and Language) is set to Romanian.
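
Since the system locale influences the JVM's default charset, one check
is to print that default and to write the text with an explicit UTF-8
encoding instead of relying on it. This is a sketch using only the
standard Java library; Utf8Output and writeUtf8 are hypothetical names,
and the string literal stands in for text that would come from the
parser.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of Tika or Lucene.
class Utf8Output {

    // Writes the text to the stream as UTF-8, regardless of the JVM default
    // charset. The caller remains responsible for closing the stream.
    static void writeUtf8(OutputStream out, String text) throws IOException {
        Writer writer = new OutputStreamWriter(out, Charset.forName("UTF-8"));
        writer.write(text);
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // The default charset is what Java uses when no encoding is specified.
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // Stand-in for parsed text; the escape \u0103 is the letter ă.
        String text = "dup\u0103-amiezei";

        File file = File.createTempFile("romanian", ".txt");
        try (OutputStream out = new FileOutputStream(file)) {
            writeUtf8(out, text);
        }
        System.out.println("Wrote UTF-8 text to " + file);
    }
}
```

The resulting file has to be opened as UTF-8 in the editor as well,
otherwise the letters may still look wrong even though the bytes are
correct.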

Thanks for the help!
Alex.                                     