Stefan-Alexandru Mirica
Wed, 17 Mar 2010 12:18:16 -0700
Hello Group,

I'm using Tika 0.6 together with Lucene 3.0.1 to develop an application for school. Its purpose is, basically, parsing some MS Word and PDF documents written in Romanian, then indexing and searching them. For the indexing and searching part of the project, I use the RomanianAnalyzer from LUCENE-424 (http://issues.apache.org/jira/browse/LUCENE-424).

However, parsing the text with Tika is where the problem appears. For example, the text

"Mai târziu în cursul după-amiezei, Susan stătea tristă în cada de baie."

is read by the parser as

"Mai târziu în cursul dup?-amiezei, Susan st?tea trist? în cada de baie."

The content appears this way both when writing it to the standard output and when writing it to a file. The so-called "problem letters" are ă, ş and ţ.

From what I understood (please correct me if I'm wrong), I can bypass this problem by using the ParseContext context parameter of the parse() method. However, I have no idea how to tell it to parse the special Romanian characters.

I should mention that I'm using Windows 7 32-bit as an OS and my system locale (from Control Panel -> Region and Language) is set to Romanian.

Thanks for the help!

Alex.
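
For reference, the symptom described above can be reproduced without Tika, using only the JDK. This is a minimal sketch, not a diagnosis: it assumes (unconfirmed) that somewhere between the parser and the output the text passes through a Latin-1 style encoding, which contains â and î but not ă, ş or ţ. The class name EncodingCheck and the helper roundTrip are made up for this illustration. The sample text is written with Unicode escapes so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingCheck {

    // Encode the text to bytes in the given charset, then decode it again.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for characters it cannot represent; for ISO-8859-1 that byte is '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The sample sentence from the message, with the non-ASCII letters escaped:
        // \u00e2 = a-circumflex, \u00ee = i-circumflex, \u0103 = a-breve.
        String original = "Mai t\u00e2rziu \u00een cursul dup\u0103-amiezei, "
                + "Susan st\u0103tea trist\u0103 \u00een cada de baie.";

        String viaLatin1 = roundTrip(original, StandardCharsets.ISO_8859_1);
        String viaUtf8 = roundTrip(original, StandardCharsets.UTF_8);

        // Compare strings instead of printing them, since the console
        // encoding could itself distort the output.
        System.out.println("Latin-1 preserves text: " + viaLatin1.equals(original));
        System.out.println("UTF-8 preserves text:   " + viaUtf8.equals(original));
        System.out.println("a-breve became '?':     " + (viaLatin1.indexOf('?') >= 0));
    }
}
```

If the assumption holds, writing the extracted text through a Writer with an explicit UTF-8 charset (rather than the platform default) would be one thing to try, but whether that applies here depends on where the characters are actually lost.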