The textmining library (textmining.org) for Word docs should work fine with non-english text as well. Let me know if it doesn't
On 8/2/07, Ben Litchfield <[EMAIL PROTECTED]> wrote: > In terms of PDF documents... > > PDFBox should work just fine with any latin based languages; at this > time certain PDFs that have CJK characters can pose some issues. In > general english/french/spanish should be fine. > > Some PDFs use custom encodings that make it impossible to extract text > and it comes out as gibberish. As a simple test if Acrobat can > extract the text then PDFBox should be able to as well. > > Ben > > > Quoting Grant Ingersoll <[EMAIL PROTECTED]>: > > > Hey Michael, > > > > Have you given it a try? I would think they would work, but haven't > > actually done it. Setup a small test that reads in a PDF in French or > > Spanish and give it a try. You might have to worry about encodings or > > something, but the structure of the files should be the same, i.e. they > > are valid Word, etc. documents. > > > > -Grant > > > > On Aug 2, 2007, at 8:59 AM, Michael J. Prichard wrote: > > > >> Yea, I have seen those. I guess the question is what do you all > >> use to extract text from Word, Excel, PPT and PDF? Can I use POI, > >> PDFBox and so on? This is what I use now to extract english. > >> > >> Thanks, > >> Michael > >> > >> testn wrote: > >>> If you can extract token stream from those files already, you can > >>> simply use > >>> different analyzers to analyze those token stream appropriately. Check out > >>> Lucen-contrib analyzers at > >>> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/ > >>> > >>> > >>> > >>> heybluez wrote: > >>> > >>>> I know how to do english text with POI and PDFBox and so on. Now, I want > >>>> to start indexing non-english language such as french and spanish. Which > >>>> extraction libs are available for me? > >>>> > >>>> I want to do: > >>>> > >>>> Excel > >>>> Word > >>>> PowerPoint > >>>> PDF > >>>> HTML > >>>> RTF > >>>> > >>>> Thanks! > >>>> Michael > >>>> > >>>> --------------------------------------------------------------------- > >>>> To unsubscribe, e-mail: [EMAIL PROTECTED] > >>>> For additional commands, e-mail: [EMAIL PROTECTED] > >>>> > >>>> > >>>> > >>>> > >>> > >>> > >> > > > > -------------------------- > > Grant Ingersoll > > http://lucene.grantingersoll.com > > > > Lucene Helpful Hints: > > http://wiki.apache.org/lucene-java/BasicsOfPerformance > > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]