Hi Dimitris,

Can you elaborate on "changing the system locale to UTF-8"?
When I view the extracted-text bitstream in a browser, I cannot see special 
characters (e.g. Polish ones), but search seems to work fine. I was wondering 
whether that is expected, or whether there is a way to set up text extraction 
so that it is consistent with UTF-8 encoding. I use DSpace 6.3.
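
One way to narrow this down might be to check whether the bytes of the extracted-text bitstream are valid UTF-8 at all. If they are, the browser is probably just guessing the wrong charset when displaying the file. The sketch below is only an illustration: the class and method names are made up for this example and are not part of DSpace, and the sample bytes stand in for a downloaded bitstream.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a DSpace class.
class Utf8Check {
    // Returns true only if the bytes decode as UTF-8 without malformed input.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Polish letters (z with dot, o acute, l stroke, c acute) encoded as UTF-8.
        byte[] utf8 = "za\u017c\u00f3\u0142\u0107".getBytes(StandardCharsets.UTF_8);
        // The same first letters as a legacy single-byte encoding would store them:
        // 0xBF is the ISO-8859-2 byte for the z with dot, and is not valid UTF-8 here.
        byte[] legacy = {'z', 'a', (byte) 0xBF};

        System.out.println("UTF-8 sample valid:  " + isValidUtf8(utf8));
        System.out.println("legacy sample valid: " + isValidUtf8(legacy));
    }
}
```

In practice the byte array would come from the downloaded text file (for example via `java.nio.file.Files.readAllBytes`) rather than from a literal.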

Best, Peter

On Monday, August 24, 2015 at 19:14:16 UTC+1, Dimitrios 
A. Koutsomitropoulos wrote:
>
>
>
> It was simply a matter of changing the system locale to UTF-8. Of course, 
> the verbose output is still not readable (no ??? this time, though), but 
> searching works OK. 
>
> Many thanks, 
>
> Dimitris 
>
> > -----Original Message----- 
> > From: Stuve, David H [mailto:david...@hp.com] 
> > Sent: Wednesday, November 17, 2004 9:22 PM 
> > To: Dimitrios A. Koutsomitropoulos; dspac...@lists.sourceforge.net 
> > Subject: RE: [Dspace-tech] Media Filter does not work with 
> > non-english documents (?) 
> > 
> > Hi Dimitrios, 
> > 
> > Have you tried searching for Greek words that should be 
> > extracted and can't find them?  It is possible that the text 
> > extraction is working properly and the -verbose flag just 
> > isn't printing the extracted text correctly.  However, if 
> > search isn't finding the terms that should be there, then 
> > there is probably a bug in the encoding used by the filtering code. 
> > 
> > Dave 
> > 
> > -----Original Message----- 
> > From: dspace-t...@lists.sourceforge.net 
> > [mailto:dspace-t...@lists.sourceforge.net] On Behalf Of 
> > Dimitrios A. Koutsomitropoulos 
> > Sent: Monday, November 15, 2004 11:40 AM 
> > To: dspac...@lists.sourceforge.net 
> > Subject: [Dspace-tech] Media Filter does not work with 
> > non-english documents (?) 
> > 
> > 
> > 
> > 
> > Hello, 
> > 
> > I notice that the media filter facility, and particularly the 
> > PDF and MS Word filtering, does not work well with Greek documents. 
> > 
> > When executing filter-media in verbose mode, I get a series of 
> > question marks (????), while English text shows correctly. 
> > 
> > I've tried running MediaFilterManager with the 
> > -Dfile.encoding=UTF-8 parameter, but still no change. 
> > 
> > I also changed the PDFFilter.java method getDestinationStream to: 
> > 
> > 
> > public InputStream getDestinationStream(InputStream source) 
> >         throws Exception 
> >     { 
> >         // get input stream from bitstream 
> >         // pass to filter, get string back 
> >         PDFTextStripper pts = new PDFTextStripper(); 
> >         PDFParser parser = new PDFParser(source); 
> > 
> >         parser.parse(); 
> > 
> >         COSDocument cos = parser.getDocument(); 
> > 
> >         // Java strings are UTF-16 internally, so re-encoding the 
> >         // string to UTF-8 and back would be a no-op; the encoding 
> >         // only matters when the text is converted to bytes below 
> >         String extractedText = pts.getText(cos); 
> > 
> >         // now close the pdf 
> >         cos.close(); 
> > 
> >         // if verbose flag is set, print out extracted text 
> >         // to STDOUT (System.out uses the platform default encoding) 
> >         if (MediaFilterManager.isVerbose) 
> >         { 
> >             System.out.println(extractedText); 
> >         } 
> > 
> >         // generate an input stream with the extracted text, 
> >         // explicitly encoded as UTF-8 
> >         byte[] textBytes = extractedText.getBytes("UTF-8"); 
> >         ByteArrayInputStream bais = new ByteArrayInputStream(textBytes); 
> > 
> >         return bais;  // will this work? or will the byte array be out of scope? 
> >     } 
> > 
> > But no luck. 
> > 
> > 
> > Is this the expected behavior or is there a workaround? 
> > 
> > 
> > Many thanks, 
> > 
> > 
> > Dimitrios A. Koutsomitropoulos, M.Sc. 
> > 
> > Computer & Informatics Engineer 
> > Postgraduate Researcher 
> > High Performance Information Systems Laboratory 
> > 
> >  Contact 
> >  e-mail: kots...@hpclab.ceid.upatras.gr 
> >  work:  +30 2610 993805 
> >  fax:    +30 2610 997706 
> >  http://www.hpclab.ceid.upatras.gr 
> > 
> > 
> > 
> > 
> > 
> > 
>
>
>
>
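
The symptom discussed in the quoted thread (question marks in the verbose output while search still works) is what one would expect if the extracted text is fine but the console stream cannot represent the characters. The sketch below tries to illustrate that under stated assumptions: it does not use DSpace or PDFBox, the class and method names are invented for the example, and a `PrintStream` writing into a byte buffer stands in for `System.out` under different locales.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of DSpace.
class VerboseEncodingDemo {
    // Returns the bytes a console using the given charset would receive
    // when the text is printed through a PrintStream.
    static byte[] printedBytes(String text, String charsetName)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charsetName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Greek alpha, beta, gamma, written as escapes to keep the source ASCII-only.
        String greek = "\u03b1\u03b2\u03b3";

        // A stream that cannot represent Greek substitutes a replacement byte.
        String asAscii = new String(printedBytes(greek, "US-ASCII"),
                StandardCharsets.US_ASCII);
        // A UTF-8 stream keeps the characters intact.
        String asUtf8 = new String(printedBytes(greek, "UTF-8"),
                StandardCharsets.UTF_8);

        System.out.println("ASCII console sees: " + asAscii);
        System.out.println("UTF-8 round-trips:  " + asUtf8.equals(greek));
    }
}
```

The in-memory `String` is never damaged in either case; only the bytes written to the console differ, which is consistent with the indexed text being searchable even when the printed text is unreadable.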

