[Dspace-tech] Dspace xpdf filter_Cyrillic text extraction

pchun Thu, 13 Mar 2014 02:52:28 -0700

Hi everyone,

My situation is as follows:

1) I am trying to reconfigure DSpace to use the xpdf media filter on my 4.1
test installation. The install went smoothly as far as I can tell, and I am
able to run filter-media without any error messages.

2) Extraction works fine for English-language files. However, extraction from
Russian-language (Cyrillic) PDFs returns a txt file full of unrecognizable
characters.

3) The strange thing is that I did set the "textEncoding UTF-8" option in the
xpdfrc config file for xpdf, so presumably the txt files generated by xpdf
should be fine encoding-wise. To test this, I ran xpdf from the command prompt
on one of my Cyrillic PDFs. The output txt file was readable and
UTF-8-encoded, as expected. I then uploaded this txt file to DSpace as an
ordinary bitstream for one of my test items and opened it from DSpace with
view/open. The browser displayed unrecognizable characters, with the encoding
autodetected as Cyrillic (ISO-8859-5). Changing it manually to UTF-8 shows the
expected text.
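
To illustrate what I think I am seeing in the browser, here is a minimal
Python sketch. The sample string is made up and only stands in for the
contents of my extracted file; it shows the same UTF-8 bytes read once as
ISO-8859-5 and once as UTF-8. It is not a fix, only a reproduction of the
symptom under the assumption that the bytes themselves are fine.

```python
# Hypothetical sample text standing in for the extracted txt file contents.
text = "Привет, мир"

# Bytes as they would be stored on disk when the text is UTF-8 encoded.
utf8_bytes = text.encode("utf-8")

# What a viewer shows if it guesses ISO-8859-5 for those same bytes:
# every byte becomes one character, so each Cyrillic letter turns into two.
misread = utf8_bytes.decode("iso-8859-5")

# What the viewer shows once the encoding is switched to UTF-8 by hand.
correct = utf8_bytes.decode("utf-8")

print(repr(misread))
print(correct)
```

If that is really what happens, the file is intact and only the encoding
guessed on the viewing side is wrong.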

Any ideas on how to fix it?

Pavel Chunzhin

--
View this message in context:
http://dspace.2283337.n4.nabble.com/Dspace-xpdf-filter-Cyrillic-text-extraction-tp4672126.html
Sent from the DSpace - Tech mailing list archive at Nabble.com.

_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
