Hi all,

I was recently working on full-text indexing of some scanned/OCR'ed Arabic PDF files in DSpace. The extracted text was always indexed with each word in left-to-right order. According to the documentation for PDFBox (the library DSpace uses for PDF text extraction), sorting (setSortByPosition) needs to be enabled in PDFTextStripper for proper handling of RTL text:

http://pdfbox.apache.org/userguide/text_extraction.html
http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html

Based on the above, I've developed a patch (attached here) that works as expected in my demo instance. I'd appreciate any feedback.
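
To make the opt-in behaviour of the new flag concrete, here is a minimal stand-alone sketch. It does not use DSpace or PDFBox at all: java.util.Properties and the helper getBooleanProperty below are hypothetical stand-ins for ConfigurationManager, and the real class may parse values differently. The only point is that a missing or commented-out pdffilter.parseRTL falls back to false, so sorting is switched on only when the property is set.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class ParseRtlFlagSketch {

    // Hypothetical stand-in for ConfigurationManager.getBooleanProperty(name, default):
    // returns the default when the key is absent, otherwise parses the value.
    static boolean getBooleanProperty(Properties cfg, String name, boolean defaultValue) {
        String raw = cfg.getProperty(name);
        if (raw == null) {
            return defaultValue;
        }
        return Boolean.parseBoolean(raw.trim());
    }

    // Loads a dspace.cfg-style fragment from a string (illustration only).
    static Properties load(String text) {
        Properties cfg = new Properties();
        try {
            cfg.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cfg;
    }

    public static void main(String[] args) {
        Properties commentedOut = load("#pdffilter.parseRTL = true\n");
        Properties enabled = load("pdffilter.parseRTL = true\n");

        // Commented out: the default (false) applies, so the stripper is left as it is.
        System.out.println(getBooleanProperty(commentedOut, "pdffilter.parseRTL", false));
        // Set to true: this is the case where the patch calls pts.setSortByPosition(true).
        System.out.println(getBooleanProperty(enabled, "pdffilter.parseRTL", false));
    }
}
```

Note that the patch as posted adds the property uncommented to dspace.cfg, so a stock configuration would have it enabled; commenting it out restores the previous behaviour.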

Best regards,
Saiful

--
Saiful Amin, PhD
Visiting Scientist
Documentation Research and Training Centre
Indian Statistical Institute
8th Mile, Mysore Road
P.O. RV College
Bangalore - 560059.
Tel.: +91-80-2848 3002-6 Ext:331
Mob.: +91-9343826438

diff --git dspace-api/src/main/java/org/dspace/app/mediafilter/PDFFilter.java dspace-api/src/main/java/org/dspace/app/mediafilter/PDFFilter.java
index 7f07342..ec6927c 100644
--- dspace-api/src/main/java/org/dspace/app/mediafilter/PDFFilter.java
+++ dspace-api/src/main/java/org/dspace/app/mediafilter/PDFFilter.java
@@ -74,6 +74,7 @@ public class PDFFilter extends MediaFilter
         try
         {
             boolean useTemporaryFile = ConfigurationManager.getBooleanProperty("pdffilter.largepdfs", false);
+            boolean parseRTL = ConfigurationManager.getBooleanProperty("pdffilter.parseRTL", false);
 
             // get input stream from bitstream
             // pass to filter, get string back
@@ -94,6 +95,12 @@ public class PDFFilter extends MediaFilter
                 byteStream = new ByteArrayOutputStream();
                 writer = new OutputStreamWriter(byteStream);
             }
+
+            // parse RTL PDF files (e.g., Arabic)
+            if (parseRTL)
+            {
+                pts.setSortByPosition(true);
+            }
             
             try
             {
diff --git dspace/config/dspace.cfg dspace/config/dspace.cfg
index 50d826b..a4fbf99 100644
--- dspace/config/dspace.cfg
+++ dspace/config/dspace.cfg
@@ -425,6 +425,7 @@ filter.org.dspace.app.mediafilter.BrandedPreviewJPEGFilter.inputFormats = BMP, G
 # is slower, but helps ensure that PDFBox software DSpace uses doesn't eat up
 # all your memory
 #pdffilter.largepdfs = true
+pdffilter.parseRTL = true
 # If true, PDFs which still result in an Out of Memory error from PDFBox
 # are skipped over...these problematic PDFs will never be indexed until
 # memory usage can be decreased in the PDFBox software
_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel
