Hi,

On Mon, Aug 9, 2010 at 3:41 PM, Bernard Segonnes <[email protected]> wrote:
> I have ported PDFBox 1.1.0 to Android (only text extraction). The binary is
> too big & too slow (probably due to memory constraints...): around 5 MB (9 MB
> once installed on a mobile device: too much)

See PDFBOX-586 [1] for some related progress.

> Are the files in:
> 1) cmap required? (78-EUC_H Adobe-CNS-5 GBK-EUC-V UniKS-UTF8-H
> ...) I would be pleased to remove all those files :-)

These are only needed for processing PDF documents that use CJK (Chinese,
Japanese, Korean) fonts. The CMaps translate the internal font-specific
character identification codes to Unicode.

> 2) pdf_*.xml: are they required for text extraction? (pdf_he_IL.xml
> pdf_zh_Hant.xml ....)

These are part of the ICU4J library. You only need ICU4J for handling
Arabic and other right-to-left languages.

[1] https://issues.apache.org/jira/browse/PDFBOX-586

BR,

Jukka Zitting
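
PS: A rough way to judge whether a given set of documents needs the CJK CMaps or ICU4J at all is to extract text from a representative sample with the full build and look at which scripts show up. The sketch below is only an illustration under that assumption: the class and method names are hypothetical and not part of PDFBox, and it relies on java.lang.Character APIs from Java 8, which may not be available on older Android runtimes.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of PDFBox: inspects already-extracted text.
final class ScriptCheck {
    private ScriptCheck() {}

    // True if the text contains Han, Hiragana, Katakana, or Hangul characters,
    // i.e. the kind of content for which the CJK CMaps are likely relevant.
    static boolean containsCjk(String text) {
        return text.codePoints().anyMatch(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            return script == UnicodeScript.HAN
                || script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA
                || script == UnicodeScript.HANGUL;
        });
    }

    // True if the text contains strongly right-to-left characters
    // (Hebrew, Arabic, and similar), i.e. where bidi handling matters.
    static boolean containsRtl(String text) {
        return text.codePoints().anyMatch(cp -> {
            byte dir = Character.getDirectionality(cp);
            return dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
        });
    }
}
```

If neither check ever returns true for your sample, dropping the corresponding files is probably safe for those documents. It says nothing about documents outside the sample.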
