Re: Text extraction : do we need those files ?

Jukka Zitting Tue, 10 Aug 2010 03:20:46 -0700

Hi,

On Mon, Aug 9, 2010 at 3:41 PM, Bernard Segonnes <[email protected]> wrote:
> I have ported PDFBox 1.1.0 to Android (text extraction only). The binary is
> too big and too slow (probably due to memory constraints): around 5 MB (9 MB
> once installed on a mobile device, which is too much).


See PDFBOX-586 [1] for some related progress.

> Are the files in:
> 1) cmap required? (78-EUC_H, Adobe-CNS-5, GBK-EUC-V, UniKS-UTF8-H,
> ...) I would be pleased to remove all those files :-)

These are only needed for processing PDF documents that use CJK
(Chinese, Japanese, Korean) fonts. These CMaps are needed to translate
from the internal font-specific character identification codes to
Unicode.
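
To illustrate the kind of translation a CMap performs, here is a minimal
sketch in Java. It is not PDFBox's implementation: the ToyCMap class, its
methods, the fixed two-byte code width and the sample codes are all
assumptions made up for this example (real CMaps can use variable-length
codes and range mappings).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a CMap (hypothetical, not the PDFBox class). */
public class ToyCMap {
    // Maps a font-specific character code to the Unicode text it represents.
    private final Map<Integer, String> codeToUnicode = new HashMap<>();

    /** Registers one mapping from a character code to its Unicode text. */
    public void addMapping(int code, String unicode) {
        codeToUnicode.put(code, unicode);
    }

    /**
     * Decodes a byte string, assuming every code is exactly two bytes wide
     * (big-endian). Codes without a mapping become U+FFFD.
     */
    public String decode(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 1 < data.length; i += 2) {
            int code = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            out.append(codeToUnicode.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }
}
```

Without the mapping table the codes cannot be turned into readable text,
which is why the CMap files matter for CJK documents and can be left out
if such documents never need to be processed.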

> 2) pdf_*.xml: are they required for text extraction? (pdf_he_IL.xml,
> pdf_zh_Hant.xml, ...)

These are part of the ICU4J library. You only need ICU4J for handling
Arabic and other right-to-left languages.
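
If you want to check whether extracted text contains right-to-left
content at all, a rough sketch like the following may help. It uses the
JDK's own java.text.Bidi class rather than ICU4J, and the BidiCheck
class and needsBidi method are names invented for this example; it says
nothing about how PDFBox itself decides when to call ICU4J.

```java
import java.text.Bidi;

/** Hypothetical helper, not part of PDFBox or ICU4J. */
public class BidiCheck {
    /**
     * Returns true if the text contains characters that may need
     * bidirectional reordering (for example Hebrew or Arabic letters).
     */
    public static boolean needsBidi(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }
}
```

For instance, needsBidi("Hello") is false, while a string of Hebrew
letters makes it return true.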

[1] https://issues.apache.org/jira/browse/PDFBOX-586

BR,

Jukka Zitting
