Re: Text extraction : do we need those files ?

Bernard Segonnes Tue, 10 Aug 2010 04:10:14 -0700

Thanks for the answer.

PDFBOX-586 is from me  :-)


So, as I expect to have customers in Asian countries and in countries with
right-to-left languages, I will keep those files :-(

(I sometimes get an OutOfMemoryError, which I should catch, as my app runs on
mobile devices/phones.)  I will optimize elsewhere.
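
A minimal sketch of how the out-of-memory case mentioned above could be caught
around an extraction call. The class name SafeExtraction, the method
extractOrFallback and the fallback string are hypothetical names chosen for
illustration, not part of PDFBox; the Callable stands in for whatever code
actually runs the text extraction. Recovery after an OutOfMemoryError is not
guaranteed, so this is only a best-effort guard:

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of PDFBox.
final class SafeExtraction {

    private SafeExtraction() {
    }

    /**
     * Runs the given extraction and returns its text.
     * If the JVM runs out of heap during the call, returns the fallback instead.
     */
    static String extractOrFallback(Callable<String> extraction, String fallback) {
        try {
            return extraction.call();
        } catch (OutOfMemoryError e) {
            // Heap exhausted: give up on this document and report the fallback.
            return fallback;
        } catch (Exception e) {
            // Any other failure is passed on to the caller unchanged in meaning.
            throw new RuntimeException("Text extraction failed", e);
        }
    }
}
```

In the app, the Callable would wrap the real extraction code, and the fallback
could be a short message shown to the user instead of a crash.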

Quoting Jukka Zitting <[email protected]>:

> Hi,
>
> On Mon, Aug 9, 2010 at 3:41 PM, Bernard Segonnes <[email protected]> wrote:
> > I have ported PDFBox 1.1.0 to Android (text extraction only). The binary is
> > too big and too slow (probably due to memory constraints...): around 5 MB
> > (9 MB once installed on a mobile device: too much)
>
> See PDFBOX-586 [1] for some related progress.
>
> > Are the files in:
> > 1) cmap required? (78-EUC_H   Adobe-CNS-5   GBK-EUC-V   UniKS-UTF8-H
> > ...)  I would be pleased to remove all those files :-)
>
> These are only needed for processing PDF documents that use CJK
> (Chinese, Japanese, Korean) fonts. These CMaps are needed to translate
> from the internal font-specific character identification codes to
> Unicode.
>
> > 2) pdf_*.xml: are they required for text extraction? (pdf_he_IL.xml
> > pdf_zh_Hant.xml ....)
>
> These are part of the ICU4J library. You only need ICU4J for handling
> Arabic and other right-to-left languages.
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-586
>
> BR,
>
> Jukka Zitting
>


Bernard SEGONNES
-------------------------------------
[email protected]
http://bsegonnes.free.fr
