Please take a look at the main branch for what we’re planning to do in Tika
2.0.0. The goal is to modularize parsers at a finer granularity.

We need all the testing/feedback we can get for a successful launch.

On Sat, Sep 26, 2020 at 2:19 PM Laurence Vanhelsuwe <[email protected]>
wrote:

> PDFBox did the trick for me. Thanks for the golden tip. :-)
>
> > On 26 Sep 2020, at 19:06, Dave Fisher <[email protected]> wrote:
> >
> > IIRC - if you know you only want PDF extraction then take a look at
> > Apache PDFBox. PDFBox.apache.org
> >
> >> On Sep 26, 2020, at 10:00 AM, Laurence Vanhelsuwe <[email protected]> wrote:
> >>
> >> Thanks for the explanation.
> >>
> >> I understand the approach, but in my particular use case I cannot
> >> reasonably justify inflating my application size from 7 MB to 77 MB
> >> just to add functionality amounting to less than 1% of all
> >> functionality.
> >>
> >> I guess there’s no way to surgically extract just the PDF metadata
> >> parsing functionality from Tika?
> >>
> >> Laurence
> >>
> >>> On 26 Sep 2020, at 18:04, Keith Bennett <[email protected]> wrote:
> >>>
> >>> Tika coordinates the use of many external-to-Tika parser libraries,
> >>> each with its own dependencies, for parsing many types of files.
> >>> These parser libraries are bundled into the tika-app jar file for your
> >>> convenience. I believe it's these libraries that make up the bulk of
> >>> the download. For example, if you unzip the jar file and inspect the
> >>> contents, you can see that just one of these parsers, poi, consists of
> >>> 24 MB:
> >>>
> >>> % cd org/apache/poi
> >>>
> >>> % du -sh .
> >>> 24M .
> >>>
> >>> - Keith
> >>>
> >>>> On Sat, Sep 26, 2020 at 7:54 AM Laurence Vanhelsuwe <[email protected]> wrote:
> >>>>
> >>>> I found Tika during a quest to extract PDF metadata in Java. Did I
> >>>> screw up the JAR download, or is Tika really 70 MB?
> >>>>
> >>>> Kind regards,
> >>>>
> >>>> Laurence
> >>>>
> >>>>
> >>
> >
>
>
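
The size check Keith describes (unzip the jar, then run `du -sh` on one package directory) can also be done without unpacking anything. Below is a minimal sketch that reads the jar's entry table with `java.util.zip` and sums the uncompressed sizes per package prefix. The class name `JarSizeByPackage`, the helper names `prefix` and `sizeByPrefix`, and the grouping depth of 3 are assumptions made for illustration; none of this is part of Tika.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Tika: reports which package prefixes
// account for most of a jar's uncompressed size.
class JarSizeByPackage {

    // Truncate an entry name such as "org/apache/poi/Foo.class" to its first
    // `depth` directory segments ("org/apache/poi" for depth 3).
    // Entries with no directory part are grouped under "(root)".
    static String prefix(String entryName, int depth) {
        String[] parts = entryName.split("/");
        int dirs = Math.min(depth, parts.length - 1); // last part is the file name
        if (dirs <= 0) {
            return "(root)";
        }
        return String.join("/", Arrays.copyOf(parts, dirs));
    }

    // Sum the uncompressed size of every file entry, grouped by prefix.
    static Map<String, Long> sizeByPrefix(Path jar, int depth) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // getSize() returns -1 when the size is unknown; count that as 0.
                long size = Math.max(entry.getSize(), 0L);
                totals.merge(prefix(entry.getName(), depth), size, Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java JarSizeByPackage.java <path-to-jar>");
            return;
        }
        // Print the 20 largest prefixes, biggest first.
        sizeByPrefix(Paths.get(args[0]), 3).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(20)
                .forEach(e -> System.out.printf("%8.1f MB  %s%n",
                        e.getValue() / (1024.0 * 1024.0), e.getKey()));
    }
}
```

Run against the tika-app jar, this should list a line for `org/apache/poi` alongside the other bundled libraries. The totals are uncompressed byte counts, so they may differ somewhat from what `du` reports for an unpacked tree.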
