Please take a look at the main branch for what we’re planning to do in Tika 2.0.0. The goal is to modularize parsers at a finer granularity.
We need all the testing/feedback we can get for a successful launch.

On Sat, Sep 26, 2020 at 2:19 PM Laurence Vanhelsuwe <[email protected]> wrote:

> PDFBox did the trick for me. Thanks for the golden tip. :-)
>
> On 26 Sep 2020, at 19:06, Dave Fisher <[email protected]> wrote:
> >
> > IIRC - if you know you only want PDF extraction then take a look at
> > Apache PDFBox. PDFBox.apache.org
> >
> > Sent from my iPhone
> >
> >> On Sep 26, 2020, at 10:00 AM, Laurence Vanhelsuwe <[email protected]> wrote:
> >>
> >> Thanks for the explanation.
> >>
> >> I understand the approach, but in my particular use case I cannot
> >> reasonably justify inflating my application size from 7 MB to 77 MB just
> >> to add functionality amounting to less than 1% of all functionality.
> >>
> >> I guess there’s no way to surgically extract just the PDF metadata
> >> parsing functionality from Tika?
> >>
> >> Laurence
> >>
> >>> On 26 Sep 2020, at 18:04, Keith Bennett <[email protected]> wrote:
> >>>
> >>> Tika coordinates the use of many external-to-Tika parser libraries,
> >>> each with their own dependencies, for parsing many types of files.
> >>> These parser libraries are bundled into the tika-app jar file for your
> >>> convenience. I believe it's these libraries that make up the bulk of
> >>> the download. For example, if you unzip the jar file and inspect the
> >>> contents, you can see that just one of these parsers, poi, consists of
> >>> 24 MB:
> >>>
> >>> % cd org/apache/poi
> >>>
> >>> % du -sh .
> >>> 24M .
> >>>
> >>> - Keith
> >>>
> >>>> On Sat, Sep 26, 2020 at 7:54 AM Laurence Vanhelsuwe <[email protected]>
> >>>> wrote:
> >>>>
> >>>> I found Tika during a quest to extract PDF metadata in Java. Did I
> >>>> screw up the JAR download, or is Tika really 70 MB?
> >>>>
> >>>> Kind regards,
> >>>>
> >>>> Laurence
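
For anyone who wants to repeat Keith's size check without unzipping the jar, here is a minimal sketch that uses only the JDK's java.util.zip classes. The class name JarSizeByPackage, the helper names, and the depth argument are made up for illustration and are not part of Tika. It sums uncompressed entry sizes per package directory, so the numbers should be roughly comparable to what du reports on the unpacked tree, not to the compressed jar size.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class JarSizeByPackage {

    // Sums uncompressed entry sizes, grouped by the first `depth`
    // directory segments of each entry name (e.g. "org/apache/poi").
    static Map<String, Long> sizeByPrefix(String jarPath, int depth) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (ZipFile zip = new ZipFile(jarPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                long size = entry.getSize();
                if (size < 0) {
                    continue; // size not recorded for this entry
                }
                totals.merge(prefix(entry.getName(), depth), size, Long::sum);
            }
        }
        return totals;
    }

    // Returns up to `depth` leading directory segments of an entry name,
    // or "(root)" for files that sit at the top level of the archive.
    static String prefix(String name, int depth) {
        String[] parts = name.split("/");
        int dirSegments = parts.length - 1; // last segment is the file name
        int n = Math.min(depth, dirSegments);
        if (n <= 0) {
            return "(root)";
        }
        return String.join("/", Arrays.copyOf(parts, n));
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java JarSizeByPackage <jar> [depth]");
            return;
        }
        int depth = args.length > 1 ? Integer.parseInt(args[1]) : 3;
        // Print the largest groups first.
        sizeByPrefix(args[0], depth).entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(15)
            .forEach(e -> System.out.printf("%8.1f MB  %s%n",
                e.getValue() / (1024.0 * 1024.0), e.getKey()));
    }
}
```

Run it against whatever tika-app jar you downloaded, passing the jar path as the first argument. A depth of 3 groups entries at the org/apache/poi level used in Keith's example.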
