Re: [DISCUSS] Unnecessary deps exclusion in `tika-parsers`

Konstantin Gribov Wed, 24 Aug 2016 11:39:00 -0700

As far as I know, ProGuard does such tracing internally, but it only works
for trivial cases (like `Class.forName` with a string constant, see [1]).
Another simple way is to monitor which classes were loaded, using
`-verbose:class` in the case of HotSpot [2].
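
To illustrate the "trivial case" limitation, here is a minimal sketch. The
class and method names (`ReflectiveLoading`, `loadConstant`, `loadComputed`)
are made up for the example and are not from Tika or ProGuard. The first
lookup uses a string constant that a static tool can read straight from the
bytecode; the second builds the name at runtime, so a static tool generally
cannot tell which class (or jar) is needed.

```java
final class ReflectiveLoading {
    // Trivial case: the class name is a compile-time constant, so a static
    // analyzer can see the dependency without running the program.
    static Class<?> loadConstant() throws ClassNotFoundException {
        return Class.forName("java.util.ArrayList");
    }

    // Non-trivial case: the name is assembled at runtime (e.g. from config
    // or a plugin descriptor), so it is only known when the code executes.
    static Class<?> loadComputed(String pkg, String simpleName)
            throws ClassNotFoundException {
        return Class.forName(pkg + "." + simpleName);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(loadConstant().getName());
        System.out.println(loadComputed("java.util", "LinkedList").getName());
    }
}
```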


But the second way wouldn't show classes that were never loaded because of a
lack of tests, as with the ctakes parser.
At least that method catches SPI and similar dynamic loading of
plugins/modules.
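
A rough sketch of how the `-verbose:class` output could be summarized per
jar. It assumes the Java 8 era HotSpot line format
`[Loaded <class> from <source>]`; other JVMs and newer releases print a
different format, so the pattern would need adjusting. The class name
`LoadedSources` and the log file name in the usage note are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LoadedSources {
    // Assumed line format: [Loaded org.example.Foo from file:/path/foo.jar]
    private static final Pattern LOADED =
            Pattern.compile("^\\[Loaded (\\S+) from (.+)\\]$");

    // Counts loaded classes per source (jar or directory); lines that do
    // not match the assumed format are ignored.
    static Map<String, Integer> countBySource(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LOADED.matcher(line.trim());
            if (m.matches()) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LoadedSources <verbose-class-log>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        countBySource(lines).forEach(
                (source, n) -> System.out.println(n + "\t" + source));
    }
}
```

Usage would be along the lines of running the test suite or application with
`-verbose:class`, redirecting stdout to a file such as `classes.log`, and
passing that file to `LoadedSources`. Jars on the classpath that never show
up in the summary are candidates for exclusion, with the caveat above: code
paths that were never exercised don't load anything.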

Also, we have optional deps like Stanford CoreNLP (because of its license,
AFAIK), which wouldn't be covered by either method.

It would be hard to do fine-grained exclusion, but I advocate for a
coarse-grained one.
It could give a noticeable result with moderate effort, IMHO.

To be honest, I just exclude edu.ucar and similar deps because of their
huge footprint when I use Tika, since I can trade off support for some
scientific formats for a smaller footprint in my cases, so this issue
doesn't affect me directly.

[1]: http://proguard.sourceforge.net/index.html#manual/usage.html
[2]: http://www.oracle.com/technetwork/java/javase/clopts-139448.html#gbmtm


Wed, 24 Aug 2016 at 21:16, Ken Krugler <kkrugler_li...@transpac.com>:

> I think excluding more deps would be good…but challenging.
>
> The problem is that some of the jars only wind up getting used for edge
> cases (e.g. you have an encrypted email, and so you need bouncy castle, or
> something like that which had bitten me in the past).
>
> So it’s hard to know what’s really required or not. Is there a good Java
> tool for tracing all possible calls from starting points, to see if it’s
> even possible to reach a jar?
>
> Though that would need some help for cases where we’re dynamically loading
> classes (mostly plug-in support?)
>
> — Ken
>
>
> > On Aug 24, 2016, at 10:59am, Konstantin Gribov <gros...@gmail.com> wrote:
> >
> > Hi, folks.
> >
> > It seems that we have too many dependencies in `tika-parsers`, and many of
> > them might not actually be used. As Tim found in TIKA-2007 [1],
> > `jackson-core` wasn't necessary for `tika-parsers` at all.
> >
> > When I looked into the current parser deps, I found a lot of strange deps like
> > `quartz` with `c3p0` (JDBC connection pool impl) and `ehcache-core` via
> > `cdm`, Lucene parts (via `ctakes-core`), Spring Framework 3.x (also via
> > `ctakes-core`), et cetera. The latter could even break an app if you have
> > another Spring version in transitive deps.
> >
> > Also, at first glance there seem to be no tests for the ctakes parser, and
> > I have no easy way to check what I can exclude from deps without breaking
> > things.
> >
> > What do you think about shrinking some of these deps? With at least minimal
> > test coverage to ensure common use cases won't be broken, of course.
> >
> > [1]: https://issues.apache.org/jira/browse/TIKA-2007?focusedCommentId=15435206&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15435206
> > --
> >
> > Best regards,
> > Konstantin Gribov
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
> --

Best regards,
Konstantin Gribov
