[
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696025#comment-14696025
]
Yaniv Kunda commented on TIKA-1706:
-----------------------------------
I agree that adding an external dependency to a core module can generally
have an impact,
but consider that, unlike tika-core, commons-io is a true low-level library:
it has no compile-time dependencies and is used by more than 2500 projects in
Maven Central alone.
I believe that copying the code of another library, frozen in time (in this
case since 2008), hinders innovation and reduces the chance that anyone will
utilize new improvements and fixes in newer commons-io since:
# it is disconnected from tika and requires manual discovery and research (if
commons-io is used as an external dependency it's easy to find deprecated
methods and their replacements using static analysis)
# it requires manual maintenance of copying select classes/code
It's not easy to summarize more than 7 years of changes in commons-io, but here
are some beneficial changes I found along the way:
- Use org.apache.commons.io.output.ByteArrayOutputStream instead of
java.io.ByteArrayOutputStream (this class is actually not that new, but it can
benefit many usages and save a lot of byte-copying). This has been further
improved by providing an optimized InputStream from an
org.apache.commons.io.output.ByteArrayOutputStream (IO-137)
- Allow using Charset instead of String encoding (IO-318)
- Use StringBuilderWriter instead of StringWriter to avoid unnecessary
synchronization (IO-140)
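To make the last two points concrete, here is a rough JDK-only sketch. It is
not the commons-io code: the class CharsetAndWriterSketch, its two decode
methods and the nested BuilderWriter are names I made up for illustration. It
only shows why a Charset parameter is nicer than an encoding name (no checked
UnsupportedEncodingException) and what a StringBuilder-backed Writer that skips
the locking done by java.io.Writer might look like.

```java
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Tika or commons-io.
class CharsetAndWriterSketch {

    // Encoding passed by name: a typo is only detected at runtime, and
    // every caller must handle the checked UnsupportedEncodingException.
    static String decodeByName(byte[] data, String encoding)
            throws UnsupportedEncodingException {
        return new String(data, encoding);
    }

    // Encoding passed as a Charset object: no checked exception to handle.
    static String decodeByCharset(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    // Minimal Writer backed by a StringBuilder. The write methods are
    // overridden so they do not go through the lock taken by the default
    // java.io.Writer implementations. An assumption-laden sketch, not the
    // commons-io StringBuilderWriter.
    static final class BuilderWriter extends Writer {
        private final StringBuilder builder = new StringBuilder();

        @Override
        public void write(int c) {
            builder.append((char) c);
        }

        @Override
        public void write(char[] cbuf, int off, int len) {
            builder.append(cbuf, off, len);
        }

        @Override
        public void write(String str) {
            builder.append(str);
        }

        @Override
        public void write(String str, int off, int len) {
            builder.append(str, off, off + len);
        }

        @Override
        public void flush() {
            // Nothing to flush: everything is already in the builder.
        }

        @Override
        public void close() {
            // Nothing to release.
        }

        @Override
        public String toString() {
            return builder.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeByName(data, "UTF-8"));
        System.out.println(decodeByCharset(data, StandardCharsets.UTF_8));

        BuilderWriter writer = new BuilderWriter();
        writer.write("commons-");
        writer.write("io");
        System.out.println(writer);
    }
}
```

With commons-io as a real dependency none of this would need to be written or
maintained in tika-core; the library classes mentioned above would be used
directly.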
Obviously, I did not propose this change just for the sake of disrupting the
peace. I have planned and started a series of patches to utilize newer
commons-io, which will follow, each in its own issue, if and when commons-io is
added as a dependency to tika-core.
> Bring back commons-io to tika-core
> ----------------------------------
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
> Issue Type: Improvement
> Components: core
> Reporter: Yaniv Kunda
> Priority: Minor
> Fix For: 1.11
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which
> code should be used
> - Having the inlined classes causes more maintenance and/or technical debt
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset
> objects instead of encoding names, being able to use StringBuilder instead of
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes
> with commons-io classes if this is accepted.