[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696025#comment-14696025
 ] 

Yaniv Kunda commented on TIKA-1706:
-----------------------------------

I agree that, in general, adding an external dependency to a core module 
might have an impact, but consider that, unlike tika-core, commons-io is a 
true low-level library: it has no compile-time dependencies and is used by 
more than 2500 projects in Maven Central alone.

I believe that copying the code of another library, frozen in time (in this 
case since 2008), hinders innovation and reduces the chance that anyone will 
utilize new improvements and fixes in newer commons-io since:
# it is disconnected from tika and requires manual discovery and research (if 
commons-io is used as an external dependency it's easy to find deprecated 
methods and their replacements using static analysis)
# it requires manual maintenance of copying select classes/code

It's not easy to summarize more than 7 years of changes in commons-io, but 
here are some beneficial changes I found along the way:
- Use org.apache.commons.io.output.ByteArrayOutputStream instead of 
java.io.ByteArrayOutputStream. This class is actually not that new, but it 
can benefit many uses and save a lot of byte-copying. It has been further 
improved by providing an optimized InputStream from a 
org.apache.commons.io.output.ByteArrayOutputStream (IO-137)
- Allow using Charset instead of String encoding (IO-318)
- Use StringBuilderWriter instead of StringWriter to avoid unnecessary 
synchronization (IO-140)
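
To illustrate the Charset point (IO-318) at the platform level, here is a 
minimal JDK-only sketch. It does not call commons-io; the class and method 
names (CharsetSketch, decodeByName, decodeByCharset) are hypothetical and 
exist only for this example. It assumes Java 7 or later for StandardCharsets.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetSketch {

    // Encoding passed by name: the name is resolved at run time, so the
    // caller must handle a checked exception even for a well-known encoding.
    public static String decodeByName(byte[] data, String encodingName)
            throws UnsupportedEncodingException {
        return new String(data, encodingName);
    }

    // Encoding passed as a Charset object: the charset is already resolved,
    // so no checked exception is declared.
    public static String decodeByCharset(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);

        // Both calls decode the same bytes to the same string.
        System.out.println(decodeByName(data, "UTF-8"));
        System.out.println(decodeByCharset(data, StandardCharsets.UTF_8));

        // A misspelled name only fails when the call is executed.
        try {
            decodeByName(data, "no-such-encoding");
        } catch (UnsupportedEncodingException e) {
            System.out.println("unsupported: " + e.getMessage());
        }
    }
}
```

An API that accepts a Charset lets callers pass a constant such as 
StandardCharsets.UTF_8, so a mistyped encoding name becomes a compile-time 
error rather than a run-time one.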

Obviously, I did not propose this change just for the sake of disrupting the 
peace. I have planned, and already started, a series of patches that utilize 
newer commons-io. They will follow, each in its own issue, if and when 
commons-io is added as a dependency to tika-core.


> Bring back commons-io to tika-core
> ----------------------------------
>
>                 Key: TIKA-1706
>                 URL: https://issues.apache.org/jira/browse/TIKA-1706
>             Project: Tika
>          Issue Type: Improvement
>          Components: core
>            Reporter: Yaniv Kunda
>            Priority: Minor
>             Fix For: 1.11
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the 
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is 
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which 
> code should be used
> - Having the inlined classes causes more maintenance and/or technical debt 
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset 
> objects instead of encoding names, being able to use StringBuilder instead of 
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes 
> with commons-io classes if this is accepted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
