Hi devs,
I ran into an issue where a test file containing UTF-8 text was being
displayed in Eclipse as US-ASCII.
I had thought that Tika would use UTF-8 everywhere for file encodings, but…
Currently the tika-parent/pom.xml has:
<properties>
  <maven.compiler.source>1.7</maven.compiler.source>
  <maven.compiler.target>1.7</maven.compiler.target>
  <project.reporting.outputEncoding>${project.build.sourceEncoding}</project.reporting.outputEncoding>
  <commons.compress.version>1.10</commons.compress.version>
  <commons.io.version>2.4</commons.io.version>
  <slf4j.version>1.7.12</slf4j.version>
  <pax.exam.version>4.4.0</pax.exam.version>
</properties>
Note that project.reporting.outputEncoding is set to
${project.build.sourceEncoding}, but as far as I can see
project.build.sourceEncoding itself isn't set anywhere in that file.
Is there a reason for this? If not, I can go ahead and set
project.build.sourceEncoding explicitly to UTF-8 in the 2.x branch.
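For concreteness, the change I have in mind is roughly the following. This
is only a sketch: I'm assuming the new property belongs in the same
<properties> block of tika-parent/pom.xml, and I've left the other entries
out for brevity.
<properties>
  <!-- proposed addition: declare the source encoding explicitly -->
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <!-- existing line, unchanged; it would then resolve to UTF-8 -->
  <project.reporting.outputEncoding>${project.build.sourceEncoding}</project.reporting.outputEncoding>
</properties>
Eclipse would probably need the project configuration refreshed from the
POM before it picks up the new encoding.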
Thanks,
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com