[ https://issues.apache.org/jira/browse/TIKA-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959319#comment-15959319 ]
Hudson commented on TIKA-2317:
------------------------------
FAILURE: Integrated in Jenkins build tika-2.x-windows #189 (See
[https://builds.apache.org/job/tika-2.x-windows/189/])
TIKA-2317 warn user if max content length is hit; allow for easier (tallison:
rev 67a5e91b2a4157ee06f924280b0b828819c88223)
* (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/EvalConsumerBuilder.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/AbstractProfiler.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/ExtractComparer.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/ExtractProfilerBuilder.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/ExtractComparerBuilder.java
* (edit) tika-eval/src/main/resources/log4j.properties
* (edit) tika-eval/src/test/java/org/apache/tika/eval/TikaEvalCLITest.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerManager.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerDeserializer.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/io/XMLLogReader.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/XMLErrorLogUpdater.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/reports/Report.java
* (edit) tika-eval/src/main/resources/tika-eval-profiler-config.xml
* (edit) tika-eval/src/main/java/org/apache/tika/eval/db/MimeBuffer.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/AnalyzerManagerTest.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/io/DBWriter.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/db/Cols.java
* (edit) tika-eval/src/test/resources/single-file-profiler-crawl-extract-config.xml
* (edit) tika-eval/src/main/resources/lucene-analyzers.json
* (edit) tika-eval/src/main/java/org/apache/tika/eval/io/ExtractReader.java
* (edit) tika-eval/src/main/resources/tika-eval-comparison-config.xml
* (edit) tika-eval/src/test/java/org/apache/tika/eval/SimpleComparerTest.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/tokens/TokenCounterTest.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/reports/ResultsReporter.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/db/AbstractBufferTest.java
* (edit) tika-eval/src/main/resources/profile-reports.xml
* (edit) tika-eval/src/main/resources/comparison-reports.xml
* (edit) tika-eval/src/main/java/org/apache/tika/eval/ExtractProfiler.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/db/JDBCUtil.java
> Add alert that string was truncated before counting tokens
> ----------------------------------------------------------
>
> Key: TIKA-2317
> URL: https://issues.apache.org/jira/browse/TIKA-2317
> Project: Tika
> Issue Type: Improvement
> Components: tika-eval
> Reporter: Tim Allison
> Priority: Trivial
>
> As a memory safety feature, there's a hard limit on the length of the string
> that is processed by the token counter. We should alert the user when the
> string is truncated, because comparisons can be misleading if extractA packs
> more words into the first 1000000 characters than extractB does, even though
> extractB actually contains more tokens.
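
The effect described in the issue can be illustrated with a small, self-contained sketch. This is not the tika-eval code touched by the commit above; the class `TruncationSketch`, its methods, the whitespace tokenizer, and the tiny limit of 10 characters (standing in for 1000000) are all hypothetical and chosen only to make the behavior easy to see.

```java
import java.util.logging.Logger;

// Hypothetical sketch, not the tika-eval implementation.
final class TruncationSketch {
    private static final Logger LOG =
            Logger.getLogger(TruncationSketch.class.getName());

    /** Holds the text that will be tokenized and whether it was cut short. */
    static final class Truncated {
        final String text;
        final boolean wasTruncated;

        Truncated(String text, boolean wasTruncated) {
            this.text = text;
            this.wasTruncated = wasTruncated;
        }
    }

    /** Cuts content to at most maxLength characters and warns if it had to. */
    static Truncated truncate(String content, int maxLength) {
        if (content.length() <= maxLength) {
            return new Truncated(content, false);
        }
        LOG.warning("Content truncated from " + content.length()
                + " to " + maxLength + " characters before token counting");
        return new Truncated(content.substring(0, maxLength), true);
    }

    /** Naive whitespace tokenizer, used here only for illustration. */
    static int countTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String extractA = "a b c d e f";                             // 6 tokens
        String extractB = "alpha beta gamma delta epsilon zeta eta"; // 7 tokens
        int limit = 10; // stand-in for the real limit

        Truncated a = truncate(extractA, limit);
        Truncated b = truncate(extractB, limit);

        // After truncation extractA appears to have more tokens than extractB,
        // although the full extractB has more. The flag lets a report say so.
        System.out.println("A: " + countTokens(a.text)
                + " tokens, truncated=" + a.wasTruncated);
        System.out.println("B: " + countTokens(b.text)
                + " tokens, truncated=" + b.wasTruncated);
    }
}
```

Carrying a `wasTruncated` flag alongside the text, rather than only logging, is one way to let a downstream report mark affected comparisons; whether tika-eval records it this way is not stated in this message.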