[ https://issues.apache.org/jira/browse/TIKA-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959362#comment-15959362 ]
Hudson commented on TIKA-2317:
------------------------------
UNSTABLE: Integrated in Jenkins build Tika-trunk #1235 (See
[https://builds.apache.org/job/Tika-trunk/1235/])
TIKA-2317 -- warn when content string is truncated, allow easier (tallison:
[https://github.com/apache/tika/commit/246133a2d4ba6980217e04efabacef652a4a460c])
* (edit) tika-eval/src/test/java/org/apache/tika/eval/AnalyzerManagerTest.java
* (edit) tika-eval/src/main/resources/tika-eval-profiler-config.xml
* (edit) tika-eval/src/main/java/org/apache/tika/eval/db/Cols.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/ExtractComparer.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/EvalConsumerBuilder.java
* (edit) tika-eval/src/main/resources/tika-eval-comparison-config.xml
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerDeserializer.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/ExtractProfiler.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/TikaEvalCLITest.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/reports/Report.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/SimpleComparerTest.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerManager.java
* (edit) tika-eval/src/main/resources/lucene-analyzers.json
* (edit) tika-eval/src/test/java/org/apache/tika/eval/tokens/TokenCounterTest.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/ExtractProfilerBuilder.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/AbstractProfiler.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/ExtractComparerBuilder.java
> Add alert that string was truncated before counting tokens
> ----------------------------------------------------------
>
> Key: TIKA-2317
> URL: https://issues.apache.org/jira/browse/TIKA-2317
> Project: Tika
> Issue Type: Improvement
> Components: tika-eval
> Reporter: Tim Allison
> Priority: Trivial
>
> As a memory safety feature, there's a hard limit on the length of the string
> that is processed by the token counter. We should alert the user when the
> string is truncated, because comparisons can be misleading if extractA packs
> more words into the first 1,000,000 characters than extractB does, even
> though extractB actually contains more tokens.
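
To illustrate the problem described in the issue, here is a minimal sketch of truncating a string before counting tokens while recording that truncation happened. It is not the tika-eval implementation: the class TruncationSketch, the holder Truncated, and the methods truncate and countTokens are hypothetical names, whitespace splitting stands in for the real analyzer, and a limit of 9 characters stands in for the 1,000,000-character limit.

```java
// Hypothetical sketch; not the code from the TIKA-2317 commit.
final class TruncationSketch {

    // Holds the (possibly shortened) text plus a flag the caller can report.
    static final class Truncated {
        final String text;
        final boolean wasTruncated;

        Truncated(String text, boolean wasTruncated) {
            this.text = text;
            this.wasTruncated = wasTruncated;
        }
    }

    // Cuts content to at most maxLength UTF-16 code units and records
    // whether anything was dropped. Null is treated as empty content.
    static Truncated truncate(String content, int maxLength) {
        if (content == null) {
            return new Truncated("", false);
        }
        if (content.length() <= maxLength) {
            return new Truncated(content, false);
        }
        return new Truncated(content.substring(0, maxLength), true);
    }

    // Naive whitespace tokenizer, used here only as a stand-in.
    static int countTokens(String s) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        int limit = 9; // stand-in for the real limit
        String extractA = "a b c d e";             // 5 tokens, fits in the limit
        String extractB = "alpha bravo c d e f g"; // 7 tokens, exceeds the limit

        for (String extract : new String[] {extractA, extractB}) {
            Truncated t = truncate(extract, limit);
            System.out.println(countTokens(t.text) + " tokens counted"
                    + (t.wasTruncated ? " (WARNING: content was truncated)" : ""));
        }
    }
}
```

With these inputs, extractA is counted at 5 tokens and extractB at only 2, although extractB has 7 tokens in full. Without the flag, a report would suggest extractA extracted more content, which is the misleading comparison the issue wants to warn about.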