[
https://issues.apache.org/jira/browse/LUCENE-6079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227671#comment-14227671
]
Jack Krupansky commented on LUCENE-6079:
----------------------------------------
But the pattern might in fact need the entire input, such as to match the end
of the input with "$".
Still, it would be nice to have an optional "chunked mode" for cases such as
this (assuming that the pattern doesn't end with "$"), for example input that is
the full text of a multi-MB PDF file. I would suggest that such a mode be the
default, with a reasonable chunk size such as 100K. There should also be an
"overlap" size: a match would not be allowed to extend into the overlap area at
the end of a chunk unless it is the last chunk, and matching on the next chunk
would start from that overlap, so that matches could still be made across
chunk boundaries.
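A minimal sketch of the chunk-plus-overlap idea described above, in plain Java.
This is not the Lucene implementation: the class name ChunkedReplaceSketch and
the chunkSize and overlap parameters are hypothetical. It assumes the overlap is
at least as long as the longest expected match, and that the pattern never
matches the empty string and uses no anchors or lookaround, since context
before a chunk cut is lost.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Lucene.
class ChunkedReplaceSketch {

    /** Streams the input through the pattern, buffering roughly chunkSize + overlap chars. */
    static void replaceAll(Reader in, Appendable out, Pattern pattern, String replacement,
                           int chunkSize, int overlap) throws IOException {
        if (chunkSize <= 0 || overlap < 0) {
            throw new IllegalArgumentException("chunkSize must be > 0 and overlap >= 0");
        }
        StringBuilder buf = new StringBuilder();
        char[] chunk = new char[chunkSize];
        boolean last = false;
        while (!last) {
            int n = in.read(chunk, 0, chunkSize);
            if (n < 0) {
                last = true;                 // end of input: the whole buffer may be matched
            } else {
                buf.append(chunk, 0, n);
            }
            // A match may not end past this limit unless this is the last chunk.
            int limit = last ? buf.length() : Math.max(0, buf.length() - overlap);
            Matcher m = pattern.matcher(buf);
            StringBuffer sb = new StringBuffer();
            int consumed = 0;                // end of the last accepted match
            int cut = limit;                 // everything before cut is emitted this round
            while (m.find()) {
                if (m.end() > limit) {
                    // The match reaches into the overlap area: defer it to the next round.
                    cut = Math.min(m.start(), limit);
                    break;
                }
                m.appendReplacement(sb, replacement);
                consumed = m.end();
            }
            sb.append(buf, consumed, cut);   // unmatched text between last match and cut
            out.append(sb);
            buf.delete(0, cut);              // keep the tail for the next round
        }
    }

    /** Convenience overload for in-memory input. */
    static String replaceAll(String input, Pattern pattern, String replacement,
                             int chunkSize, int overlap) {
        StringBuilder out = new StringBuilder();
        try {
            replaceAll(new StringReader(input), out, pattern, replacement, chunkSize, overlap);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

For a pattern such as "\\d+" the result should agree with Matcher.replaceAll
on the full input, while memory use stays bounded as long as no single match
spans many chunks. A match longer than the overlap makes the buffer grow until
the match can be completed, and a pattern like ".*" in DOTALL mode would still
pull in the whole input.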
Actually, it turns out that there was such a feature, with a "maxBlockChars"
parameter, but it was deprecated long ago - no mention in CHANGES.TXT. But...
it's still supported in the factory code, with only a "TODO" comment suggesting
that a warning would be appropriate, but the actual Lucene filter constructor
simply ignores this parameter.
> PatternReplaceCharFilter crashes JVM with OutOfMemoryError
> ----------------------------------------------------------
>
> Key: LUCENE-6079
> URL: https://issues.apache.org/jira/browse/LUCENE-6079
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 4.10.2
> Environment: Microsoft Windows, x86_64, 32 GB main memory
> Reporter: Alexander Veit
> Priority: Critical
>
> PatternReplaceCharFilter fills memory with input data until an
> OutOfMemoryError is thrown.
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3332)
> at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
> at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
> at java.lang.StringBuilder.append(StringBuilder.java:190)
> at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.fill(PatternReplaceCharFilter.java:84)
> at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.read(PatternReplaceCharFilter.java:74)
> ...
> PatternReplaceCharFilter should read data chunk-wise and pass the transformed
> output chunk-wise to the caller.