[ 
https://issues.apache.org/jira/browse/LUCENE-6079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227671#comment-14227671
 ] 

Jack Krupansky commented on LUCENE-6079:
----------------------------------------

But the pattern might in fact need the entire input, such as to match the end 
of the input with "$".

Still, it would be nice to have an optional "chunked mode" for cases such as 
this (assuming that the pattern doesn't end with "$"), for example input which 
is the full text of a multi-MB PDF file. I would suggest that such a mode be 
the default, with a reasonable chunk size such as 100K. There should also be an 
"overlap" size: matching on each chunk would start at an overlap from the end 
of the previous chunk, and a match that extends into the overlap area at the 
end of a chunk would not be performed unless it is the last chunk. That way, 
matches could still be made across chunk boundaries.
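
To make the chunk/overlap idea concrete, here is a rough standalone sketch. It 
is not the Lucene code and not a proposed patch: the class name, the method 
name and the chunkSize/overlap parameters are all made up for illustration. It 
treats the replacement as literal text (no "$1" group references), ignores 
offset correction, and assumes that the overlap is at least as long as the 
longest expected match and that the pattern has no anchors or lookarounds.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names exist in Lucene.
class ChunkedReplaceSketch {

    // Replaces every match of pattern with the literal replacement text,
    // reading the input chunkSize characters at a time.
    static String replaceChunked(Reader in, Pattern pattern, String replacement,
                                 int chunkSize, int overlap) {
        if (chunkSize < 1 || overlap < 1) {
            throw new IllegalArgumentException("chunkSize and overlap must be >= 1");
        }
        StringBuilder out = new StringBuilder();
        StringBuilder buf = new StringBuilder(); // carried-over tail + newest chunk
        char[] tmp = new char[chunkSize];
        boolean eof = false;
        try {
            while (!eof) {
                int n = in.read(tmp, 0, chunkSize);
                if (n < 0) {
                    eof = true;
                } else {
                    buf.append(tmp, 0, n);
                }
                // Matches must end before this limit, except on the last chunk.
                int limit = eof ? buf.length() : Math.max(0, buf.length() - overlap);
                int pos = 0;            // end of the last accepted match
                int carryFrom = limit;  // where the carried-over tail begins
                Matcher m = pattern.matcher(buf);
                while (m.find()) {
                    if (!eof && m.end() >= limit) {
                        // The match reaches the overlap area: defer it.
                        carryFrom = Math.min(m.start(), limit);
                        break;
                    }
                    out.append(buf, pos, m.start()).append(replacement);
                    pos = m.end();
                }
                // Emit the unmatched text that can no longer take part in a match.
                out.append(buf, pos, carryFrom);
                buf.delete(0, carryFrom);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Tiny chunk and overlap sizes, just to force matches across boundaries.
        Reader in = new StringReader("foo bar foo baz foo");
        System.out.println(replaceChunked(in, Pattern.compile("foo"), "X", 4, 3));
    }
}
```

The buffer only ever holds the current chunk plus whatever was carried over, so 
memory stays roughly bounded by chunkSize plus the overlap as long as matches 
are short. A match longer than the overlap can be missed or can make the 
carried-over tail grow, which is the trade-off of any such mode.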

Actually, it turns out that there was such a feature, with a "maxBlockChars" 
parameter, but it was deprecated long ago, with no mention in CHANGES.TXT. The 
parameter is still accepted by the factory code, with only a "TODO" comment 
suggesting that a warning would be appropriate, and the actual Lucene filter 
constructor simply ignores it.



> PatternReplaceCharFilter crashes JVM with OutOfMemoryError
> ----------------------------------------------------------
>
>                 Key: LUCENE-6079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6079
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.10.2
>         Environment: Microsoft Windows, x86_64, 32 GB main memory
>            Reporter: Alexander Veit
>            Priority: Critical
>
> PatternReplaceCharFilter fills memory with input data until an 
> OutOfMemoryError is thrown.
> java.lang.OutOfMemoryError: Java heap space
>       at java.util.Arrays.copyOf(Arrays.java:3332)
>       at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
>       at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
>       at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
>       at java.lang.StringBuilder.append(StringBuilder.java:190)
>       at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.fill(PatternReplaceCharFilter.java:84)
>       at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.read(PatternReplaceCharFilter.java:74)
>     ...
> PatternReplaceCharFilter should read data chunk-wise and pass the transformed 
> output chunk-wise to the caller.



