[
https://issues.apache.org/jira/browse/PARQUET-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653039#comment-17653039
]
Abhishek Jain commented on PARQUET-2220:
----------------------------------------
Very sorry for tagging [~gszadovszky] [~theosib-amazon]. I just want to get this
noticed.
> Parquet Filter predicate storing nested string causing OOM's
> ------------------------------------------------------------
>
> Key: PARQUET-2220
> URL: https://issues.apache.org/jira/browse/PARQUET-2220
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Abhishek Jain
> Priority: Critical
>
> Each instance of ColumnFilterPredicate eagerly builds a string containing the
> filter value and stores it in its toString field, which is not useful.
> {code:java}
> static abstract class ColumnFilterPredicate<T extends Comparable<T>>
>     implements FilterPredicate, Serializable {
>   private final Column<T> column;
>   private final T value;
>   private final String toString;
>
>   protected ColumnFilterPredicate(Column<T> column, T value) {
>     this.column = Objects.requireNonNull(column, "column cannot be null");
>     // Eq and NotEq allow value to be null, Lt, Gt, LtEq, GtEq however do not,
>     // so they guard against null in their own constructors.
>     this.value = value;
>     String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
>     this.toString = name + "(" + column.getColumnPath().toDotString() + ", "
>         + value + ")";
>   }
> {code}
>
>
> If the filter predicate is long or deeply nested, building these strings can
> consume a lot of memory while the filter is being created.
> In production we have seen this reach up to 4 GB when multiple Parquet
> readers are opened.
> The same pattern appears in BinaryLogicalFilterPredicate, where toString is
> also computed eagerly and stored, which causes a lot of duplication when
> And/Or filters are built.
> I could not find a use case that requires computing it so eagerly.
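
One possible direction, sketched here as an illustration only: build the string on demand inside toString() instead of storing it in the constructor. The classes below (LazyToStringSketch, ColumnPredicate, BinaryLogical, Eq, And, Or) are hypothetical stand-ins and not the actual Parquet classes; in particular, a plain String column path replaces Column<T>, and Serializable is left out for brevity.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical stand-ins for illustration; NOT the real Parquet classes.
final class LazyToStringSketch {

    interface Predicate {}

    // Leaf predicate: keeps only the column path and the value.
    // No string is built in the constructor.
    static class ColumnPredicate<T extends Comparable<T>> implements Predicate {
        private final String columnPath; // assumption: plain String instead of Column<T>
        private final T value;           // may be null, as for Eq/NotEq in the original

        ColumnPredicate(String columnPath, T value) {
            this.columnPath = Objects.requireNonNull(columnPath, "columnPath cannot be null");
            this.value = value;
        }

        // The string is computed only when somebody actually asks for it.
        @Override
        public String toString() {
            String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
            return name + "(" + columnPath + ", " + value + ")";
        }
    }

    static final class Eq<T extends Comparable<T>> extends ColumnPredicate<T> {
        Eq(String columnPath, T value) {
            super(columnPath, value);
        }
    }

    // Binary node: keeps references to its children, not copies of their strings.
    static class BinaryLogical implements Predicate {
        private final Predicate left;
        private final Predicate right;

        BinaryLogical(Predicate left, Predicate right) {
            this.left = Objects.requireNonNull(left, "left cannot be null");
            this.right = Objects.requireNonNull(right, "right cannot be null");
        }

        @Override
        public String toString() {
            String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
            return name + "(" + left + ", " + right + ")";
        }
    }

    static final class And extends BinaryLogical {
        And(Predicate left, Predicate right) {
            super(left, right);
        }
    }

    static final class Or extends BinaryLogical {
        Or(Predicate left, Predicate right) {
            super(left, right);
        }
    }

    public static void main(String[] args) {
        Predicate small = new And(new Eq<>("a.b", 1), new Eq<>("c", "x"));
        System.out.println(small); // and(eq(a.b, 1), eq(c, x))

        // Chain 1000 Or nodes; constructing them allocates no strings at all.
        Predicate chain = new Eq<>("id", 0);
        for (int i = 1; i <= 1000; i++) {
            chain = new Or(chain, new Eq<>("id", i));
        }
        // Only this call materializes the (single) full string.
        System.out.println(chain.toString().length());
    }
}
```

The trade-off in this sketch is that every toString() call rebuilds the string. If repeated calls matter, the result could be cached in a transient field on first use, at the cost of holding the string from then on.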