[
https://issues.apache.org/jira/browse/PARQUET-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Randy Tidd updated PARQUET-1808:
--------------------------------
Description:
The `toString()` method in SimpleGroup uses `+=` for String concatenation, which
is a known performance problem in Java: each `+=` copies the entire accumulated
string, so the total cost grows quadratically with the number of strings appended.
[https://github.com/apache/parquet-mr/blob/d69192809d0d5ec36c0d8c126c8bed09ee3cee35/parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java#L50]
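To make the cost concrete, here is a rough sketch of why repeated `+=` grows quadratically. It uses a simplified model in which every `+=` copies the whole accumulated string; the class and method names are hypothetical, and the real JVM cost also depends on allocation and garbage collection.

```java
public class ConcatCost {
    // Simplified model (an assumption, not a measurement): the k-th `+=`
    // builds a new String of k * len characters, so all of them are copied.
    static long copiedByPlusEquals(int n, int len) {
        long total = 0;
        for (int k = 1; k <= n; k++) {
            total += (long) k * len;
        }
        return total;
    }

    // Joins n copies of piece with `+=`, creating a new String on every pass.
    static String joinWithPlusEquals(String piece, int n) {
        String result = "";
        for (int i = 0; i < n; i++) {
            result += piece;
        }
        return result;
    }

    // Joins n copies of piece by appending into one growable buffer.
    static String joinWithBuilder(String piece, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(piece);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 31,000 pieces of 10 characters each, under the model above.
        System.out.println(copiedByPlusEquals(31_000, 10));
        // Both strategies produce identical text; only the cost differs.
        System.out.println(joinWithPlusEquals("ab", 3).equals(joinWithBuilder("ab", 3)));
    }
}
```

Under this model the characters copied are proportional to n * (n + 1) / 2, while the buffer-based version copies each piece an amortized constant number of times.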
We ran into a performance problem whereby a single column in a Parquet file was
defined as a group:
{code:java}
optional group customer_ids (LIST) {
  repeated group list {
    optional binary element (STRING);
  }
}{code}
and had over 31,000 values. Reading this single column took over 8 minutes due
to time spent in the `toString()` method. Using a different implementation
that uses `StringBuffer` like this:
{code:java}
StringBuffer result = new StringBuffer();
int i = 0;
for (Type field : schema.getFields()) {
  String name = field.getName();
  List<Object> values = data[i];
  ++i;
  if (values != null) {
    if (values.size() > 0) {
      for (Object value : values) {
        result.append(indent);
        result.append(name);
        if (value == null) {
          result.append(": NULL\n");
        } else if (value instanceof Group) {
          result.append("\n");
          result.append(betterToString((SimpleGroup)value, indent+" "));
        } else {
          result.append(": ");
          result.append(value.toString());
          result.append("\n");
        }
      }
    }
  }
}
return result.toString();{code}
reduced that time to less than 500 milliseconds.
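For anyone who wants to try the idea outside of Parquet, below is a minimal self-contained sketch of the same approach. `Node` is a hypothetical stand-in for SimpleGroup (it is not a Parquet class), and the sketch makes two changes that are my assumptions rather than part of the code above: it uses `StringBuilder`, the unsynchronized counterpart of `StringBuffer`, and it passes one shared builder down the recursion so nested groups do not create intermediate strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupPrinter {
    // Hypothetical stand-in for SimpleGroup: field name -> list of values,
    // where a value may be null, a nested Node, or any other object.
    static final class Node {
        final Map<String, List<Object>> fields = new LinkedHashMap<>();

        Node add(String name, Object value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
            return this;
        }
    }

    // Appends the rendering of node to out; nested nodes reuse the same builder.
    static void appendTo(StringBuilder out, Node node, String indent) {
        for (Map.Entry<String, List<Object>> entry : node.fields.entrySet()) {
            for (Object value : entry.getValue()) {
                out.append(indent).append(entry.getKey());
                if (value == null) {
                    out.append(": NULL\n");
                } else if (value instanceof Node) {
                    out.append('\n');
                    appendTo(out, (Node) value, indent + "  ");
                } else {
                    out.append(": ").append(value).append('\n');
                }
            }
        }
    }

    static String render(Node node) {
        StringBuilder out = new StringBuilder();
        appendTo(out, node, "");
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = new Node()
                .add("id", "a")
                .add("id", null)
                .add("list", new Node().add("element", "x"));
        System.out.print(render(root));
    }
}
```

The only string built per nesting level is the indent itself, so the work stays roughly linear in the size of the output.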
The existing implementation performs very poorly because of this well-known Java
string concatenation issue, and it should be fixed.
This was a significant problem for us, but we were able to work around it, so I
am marking this issue as "Minor".
> SimpleGroup.toString() uses String += and so has poor performance
> -----------------------------------------------------------------
>
> Key: PARQUET-1808
> URL: https://issues.apache.org/jira/browse/PARQUET-1808
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Randy Tidd
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)