Github user MickDavies commented on the pull request:
https://github.com/apache/spark/pull/4187#issuecomment-71326437
The dictionary already exists; the change causes an additional array to
be created to hold the converted values, but I do not think this is very
significant. It is possible that the converted Strings in the array
themselves increase non-short-lived memory, but this is probably not an extra
cost, as they would very likely have been referenced further upstream in the
Spark code anyway.
Adding an array to hold converted String values appears to be the
established pattern for implementing this form of converter, and a number of
similar examples can be seen in the Parquet code base, for example:
parquet.avro.AvroIndexedRecordConverter.FieldStringConverter
The improved performance is due not only to the reduced CPU cost of
performing less UTF8 conversion, but also to the significant reduction in
String creation, which results in less GC time.
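The caching pattern described above can be sketched roughly as follows. This is a simplified, standalone illustration, not the actual Parquet converter API: the class name, the `byte[][]` dictionary representation, and the method names are assumptions for the example. The idea is to decode each dictionary entry from UTF-8 bytes to a String once, so repeated values become a plain array lookup:

```java
import java.nio.charset.StandardCharsets;

// Sketch of the dictionary-caching pattern: convert every dictionary
// entry to a String up front, then serve repeated lookups from the
// cached array instead of re-running UTF-8 decoding per value.
public class CachedStringConverter {
    private String[] decoded;  // one converted String per dictionary entry

    // Eagerly convert each dictionary entry exactly once.
    public void setDictionary(byte[][] dictionary) {
        decoded = new String[dictionary.length];
        for (int i = 0; i < dictionary.length; i++) {
            decoded[i] = new String(dictionary[i], StandardCharsets.UTF_8);
        }
    }

    // Per-value path: a single array index, no conversion and no new String.
    public String convert(int dictionaryId) {
        return decoded[dictionaryId];
    }

    public static void main(String[] args) {
        CachedStringConverter c = new CachedStringConverter();
        c.setDictionary(new byte[][] {
            "spark".getBytes(StandardCharsets.UTF_8),
            "parquet".getBytes(StandardCharsets.UTF_8)
        });
        // Repeated ids return the same cached instance, so no garbage is
        // generated on the per-value path.
        System.out.println(c.convert(0) + " " + c.convert(1));
    }
}
```

This also shows why GC pressure drops: only dictionary-many Strings are ever allocated, rather than one per row.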