Github user MickDavies commented on the pull request:
https://github.com/apache/spark/pull/4187#issuecomment-71377752
I don't think the line in question is hot, but I think your suggestions are
good, so I have made the changes.
I also looked a bit more into the Parquet code. I think the array will be
created per column per row group. It looks like Parquet uses a dictionary until
a maximum number of bytes per column per row group has been added; extract from
ParquetOutputFormat:
```java
 * # There is one dictionary page per column per row group when dictionary encoding is used.
 * # The dictionary page size works like the page size but for dictionary
 * parquet.dictionary.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
```
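(As an aside, that budget is just a Hadoop configuration key, so it can be tuned per job. A minimal sketch, assuming nothing beyond the `parquet.dictionary.page.size` key quoted above; the 512 KiB value is an arbitrary example:)

```java
import org.apache.hadoop.conf.Configuration;

public class TuneDictionaryPageSize {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Key and default quoted from the ParquetOutputFormat docs above
        // (default = 1 * 1024 * 1024 bytes). Lowering it shrinks the worst-case
        // per-column-per-row-group dictionary, at the cost of falling back to
        // plain encoding sooner.
        conf.setInt("parquet.dictionary.page.size", 512 * 1024);
    }
}
```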
and, from DictionaryValuesWriter:
```java
/**
 * Will attempt to encode values using a dictionary and fall back to plain encoding
 * if the dictionary gets too big
 *
 * @author Julien Le Dem
 *
 */
public abstract class DictionaryValuesWriter extends ValuesWriter
    implements RequiresFallback
```
Here, bytes is the size of the Binary data plus a 4-byte overhead per entry.
I think this caps the size of this array, and of the related Strings, at a
few MB.
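To make that bound concrete, a back-of-the-envelope sketch (the 20-byte average string length is purely an illustrative assumption, not something from the Parquet code):

```java
public class DictionaryBound {
    public static void main(String[] args) {
        final long maxDictionaryBytes = 1L * 1024 * 1024; // parquet.dictionary.page.size default
        final int perEntryOverhead = 4;                    // 4-byte overhead per Binary entry

        int avgStringBytes = 20; // hypothetical average encoded string length

        long maxEntries = maxDictionaryBytes / (avgStringBytes + perEntryOverhead);
        System.out.printf("at most ~%d entries, ~%d bytes of string data%n",
                maxEntries, maxEntries * (long) avgStringBytes);
        // So the per-column-per-row-group array (and its Strings) stays in the low MBs.
    }
}
```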