[
https://issues.apache.org/jira/browse/PARQUET-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032008#comment-17032008
]
Walid Gara commented on PARQUET-1784:
-------------------------------------
[~gszadovszky]
{quote}
I'm working on a general concept of allowing configuration to be set for
specific columns. See PARQUET-1784 for details.
What do you think of having the mentioned configuration as follows?
{code}
// Assuming conf is a Hadoop Configuration, which needs typed setters for non-String values
conf.setBoolean("parquet.bloom.filter.enabled", false); // Might not be required as this is the default
// The next two might not be necessary, as setting the expected ndv explicitly sets them
conf.setBoolean("parquet.bloom.filter.enabled#content", true);
conf.setBoolean("parquet.bloom.filter.enabled#line", true);
conf.setInt("parquet.bloom.filter.expected.ndv#content", 1000);
conf.setInt("parquet.bloom.filter.expected.ndv#line", 200);
{code}
This might require more writing, but it is clearer and less error-prone.
{quote}
I did some research on how multiple values are passed to a single config
parameter. Spark, Hive, Hadoop and ORC all use a *comma-separated list*. I'm
wondering whether this new style of configuration would hurt the UX for users
who are used to that convention.
Examples of properties:
* Spark: spark.ui.filters
* Hive: hive.metastore.end.function.listeners
* Hadoop: yarn.app.mapreduce.am.env
* ORC: orc.bloom.filter.columns
That said, your suggestion sounds good to me. I find it flexible and, as you
said, less error-prone.
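To illustrate that the two styles carry the same information, here is a minimal sketch that expands a comma-separated column list into per-column keys. The names ColumnListExpansion and expand are hypothetical and not part of any of the projects above; the sketch only assumes the key#column syntax from the quoted proposal and column names that contain no commas.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Spark, Hive, Hadoop, ORC or parquet-mr.
class ColumnListExpansion {

    // Turns a comma-separated column list into one "<baseKey>#<column>" entry per column.
    static Map<String, String> expand(String baseKey, String commaSeparatedColumns, String value) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String column : commaSeparatedColumns.split(",")) {
            String trimmed = column.trim();
            if (!trimmed.isEmpty()) { // skip empty items such as a trailing comma
                result.put(baseKey + "#" + trimmed, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A list in the comma-separated style becomes two column-specific keys.
        Map<String, String> conf = expand("parquet.bloom.filter.enabled", "content, line", "true");
        conf.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

With the suffix style, each column can then get its own value (for example a different expected ndv), which a single comma-separated list cannot express without extra syntax.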
> Column-wise configuration
> -------------------------
>
> Key: PARQUET-1784
> URL: https://issues.apache.org/jira/browse/PARQUET-1784
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
> Labels: pull-request-available
>
> After adding some new statistics and encodings to Parquet, it is getting
> very hard to be smart and choose the best configs automatically. For example,
> for which columns should we save column indexes and/or bloom filters? Is it
> worth using a dictionary for a column that we know will fall back to another
> encoding?
> The idea of this feature is to allow the library user to fine-tune the
> configuration by setting it column-wise. To support this we extend the
> existing configuration keys by a suffix to identify the related column. (From
> now on we introduce new keys following the same syntax.)
> \{key of the configuration}{{#}}\{column path in the file schema}
> For example: {{parquet.enable.dictionary#column.path.col_1}}
> This Jira covers the framework to support the column-wise configuration with
> the implementation of some existing configs where it makes sense (e.g.
> {{parquet.enable.dictionary}}). Implementing new configurations is not part of
> this effort.
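
For illustration only, a lookup that follows the key syntax described above could try the column-specific key first and fall back to the plain key. This is a sketch over a plain java.util.Map, not the parquet-mr implementation; ColumnConfigLookup and resolve are hypothetical names, and the fallback order is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the real parquet-mr framework may resolve keys differently.
class ColumnConfigLookup {

    // Returns the value of "<key>#<columnPath>" if set, else the value of "<key>", else defaultValue.
    static String resolve(Map<String, String> conf, String key, String columnPath, String defaultValue) {
        String columnValue = conf.get(key + "#" + columnPath);
        if (columnValue != null) {
            return columnValue;
        }
        return conf.getOrDefault(key, defaultValue);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("parquet.enable.dictionary", "true");
        conf.put("parquet.enable.dictionary#column.path.col_1", "false");

        // The column-specific setting overrides the general one for col_1 only.
        System.out.println(resolve(conf, "parquet.enable.dictionary", "column.path.col_1", "true"));
        System.out.println(resolve(conf, "parquet.enable.dictionary", "column.path.col_2", "true"));
    }
}
```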