[jira] [Commented] (PARQUET-1784) Column-wise configuration

Walid Gara (Jira) Thu, 06 Feb 2020 15:12:18 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032008#comment-17032008
 ]


Walid Gara commented on PARQUET-1784:
-------------------------------------

[~gszadovszky]
{quote}
I'm working on a general concept of allowing configuration to be set for 
specific columns. See PARQUET-1784 for details.
What do you think of having the mentioned configuration as follows?
{code}
// Might not be required as this is the default
conf.set("parquet.bloom.filter.enabled", false);
// Might not be necessary as by setting the expected ndv you explicitly set this one
conf.set("parquet.bloom.filter.enabled#content", true);
// Might not be necessary as by setting the expected ndv you explicitly set this one
conf.set("parquet.bloom.filter.enabled#line", true);
conf.set("parquet.bloom.filter.expected.ndv#content", 1000);
conf.set("parquet.bloom.filter.expected.ndv#line", 200);
{code}
This might require more writing, but it is clearer and less error-prone.
{quote}
I did some research on how other projects pass multiple values to a single 
config parameter. Spark, Hive, Hadoop and ORC all use a *comma-separated list*. 
I'm wondering whether this new style of configuration would hurt the UX by 
departing from that convention.

Examples of properties:
 * Spark: spark.ui.filters
 * Hive: hive.metastore.end.function.listeners
 * Hadoop: yarn.app.mapreduce.am.env
 * ORC: orc.bloom.filter.columns

That said, your suggestion sounds good to me: I find it flexible and, as you 
said, less error-prone.
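
To make the comparison concrete, here is a rough sketch of how a 
comma-separated column list (the style ORC uses) could be translated into the 
per-column keys proposed above. The class and method names are hypothetical and 
only illustrate the idea; this is not parquet-mr code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ColumnListExpander {
    // Hypothetical helper: turns a comma-separated column list such as
    // "content, line" into one "<baseKey>#<column>" = "true" entry per column.
    static Map<String, String> expand(String baseKey, String commaSeparatedColumns) {
        Map<String, String> perColumn = new LinkedHashMap<>();
        for (String column : commaSeparatedColumns.split(",")) {
            String trimmed = column.trim();
            if (!trimmed.isEmpty()) {
                perColumn.put(baseKey + "#" + trimmed, "true");
            }
        }
        return perColumn;
    }

    public static void main(String[] args) {
        Map<String, String> conf = expand("parquet.bloom.filter.enabled", "content, line");
        conf.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

A plain list of column names can only switch a feature on or off per column. 
A per-column value such as the expected ndv would need extra syntax in the list 
style, whereas the suffixed keys carry it naturally.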

 

> Column-wise configuration
> -------------------------
>
>                 Key: PARQUET-1784
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1784
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>              Labels: pull-request-available
>
> After adding some new statistics and encodings to Parquet, it is getting 
> very hard to be smart and choose the best configs automatically. For example, 
> for which columns should we save column indexes and/or bloom filters? Is it 
> worth using a dictionary for a column that we know will fall back to another 
> encoding?
> The idea of this feature is to allow the library user to fine-tune the 
> configuration by setting it column-wise. To support this, we extend the 
> existing configuration keys with a suffix that identifies the related column. 
> (From now on, new keys are introduced following the same syntax.)
>  \{key of the configuration}{{#}}\{column path in the file schema}
>  For example: {{parquet.enable.dictionary#column.path.col_1}}
> This jira covers the framework to support column-wise configuration, along 
> with the implementation for some existing configs where it makes sense (e.g. 
> {{parquet.enable.dictionary}}). Implementing new configurations is not part 
> of this effort.
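
The key syntax described above could be resolved roughly as in the sketch 
below. It assumes that a column-specific value overrides the global key and 
that the global key overrides a built-in default; that lookup order is an 
assumption, and the class and method names are hypothetical rather than part 
of parquet-mr.

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnConfLookup {
    // Builds "<key>#<columnPath>", the suffix syntax described in the issue.
    static String columnKey(String key, String columnPath) {
        return key + "#" + columnPath;
    }

    // Assumed resolution order: column-specific value, then global value,
    // then the supplied default.
    static boolean getBoolean(Map<String, String> conf, String key,
                              String columnPath, boolean defaultValue) {
        String value = conf.get(columnKey(key, columnPath));
        if (value == null) {
            value = conf.get(key);
        }
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("parquet.enable.dictionary", "true");
        conf.put("parquet.enable.dictionary#column.path.col_1", "false");

        // col_1 has its own setting; col_2 falls back to the global key.
        System.out.println(getBoolean(conf, "parquet.enable.dictionary", "column.path.col_1", true));
        System.out.println(getBoolean(conf, "parquet.enable.dictionary", "column.path.col_2", true));
    }
}
```

Under these assumptions the first lookup prints false and the second prints 
true.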



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
