[ https://issues.apache.org/jira/browse/CARBONDATA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361080#comment-15361080 ]
ASF GitHub Bot commented on CARBONDATA-35:
------------------------------------------
Github user QiangCai commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/16#discussion_r69431460
--- Diff: integration/spark/src/main/scala/org/carbondata/spark/util/GlobalDictionaryUtil.scala ---
@@ -470,12 +593,22 @@ object GlobalDictionaryUtil extends Logging {
       else {
         carbonLoadModel.getCsvHeader.split("" + CSVWriter.DEFAULT_SEPARATOR)
       }
-      val (requireDimension, requireColumnNames) = pruneDimensions(dimensions, headers, df.columns)
+      // generate global dict from pre defined column dict file
+      val colDictFilePath = carbonLoadModel.getColDictFilePath
+      carbonLoadModel.initPredefDictMap()
--- End diff --
Please put this inside the next line.
> generate global dict using pre-defined dict from external column file
> ---------------------------------------------------------------------
>
> Key: CARBONDATA-35
> URL: https://issues.apache.org/jira/browse/CARBONDATA-35
> Project: CarbonData
> Issue Type: New Feature
> Reporter: Jay
> Priority: Minor
>
> A user can set colName:columnFilePath in the load DML to point to a file that
> provides the small set of distinct values for a column. Carbon can then use
> these distinct values to generate the dictionary and avoid reading the large
> raw CSV file. This is a new feature and can improve performance.
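
For reference, the idea in the issue description can be sketched roughly as follows. This is only an illustration, not the code in the pull request: the object name PredefinedDictSketch and both helper functions are hypothetical, the comma-separated list of colName:columnFilePath pairs is an assumed input form, and assigning surrogate keys from 1 in first-seen order is an assumption as well.

```scala
// Hypothetical sketch; none of these names come from CarbonData itself.
object PredefinedDictSketch {

  // Parse a spec such as "colA:/path/a.dict, colB:/path/b.dict" into a
  // column -> file path map. Only the first ':' splits an entry, so paths
  // that contain ':' themselves (e.g. hdfs://...) are kept intact.
  // Assumption: several pairs are separated by commas.
  def parseColumnDictSpec(spec: String): Map[String, String] =
    spec.split(",").iterator
      .map(_.trim)
      .filter(_.nonEmpty)
      .map { entry =>
        val idx = entry.indexOf(':')
        require(idx > 0 && idx < entry.length - 1, s"malformed entry: $entry")
        entry.substring(0, idx).trim -> entry.substring(idx + 1).trim
      }
      .toMap

  // Build a value -> surrogate key dictionary from the distinct values of one
  // column. Assumption: keys start at 1 and follow first-seen order.
  def buildDictionary(values: Iterator[String]): Map[String, Int] =
    values.map(_.trim).filter(_.nonEmpty).toSeq.distinct.zipWithIndex
      .map { case (value, index) => value -> (index + 1) }
      .toMap
}
```

With something like this, the values passed to buildDictionary would come from the small pre-defined column file (for example via scala.io.Source.fromFile(path).getLines()) rather than from a scan of the raw CSV.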