[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option
Github user lion-x commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/200#discussion_r83039531 --- Diff: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/dataload/TestDataLoadWithTrimOption.scala --- @@ -0,0 +1,78 @@ +package org.apache.carbondata.spark.testsuite.dataload + +import java.io.File + +import org.apache.carbondata.core.constants.CarbonCommonConstants +import org.apache.carbondata.core.util.CarbonProperties +import org.apache.spark.sql.common.util.CarbonHiveContext._ +import org.apache.spark.sql.common.util.QueryTest +import org.scalatest.BeforeAndAfterAll +import org.apache.spark.sql.Row + +/** + * Created by x00381807 on 2016/9/26. --- End diff -- Oh, my fault --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/200#discussion_r82523962 --- Diff: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/dataload/TestDataLoadWithTrimOption.scala --- @@ -0,0 +1,78 @@ +package org.apache.carbondata.spark.testsuite.dataload + +import java.io.File + +import org.apache.carbondata.core.constants.CarbonCommonConstants +import org.apache.carbondata.core.util.CarbonProperties +import org.apache.spark.sql.common.util.CarbonHiveContext._ +import org.apache.spark.sql.common.util.QueryTest +import org.scalatest.BeforeAndAfterAll +import org.apache.spark.sql.Row + +/** + * Created by x00381807 on 2016/9/26. --- End diff -- please remove --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/200#discussion_r82977592 --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java --- @@ -102,8 +102,8 @@ public void initialize() throws IOException { parserSettings.setMaxColumns( getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns())); parserSettings.setNullValue(""); -parserSettings.setIgnoreLeadingWhitespaces(false); -parserSettings.setIgnoreTrailingWhitespaces(false); +parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim()); --- End diff -- pros of this approach will be suppose in one load user loaded with dirty data and suddenly he realizes no i need to trim then in the next load he will enable the option and load the data, this will increase the dictionary space also, also in query dictionary lookup overhead will increase. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/200#discussion_r82968468 --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java --- @@ -102,8 +102,8 @@ public void initialize() throws IOException { parserSettings.setMaxColumns( getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns())); parserSettings.setNullValue(""); -parserSettings.setIgnoreLeadingWhitespaces(false); -parserSettings.setIgnoreTrailingWhitespaces(false); +parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim()); --- End diff -- Guys i think if while data loading we are reading from configuration inorder to trim it or not same we need to do while doing filter also, based on configuration value decide. ex: while loading i enabled trim property, so system will trim and load the data, now in filter query also if user provides while space then it needs to be trimmed while creating filter model. This will provide more system consistentency. if user enable trim then we wont trim it while loading and also while querying. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option
Github user lion-x commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/200#discussion_r82960653 --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java --- @@ -102,8 +102,8 @@ public void initialize() throws IOException { parserSettings.setMaxColumns( getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns())); parserSettings.setNullValue(""); -parserSettings.setIgnoreLeadingWhitespaces(false); -parserSettings.setIgnoreTrailingWhitespaces(false); +parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim()); --- End diff -- hi sujith, I agree with Eason, when user query with some space filter, it should be considered as forbidden action. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option
Github user eason-lyx commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/200#discussion_r82923303 --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java --- @@ -102,8 +102,8 @@ public void initialize() throws IOException { parserSettings.setMaxColumns( getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns())); parserSettings.setNullValue(""); -parserSettings.setIgnoreLeadingWhitespaces(false); -parserSettings.setIgnoreTrailingWhitespaces(false); +parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim()); --- End diff -- hi sujith i am thinking if user already trim the data with the option setting,then when user query with some space filter,it would no getting result. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/200#discussion_r82758567 --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java --- @@ -102,8 +102,8 @@ public void initialize() throws IOException { parserSettings.setMaxColumns( getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns())); parserSettings.setNullValue(""); -parserSettings.setIgnoreLeadingWhitespaces(false); -parserSettings.setIgnoreTrailingWhitespaces(false); +parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim()); --- End diff -- Same you need to handle in CarbonFilters.scala also, since while processing filter expressions the spaces are not getting trimmed, so here also you need to take care based on your feature --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option
Github user lion-x commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/200#discussion_r81306658 --- Diff: hadoop/src/test/java/org/apache/carbondata/hadoop/test/util/StoreCreator.java --- @@ -465,6 +466,7 @@ private static void generateGraph(IDataProcessStatus schmaModel, SchemaInfo info model.setEscapeCharacter(schmaModel.getEscapeCharacter()); model.setQuoteCharacter(schmaModel.getQuoteCharacter()); model.setCommentCharacter(schmaModel.getCommentCharacter()); +model.setTrim(schmaModel.getTrim()); --- End diff -- In some cases, users want to trim some string field in csv, he set in Load DML option, and we need to handle to in csvinput step. So we the trim should be transformed to CSVinput step. For Example, in some usecases, the LeadingWhiteSpace in a String, like " fsdfsd" maybe meaningful. Users want to keep the LeadingWhiteSpace. So we add this option for users, let them choose. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/200#discussion_r81281079 --- Diff: hadoop/src/test/java/org/apache/carbondata/hadoop/test/util/StoreCreator.java --- @@ -465,6 +466,7 @@ private static void generateGraph(IDataProcessStatus schmaModel, SchemaInfo info model.setEscapeCharacter(schmaModel.getEscapeCharacter()); model.setQuoteCharacter(schmaModel.getQuoteCharacter()); model.setCommentCharacter(schmaModel.getCommentCharacter()); +model.setTrim(schmaModel.getTrim()); --- End diff -- Not clear about the usecase, can you make it more clear by providing more details --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/200#discussion_r81281050 --- Diff: hadoop/src/test/java/org/apache/carbondata/hadoop/test/util/StoreCreator.java --- @@ -465,6 +466,7 @@ private static void generateGraph(IDataProcessStatus schmaModel, SchemaInfo info model.setEscapeCharacter(schmaModel.getEscapeCharacter()); model.setQuoteCharacter(schmaModel.getQuoteCharacter()); model.setCommentCharacter(schmaModel.getCommentCharacter()); +model.setTrim(schmaModel.getTrim()); --- End diff -- why we need to set this in schemamodel? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---