Abstracting CarbonData's Index Interface
Hi community,

Currently, CarbonData has built-in index support, which is one of its key strengths. Using the index, CarbonData can answer filter queries very quickly by pruning at the block and blocklet level. However, the index also introduces memory consumption for the index tree and impacts first-query time, because the index must be loaded from the file footer into memory. On top of that, in a multi-tenant environment, multiple applications may access the same data files simultaneously, which further exacerbates this resource consumption issue.

So, I want to propose and discuss a solution with you to solve this problem and to make an abstraction of the interface for CarbonData's future evolution. I think the final result of this work should achieve at least two goals:

Goal 1: Users can choose where to store the index data. It can be stored in the processing framework's memory space (like the Spark driver's memory) or in another service outside the processing framework (like an independent database service).

Goal 2: Developers can add more indices of their choice to CarbonData files. Besides the B+ tree on the multi-dimensional key that CarbonData currently supports, developers are free to add other indexing technologies to make certain workloads faster. These new indices should be added in a pluggable way.

In order to achieve these goals, an abstraction needs to be created in the CarbonData project, including:

- Segment: each segment represents one load of data, and is tied to the indices created with that load.
- Index: an index is created when its segment is created, and is leveraged when CarbonInputFormat's getSplits is called, to filter out the required blocks or even blocklets.
- CarbonInputFormat: there may be N indices created for the data files; when querying these data files, the InputFormat should know how to access these indices, and initialize or load them if required.

Obviously, this work should be separated into different tasks and implemented gradually.
But first of all, let's discuss the goals and the proposed approach. What is your idea? Regards, Jacky -- View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587.html Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
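To make the proposed abstraction concrete, here is a minimal sketch of how a Segment could tie one data load to pluggable indices that prune blocks during split computation. All class and method names here (BlockIndex, MinMaxIndex, prune, and the toy min/max logic) are illustrative assumptions for discussion, not CarbonData's actual API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: an Index prunes candidate blocks for a filter range,
// and a Segment carries the indices created with one load of data.
interface BlockIndex {
    /** Return only the blocks that may contain rows matching [lo, hi]. */
    List<String> prune(int lo, int hi);
}

/** A toy min/max index: one [min, max] range recorded per block. */
class MinMaxIndex implements BlockIndex {
    private final Map<String, int[]> ranges = new LinkedHashMap<>();

    void addBlock(String blockId, int min, int max) {
        ranges.put(blockId, new int[] {min, max});
    }

    @Override
    public List<String> prune(int lo, int hi) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, int[]> e : ranges.entrySet()) {
            int[] r = e.getValue();
            // keep the block only if its [min, max] overlaps [lo, hi]
            if (r[1] >= lo && r[0] <= hi) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}

/** A segment is one load of data plus the indices created with it. */
class Segment {
    final String id;
    final List<BlockIndex> indices = new ArrayList<>();
    Segment(String id) { this.id = id; }
}

public class IndexSketch {
    public static void main(String[] args) {
        MinMaxIndex idx = new MinMaxIndex();
        idx.addBlock("block-0", 0, 99);
        idx.addBlock("block-1", 100, 199);
        Segment seg = new Segment("segment-0");
        seg.indices.add(idx);
        // getSplits() would consult the segment's indices like this:
        System.out.println(idx.prune(150, 160));  // [block-1]
    }
}
```

In this shape, storing index data elsewhere (Goal 1) becomes a question of where a BlockIndex implementation keeps its ranges, and adding new index types (Goal 2) means adding new BlockIndex implementations.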
[GitHub] incubator-carbondata pull request #189: [CARBONDATA-267] Set block_size for ...
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/189#discussion_r81310297 --- Diff: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchema.scala --- @@ -419,6 +420,7 @@ class TableNewProcessor(cm: tableModel, sqlContext: SQLContext) { schemaEvol .setSchemaEvolutionEntryList(new util.ArrayList[SchemaEvolutionEntry]()) tableSchema.setTableId(UUID.randomUUID().toString) + tableSchema.setBlocksize(Integer.parseInt(cm.tableBlockSize.getOrElse(0).toString)) --- End diff -- What if the user provides a value like 1024M or 1MB? You will get an exception, and that case has not been handled. I think we need to handle it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
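The concern above is that a bare Integer.parseInt throws NumberFormatException for values like "1024M" or "1MB". A hedged sketch of suffix-tolerant parsing follows; the class and method names are illustrative, not CarbonData's actual code, and it assumes the block size is expressed in megabytes.

```java
// Illustrative sketch: parse a table_blocksize value that may carry a
// unit suffix ("512", "512M", or "512MB"), instead of failing in
// Integer.parseInt. Not CarbonData's real API.
public class BlockSizeParser {
    /** Returns the block size in MB; accepts "512", "512M", or "512MB". */
    public static int parseBlockSizeMb(String value) {
        String v = value.trim().toUpperCase();
        if (v.endsWith("MB")) {
            v = v.substring(0, v.length() - 2);
        } else if (v.endsWith("M")) {
            v = v.substring(0, v.length() - 1);
        }
        try {
            return Integer.parseInt(v.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "Invalid block size value: " + value, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseBlockSizeMb("1024M"));  // 1024
        System.out.println(parseBlockSizeMb("1MB"));    // 1
        System.out.println(parseBlockSizeMb("256"));    // 256
    }
}
```

Wrapping the failure in IllegalArgumentException lets the DDL layer surface a clear error message to the user instead of a raw NumberFormatException.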
[GitHub] incubator-carbondata pull request #205: [CARBONDATA-281]Vt enhancement for t...
GitHub user ravikiran23 opened a pull request: https://github.com/apache/incubator-carbondata/pull/205 [CARBONDATA-281]VT enhancement for the Life Cycle Management module Why raise this PR? Improving the test cases for the LCM module. Test cases for scenarios like: 1. added boundary tests for compaction, such as 0 loads and 1 load. 2. added a blocklet boundary test case. 3. added tests to verify the minor compaction threshold. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ravikiran23/incubator-carbondata VTEnhancement1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-carbondata/pull/205.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #205 commit ff4d2a3701e931fa56a8b70a5d4b531324127a51 Author: ravikiran Date: 2016-09-08T12:22:37Z enhancement of VT for LCM. commit 84fa3c743a74583bf5579d9b4e69016b33f5b46a Author: ravikiran Date: 2016-09-20T17:32:23Z enhancing vt commit f098b647d5a14961a2ef4603696d99a4896cbbea Author: ravikiran Date: 2016-09-30T09:23:50Z adding the blocklet test case.
[jira] [Created] (CARBONDATA-281) improve the test cases in LCM module.
ravikiran created CARBONDATA-281: Summary: improve the test cases in the LCM module. Key: CARBONDATA-281 URL: https://issues.apache.org/jira/browse/CARBONDATA-281 Project: CarbonData Issue Type: Improvement Components: spark-integration Affects Versions: 0.1.0-incubating Reporter: ravikiran Assignee: ravikiran Priority: Minor Improving the test cases in the LCM module: adding boundary test cases for compaction, and test cases to verify the minor compaction threshold check. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option
Github user lion-x commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/200#discussion_r81306658 --- Diff: hadoop/src/test/java/org/apache/carbondata/hadoop/test/util/StoreCreator.java --- @@ -465,6 +466,7 @@ private static void generateGraph(IDataProcessStatus schmaModel, SchemaInfo info model.setEscapeCharacter(schmaModel.getEscapeCharacter()); model.setQuoteCharacter(schmaModel.getQuoteCharacter()); model.setCommentCharacter(schmaModel.getCommentCharacter()); +model.setTrim(schmaModel.getTrim()); --- End diff -- In some cases, users want to trim some string fields in the CSV. They set this in the Load DML options, and we need to handle it in the CSV input step, so the trim option should be passed through to the CSV input step. For example, in some use cases the leading whitespace in a string, like " fsdfsd", may be meaningful, and users want to keep it. So we add this option to let users choose.
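The behavior described above can be sketched as a single conditional in the CSV-input step: trim only when the Load DML option enables it, otherwise preserve meaningful leading whitespace. The class and method names below are illustrative assumptions, not CarbonData's real code.

```java
// Illustrative sketch: apply the user's trim option per CSV field.
// When trim is off, a leading space as in " fsdfsd" is preserved.
public class CsvFieldTrim {
    public static String applyTrim(String field, boolean trimEnabled) {
        return trimEnabled ? field.trim() : field;
    }

    public static void main(String[] args) {
        System.out.println("[" + applyTrim(" fsdfsd", true) + "]");   // [fsdfsd]
        System.out.println("[" + applyTrim(" fsdfsd", false) + "]");  // [ fsdfsd]
    }
}
```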
[GitHub] incubator-carbondata pull request #189: [CARBONDATA-267] Set block_size for ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/189#discussion_r81299063 --- Diff: format/src/main/thrift/schema.thrift --- @@ -124,6 +124,7 @@ struct TableSchema{ 1: required string table_id; // ID used to 2: required list table_columns; // Columns in the table 3: required SchemaEvolution schema_evolution; // History of schema evolution of this table + 4: optional i32 block_size --- End diff -- @Zhangshunyu ...if you register the property in the Hive metastore, the property will never be lost, even on restart of the thrift server, as we do for tablepath...you can check the create table flow.