Abstracting CarbonData's Index Interface

2016-09-30 Thread Jacky Li
Hi community,

Currently CarbonData has built-in index support, which is one of its key
strengths. Using the index, CarbonData can answer filter queries very quickly
by pruning at the block and blocklet level. However, the index also consumes
memory for the index tree and increases first-query time, because the index
must be loaded from the file footers into memory. Moreover, in a multi-tenant
environment, multiple applications may access the data files simultaneously,
which further exacerbates this resource consumption.
So, I want to propose and discuss a solution with you to solve this
problem and to create an interface abstraction for CarbonData's future
evolution.
I think the final result of this work should achieve at least two
goals:

Goal 1: Users can choose where to store the index data: it can be stored in
the processing framework's memory space (like the Spark driver's memory) or in
another service outside of the processing framework (like an independent
database service).

Goal 2: Developers can add more indices of their choice to CarbonData files.
Besides the B+ tree on the multi-dimensional key that CarbonData currently
supports, developers are free to add other indexing technologies to make
certain workloads faster. These new indices should be added in a pluggable way.

In order to achieve these goals, an abstraction needs to be created for the
CarbonData project, including:

- Segment: each segment represents one load of data, and is tied to the
indices created with that load.

- Index: an index is created when its segment is created, and is leveraged
when CarbonInputFormat's getSplits is called, to filter out the required
blocks or even blocklets.

- CarbonInputFormat: there may be any number of indices created for a data
file; when querying these data files, the InputFormat should know how to
access these indices, and initialize or load them if required.
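As a rough sketch of the shape this abstraction could take (all names here are hypothetical and purely illustrative, not an agreed API), a Segment ties a load of data to its indices, and an Index prunes blocks at getSplits time. The tiny min/max index below is only a stand-in to show the pruning contract:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexAbstractionSketch {

  // A Segment represents one load of data, tied to the indices built for it.
  interface Segment {
    String id();
    List<Index> indices();
  }

  // An Index is built when the segment is created and consulted at
  // getSplits() time to prune blocks/blocklets that cannot match the filter.
  interface Index {
    List<String> prune(String filterValue);
  }

  // Minimal in-memory stand-in: an index keeping a [min, max] range per block.
  static class MinMaxIndex implements Index {
    private final List<String> blocks;
    private final List<int[]> minMax; // one [min, max] pair per block

    MinMaxIndex(List<String> blocks, List<int[]> minMax) {
      this.blocks = blocks;
      this.minMax = minMax;
    }

    // Equality filter: keep only blocks whose [min, max] range covers v.
    @Override
    public List<String> prune(String filterValue) {
      int v = Integer.parseInt(filterValue);
      List<String> result = new ArrayList<>();
      for (int i = 0; i < blocks.size(); i++) {
        if (minMax.get(i)[0] <= v && v <= minMax.get(i)[1]) {
          result.add(blocks.get(i));
        }
      }
      return result;
    }
  }

  public static void main(String[] args) {
    Index idx = new MinMaxIndex(
        List.of("block-0", "block-1"),
        List.of(new int[] {0, 9}, new int[] {10, 19}));
    // Only block-1's range can contain the value 15.
    System.out.println(idx.prune("15"));
  }
}
```

With interfaces like these, a pluggable index only has to implement prune(), and where the index data lives (driver memory or an external service) stays an implementation detail behind the interface.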

Obviously, this work should be separated into different tasks and
implemented gradually. But first of all, let's discuss the goals and the
proposed approach. What is your idea?
 
Regards,
Jacky





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


[GitHub] incubator-carbondata pull request #189: [CARBONDATA-267] Set block_size for ...

2016-09-30 Thread sujith71955
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/189#discussion_r81310297
  
--- Diff: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchema.scala
 ---
@@ -419,6 +420,7 @@ class TableNewProcessor(cm: tableModel, sqlContext: 
SQLContext) {
 schemaEvol
   .setSchemaEvolutionEntryList(new 
util.ArrayList[SchemaEvolutionEntry]())
 tableSchema.setTableId(UUID.randomUUID().toString)
+
tableSchema.setBlocksize(Integer.parseInt(cm.tableBlockSize.getOrElse(0).toString))
--- End diff --

What if the user provides a value like 1024M or 1MB? You will get an
exception, and that case has not been handled.
I think we need to handle it.
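To illustrate the concern, a tolerant parser could accept both plain numbers and unit suffixes instead of throwing on "1024M". This is a hypothetical sketch, not CarbonData's actual handling; the method name and the fallback-to-default behavior are assumptions:

```java
public class BlockSizeParser {

  /** Returns the block size in MB, or the given default if unparsable. */
  public static int parseBlockSizeMb(String value, int defaultMb) {
    if (value == null) {
      return defaultMb;
    }
    String v = value.trim().toUpperCase();
    // Strip an optional "MB" or "M" unit suffix before parsing the number.
    if (v.endsWith("MB")) {
      v = v.substring(0, v.length() - 2);
    } else if (v.endsWith("M")) {
      v = v.substring(0, v.length() - 1);
    }
    try {
      return Integer.parseInt(v.trim());
    } catch (NumberFormatException e) {
      return defaultMb; // e.g. "abc" or "" falls back instead of throwing
    }
  }

  public static void main(String[] args) {
    System.out.println(parseBlockSizeMb("1024M", 0)); // 1024
    System.out.println(parseBlockSizeMb("1MB", 0));   // 1
    System.out.println(parseBlockSizeMb("bad", 256)); // 256
  }
}
```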


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #205: [CARBONDATA-281]Vt enhancement for t...

2016-09-30 Thread ravikiran23
GitHub user ravikiran23 opened a pull request:

https://github.com/apache/incubator-carbondata/pull/205

[CARBONDATA-281]Vt enhancement for the Life cycle management module

why raise this PR ? 
Improving the test cases for the LCM module.
Test cases for the scenarios like : 
1. added the boundry tests for compaction like 0 loads , 1 load.
2. added blocklet boundry test case.
3. added the tests to verify the minor compaction threshold.
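The boundary cases above hinge on one predicate: compaction must be a no-op for 0 or 1 loads and only trigger once the threshold is reached. A minimal sketch of that check (illustrative only; this is not the actual LCM code, and the names are assumptions):

```java
public class CompactionThresholdSketch {

  // Returns true when a minor compaction should be triggered: at least
  // two segments must exist and the configured threshold must be reached.
  static boolean shouldCompact(int segmentCount, int threshold) {
    return segmentCount > 1 && segmentCount >= threshold;
  }

  public static void main(String[] args) {
    System.out.println(shouldCompact(0, 4)); // 0 loads: nothing to compact
    System.out.println(shouldCompact(1, 4)); // 1 load: nothing to compact
    System.out.println(shouldCompact(4, 4)); // threshold reached
  }
}
```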



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ravikiran23/incubator-carbondata 
VTEnhancement1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/205.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #205


commit ff4d2a3701e931fa56a8b70a5d4b531324127a51
Author: ravikiran 
Date:   2016-09-08T12:22:37Z

enhancement of VT for LCM.

commit 84fa3c743a74583bf5579d9b4e69016b33f5b46a
Author: ravikiran 
Date:   2016-09-20T17:32:23Z

enhancing vt

commit f098b647d5a14961a2ef4603696d99a4896cbbea
Author: ravikiran 
Date:   2016-09-30T09:23:50Z

adding the blocklet test case.






[jira] [Created] (CARBONDATA-281) improve the test cases in LCM module.

2016-09-30 Thread ravikiran (JIRA)
ravikiran created CARBONDATA-281:


 Summary: improve the test cases in LCM module.
 Key: CARBONDATA-281
 URL: https://issues.apache.org/jira/browse/CARBONDATA-281
 Project: CarbonData
  Issue Type: Improvement
  Components: spark-integration
Affects Versions: 0.1.0-incubating
Reporter: ravikiran
Assignee: ravikiran
Priority: Minor


Improving the test cases in the LCM module:
 adding test cases for compaction with boundary cases.
 adding test cases to verify the minor compaction threshold check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

2016-09-30 Thread lion-x
Github user lion-x commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/200#discussion_r81306658
  
--- Diff: 
hadoop/src/test/java/org/apache/carbondata/hadoop/test/util/StoreCreator.java 
---
@@ -465,6 +466,7 @@ private static void generateGraph(IDataProcessStatus 
schmaModel, SchemaInfo info
 model.setEscapeCharacter(schmaModel.getEscapeCharacter());
 model.setQuoteCharacter(schmaModel.getQuoteCharacter());
 model.setCommentCharacter(schmaModel.getCommentCharacter());
+model.setTrim(schmaModel.getTrim());
--- End diff --

In some cases, users want to trim some string fields in the CSV. The user sets 
this in the Load DML options, and we need to handle it in the csvinput step, so 
the trim option should be passed through to the CSVInput step.

For example, in some use cases the leading whitespace in a string, like " 
fsdfsd", may be meaningful, and users want to keep it. So we add this option 
for users, to let them choose.
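The intent can be sketched in a few lines: the load option decides, per field, whether whitespace is stripped in the csvinput step. This is an illustrative stand-in, not the actual CarbonData code; the class and method names are assumptions:

```java
public class CsvTrimOption {

  private final boolean trim;

  public CsvTrimOption(boolean trim) {
    this.trim = trim;
  }

  // Applied to each field as rows are parsed in the csvinput step.
  public String readField(String raw) {
    // With trim=false, a meaningful leading space in " fsdfsd" is preserved.
    return trim ? raw.trim() : raw;
  }

  public static void main(String[] args) {
    System.out.println("[" + new CsvTrimOption(true).readField(" fsdfsd") + "]");
    System.out.println("[" + new CsvTrimOption(false).readField(" fsdfsd") + "]");
  }
}
```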





[GitHub] incubator-carbondata pull request #189: [CARBONDATA-267] Set block_size for ...

2016-09-30 Thread manishgupta88
Github user manishgupta88 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/189#discussion_r81299063
  
--- Diff: format/src/main/thrift/schema.thrift ---
@@ -124,6 +124,7 @@ struct TableSchema{
1: required string table_id;  // ID used to
2: required list table_columns; // Columns in the table
3: required SchemaEvolution schema_evolution; // History of schema 
evolution of this table
+   4: optional i32 block_size
--- End diff --

@Zhangshunyu ...if you register the property in the Hive metastore, the 
property will never be lost, even on a restart of the Thrift server, just as 
we do for tablepath...you can follow the create table flow

