[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83147351
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow.iterators;
+
+import java.io.IOException;
+
+import org.apache.carbondata.common.CarbonIterator;
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+
+import org.apache.hadoop.mapred.RecordReader;
+
+/**
+ * This iterator iterates RecordReader.
+ */
+public class RecordReaderIterator extends CarbonIterator {
--- End diff --

It is used for iterating a RecordReader. I can move it to the carbon-hadoop 
module, but then the processing module would need to depend on it. The 
processing module already depends on the hadoop module, so the two modules 
would become dependent on each other.
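
For context, a minimal sketch of what such a RecordReader adapter typically looks like. The class body is truncated in the quoted diff, so the fields and method bodies below are assumptions, including that `CarbonIterator<E>` implements `java.util.Iterator<E>`:

```
import java.io.IOException;
import java.util.NoSuchElementException;

import org.apache.carbondata.common.CarbonIterator;

import org.apache.hadoop.mapred.RecordReader;

// Sketch only: adapts a Hadoop mapred RecordReader to an Iterator-style API.
public class RecordReaderIterator<K, V> extends CarbonIterator<V> {

  private final RecordReader<K, V> reader;
  private final K key;
  private final V value;
  private boolean fetched;  // true while `value` holds an unconsumed record
  private boolean hasMore;

  public RecordReaderIterator(RecordReader<K, V> reader) {
    this.reader = reader;
    this.key = reader.createKey();
    this.value = reader.createValue();
  }

  @Override public boolean hasNext() {
    if (!fetched) {
      try {
        hasMore = reader.next(key, value);
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
      fetched = true;
    }
    return hasMore;
  }

  @Override public V next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    fetched = false;
    // Note: mapred RecordReaders reuse this object, so the caller must
    // consume the record before calling next() again.
    return value;
  }
}
```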




[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83147018
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
+  CarbonDataLoadingException;
+
+  /**
+   * Tranform the data as per the implemetation.
+   * @return Iterator of data
+   * @throws CarbonDataLoadingException
+   */
+  Iterator execute() throws CarbonDataLoadingException;
+
+  /**
+   * Any closing of resources after step execution can be done here.
+   */
+  void finish();
--- End diff --

OK. I will add a `close` method along with the `finish` method.




[GitHub] incubator-carbondata pull request #212: [CARBONDATA-285] Use path parameter ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/212#discussion_r83146927
  
--- Diff: 
integration/spark/src/main/scala/org/apache/spark/sql/CarbonDatasourceRelation.scala
 ---
@@ -55,18 +55,11 @@ class CarbonSource extends RelationProvider
   override def createRelation(
   sqlContext: SQLContext,
   parameters: Map[String, String]): BaseRelation = {
-// if path is provided we can directly create Hadoop relation. \
-// Otherwise create datasource relation
-parameters.get("path") match {
-  case Some(path) => CarbonDatasourceHadoopRelation(sqlContext, 
Array(path), parameters, None)
-  case _ =>
-val options = new CarbonOption(parameters)
-val tableIdentifier = options.tableIdentifier.split("""\.""").toSeq
-val identifier = tableIdentifier match {
-  case Seq(name) => TableIdentifier(name, None)
-  case Seq(db, name) => TableIdentifier(name, Some(db))
-}
-CarbonDatasourceRelation(identifier, None)(sqlContext)
+val options = new CarbonOption(parameters)
+if (sqlContext.isInstanceOf[CarbonContext]) {
--- End diff --

Sorry, yes, `carboncontext.load(path)` cannot work now, right?




[GitHub] incubator-carbondata pull request #230: [CARBONDATA-306]Add block size info ...

2016-10-12 Thread Zhangshunyu
Github user Zhangshunyu commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/230#discussion_r83139950
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
 ---
@@ -252,6 +252,9 @@ private static long getMaxOfBlockAndFileSize(long 
blockSize, long fileSize) {
 if (remainder > 0) {
   maxSize = maxSize + HDFS_CHECKSUM_LENGTH - remainder;
 }
+LOGGER.info("The configured block size is " + blockSize + " byte, " +
--- End diff --

@Jay357089 I think extracting ConvertByteToReadable into a method is a good 
idea, since it can be used in many logs, especially for analyzing performance.
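
A minimal sketch of such a helper, assuming a simple Byte/KB/MB breakdown; the method name follows the proposal in this thread, while the thresholds and formatting are illustrative:

```
// Sketch: convert a raw byte count into a human-readable string for logging.
private static String convertByteToReadable(long byteCount) {
  final long KB = 1024L;
  final long MB = KB * 1024L;
  if (byteCount >= MB) {
    return byteCount / MB + " MB";
  } else if (byteCount >= KB) {
    return byteCount / KB + " KB";
  } else {
    return byteCount + " Byte";
  }
}
```

The log line under discussion could then read, for example, `LOGGER.info("The configured block size is " + convertByteToReadable(blockSize))`.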




[GitHub] incubator-carbondata pull request #230: [CARBONDATA-306]Add block size info ...

2016-10-12 Thread Jay357089
Github user Jay357089 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/230#discussion_r83139600
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
 ---
@@ -252,6 +252,9 @@ private static long getMaxOfBlockAndFileSize(long 
blockSize, long fileSize) {
 if (remainder > 0) {
   maxSize = maxSize + HDFS_CHECKSUM_LENGTH - remainder;
 }
+LOGGER.info("The configured block size is " + blockSize + " byte, " +
--- End diff --

@jackylk Maybe I should extract the if...else part into a method called 
ConvertByteToReadable. What's your opinion?




[GitHub] incubator-carbondata pull request #232: [CARBONDATA-310]Fixed compilation fa...

2016-10-12 Thread foryou2030
GitHub user foryou2030 opened a pull request:

https://github.com/apache/incubator-carbondata/pull/232

[CARBONDATA-310]Fixed compilation failure when using spark 1.6.2

# Why raise this pr?
Compilation failed when using Spark 1.6.2, because a class was not found: 
AggregateExpression.
# How to solve?
Removing the import "import 
org.apache.spark.sql.catalyst.expressions.aggregate._" causes a compilation 
failure when using Spark 1.6.2, in which AggregateExpression was moved to the 
subpackage "aggregate". So it needs to be changed back.

Thanks for the reminder, @harperjiang

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/foryou2030/incubator-carbondata agg_ex

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/232.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #232


commit ee4f6832d893c6ac99e1694b607b6f2d38ec9231
Author: foryou2030 
Date:   2016-10-13T03:17:38Z

fix compile on spark1.6.2






[GitHub] incubator-carbondata pull request #231: [CARBONDATA-311]Log the data size of...

2016-10-12 Thread Zhangshunyu
GitHub user Zhangshunyu opened a pull request:

https://github.com/apache/incubator-carbondata/pull/231

[CARBONDATA-311]Log the data size of blocklet during data load.

## Why raise this pr?
The blocklet size is an important parameter for analyzing data loading and 
queries, so this information should be logged.
## How to test?
Pass all the test cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Zhangshunyu/incubator-carbondata logblocklet

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/231.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #231


commit a110504f58e688e42223e896f7a1cf729463cf9d
Author: Zhangshunyu 
Date:   2016-10-13T03:17:21Z

Log the data size of each blocklet






[jira] [Created] (CARBONDATA-311) Log the data size of blocklet during data load.

2016-10-12 Thread zhangshunyu (JIRA)
zhangshunyu created CARBONDATA-311:
--

 Summary: Log the data size of blocklet during data load.
 Key: CARBONDATA-311
 URL: https://issues.apache.org/jira/browse/CARBONDATA-311
 Project: CarbonData
  Issue Type: Improvement
Affects Versions: 0.1.1-incubating
Reporter: zhangshunyu
Assignee: zhangshunyu
Priority: Minor
 Fix For: 0.2.0-incubating


Log the data size of blocklet during data load.





[jira] [Created] (CARBONDATA-312) Unify two datasource: CarbonDatasourceHadoopRelation and CarbonDatasourceRelation

2016-10-12 Thread Jacky Li (JIRA)
Jacky Li created CARBONDATA-312:
---

 Summary: Unify two datasource: CarbonDatasourceHadoopRelation and 
CarbonDatasourceRelation
 Key: CARBONDATA-312
 URL: https://issues.apache.org/jira/browse/CARBONDATA-312
 Project: CarbonData
  Issue Type: Sub-task
Reporter: Jacky Li


Take CarbonDatasourceHadoopRelation as the target datasource definition; after 
that, CarbonContext can use the standard datasource strategy.

In this issue, the following changes are required:
1. Move the dictionary strategy out of CarbonTableScan and make a separate 
strategy for it.
2. CarbonDatasourceHadoopRelation should use CarbonScanRDD in the buildScan 
function.





Discussion: shall CI support running carbondata on multiple Spark versions?

2016-10-12 Thread zhujin
One issue:
I modified the spark.version in pom.xml to use Spark 1.6.2, and then 
compilation failed.


Root cause:
There was an "unused import statement" warning in the CarbonOptimizer class 
before; we imported AggregateExpression with the following:
import org.apache.spark.sql.catalyst.expressions.aggregate._
import org.apache.spark.sql.catalyst.expressions._
But in Spark 1.6.2, AggregateExpression was moved from "expressions" (its 
package in Spark 1.5.2) to the subpackage "aggregate".
So, not knowing about this change, we removed the import "import 
org.apache.spark.sql.catalyst.expressions.aggregate._", which causes a 
compilation failure when using Spark 1.6.2.


Question:
So maybe the CI should verify carbondata against different Spark versions; 
that would be more helpful for checking the correctness of commits. Shall we?

[jira] [Created] (CARBONDATA-309) Support two types of ReadSupport in CarbonRecordReader

2016-10-12 Thread Jacky Li (JIRA)
Jacky Li created CARBONDATA-309:
---

 Summary: Support two types of ReadSupport in CarbonRecordReader
 Key: CARBONDATA-309
 URL: https://issues.apache.org/jira/browse/CARBONDATA-309
 Project: CarbonData
  Issue Type: Sub-task
Reporter: Jacky Li


CarbonRecordReader should support late decode based on the passed Configuration.
A config indicating late decode needs to be added in CarbonInputFormat for this 
purpose.
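
A sketch of how such a flag could be wired through the Hadoop `Configuration`; the property key and helper class below are assumptions for illustration, not the actual CarbonInputFormat API:

```
import org.apache.hadoop.conf.Configuration;

// Sketch only: the property key is hypothetical.
public final class LateDecodeConfig {

  public static final String CARBON_LATE_DECODE = "carbon.input.late.decode";

  private LateDecodeConfig() { }

  // Called by a query planner that wants dictionary-encoded values back.
  public static void enableLateDecode(Configuration conf) {
    conf.setBoolean(CARBON_LATE_DECODE, true);
  }

  // Called by CarbonRecordReader to choose between the two ReadSupport types.
  public static boolean isLateDecode(Configuration conf) {
    return conf.getBoolean(CARBON_LATE_DECODE, false);
  }
}
```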





[jira] [Created] (CARBONDATA-308) Support multiple segment in CarbonHadoopFSRDD

2016-10-12 Thread Jacky Li (JIRA)
Jacky Li created CARBONDATA-308:
---

 Summary: Support multiple segment in CarbonHadoopFSRDD
 Key: CARBONDATA-308
 URL: https://issues.apache.org/jira/browse/CARBONDATA-308
 Project: CarbonData
  Issue Type: Sub-task
Reporter: Jacky Li








[jira] [Created] (CARBONDATA-307) Support full functionality in CarbonInputFormat

2016-10-12 Thread Jacky Li (JIRA)
Jacky Li created CARBONDATA-307:
---

 Summary: Support full functionality in CarbonInputFormat
 Key: CARBONDATA-307
 URL: https://issues.apache.org/jira/browse/CARBONDATA-307
 Project: CarbonData
  Issue Type: Improvement
  Components: spark-integration
Affects Versions: 0.1.0-incubating
Reporter: Jacky Li
 Fix For: 0.2.0-incubating


Currently, there are two read paths in the carbon-spark module:
1. CarbonContext => CarbonDatasourceRelation => CarbonScanRDD => QueryExecutor
In this case, CarbonScanRDD uses CarbonInputFormat to get the splits and uses 
QueryExecutor for the scan.

2. SqlContext => CarbonDatasourceHadoopRelation => CarbonHadoopFSRDD => 
CarbonRecordReader
In this case, CarbonHadoopFSRDD uses CarbonInputFormat to both get the splits 
and do the scan.

This creates unnecessary duplicate code; the two paths need to be unified.






[GitHub] incubator-carbondata pull request #223: [CARBONDATA-292] add infomation for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/223#discussion_r83133189
  
--- Diff: docs/DML-Operations-on-Carbon.md ---
@@ -104,8 +109,10 @@ Following are the options that can be used in load 
data:
  'MULTILINE'='true', 'ESCAPECHAR'='\', 
  'COMPLEX_DELIMITER_LEVEL_1'='$', 
  'COMPLEX_DELIMITER_LEVEL_2'=':',
- 
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary'
+ 
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary',
+ 
'COLUMNDICT'='empno:/dictFilePath/empnoDict.csv, 
empname:/dictFilePath/empnameDict.csv'
--- End diff --

No, I mean just delete it from the Example section. 
And that note should be added in the option explanation section.




[GitHub] incubator-carbondata pull request #223: [CARBONDATA-292] add infomation for ...

2016-10-12 Thread Jay357089
Github user Jay357089 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/223#discussion_r83132221
  
--- Diff: docs/DML-Operations-on-Carbon.md ---
@@ -104,8 +109,10 @@ Following are the options that can be used in load 
data:
  'MULTILINE'='true', 'ESCAPECHAR'='\', 
  'COMPLEX_DELIMITER_LEVEL_1'='$', 
  'COMPLEX_DELIMITER_LEVEL_2'=':',
- 
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary'
+ 
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary',
+ 
'COLUMNDICT'='empno:/dictFilePath/empnoDict.csv, 
empname:/dictFilePath/empnameDict.csv'
--- End diff --

I have given a note below. If that is not enough, should I delete this 
option or close this PR?




[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83130319
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
+  CarbonDataLoadingException;
+
+  /**
+   * Tranform the data as per the implemetation.
+   * @return Iterator of data
+   * @throws CarbonDataLoadingException
+   */
+  Iterator execute() throws CarbonDataLoadingException;
+
+  /**
+   * Any closing of resources after step execution can be done here.
+   */
+  void finish();
--- End diff --

It should be called in both failure and success cases, so I will rename it 
to `close`.




[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83130123
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
--- End diff --

ok




[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83129418
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java
 ---
@@ -0,0 +1,185 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import org.apache.carbondata.core.carbon.AbsoluteTableIdentifier;
+
+public class CarbonDataLoadConfiguration {
--- End diff --

OK. I will keep the configuration class, but keep only the important and 
global configurations in it; the remaining configurations we can move to a 
`Map` kept inside the configuration itself. What do you say?




[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83129008
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java
 ---
@@ -0,0 +1,185 @@
+package org.apache.carbondata.processing.newflow;
--- End diff --

ok




Re: Discussion(New feature) regarding single pass data loading solution.

2016-10-12 Thread Qingqing Zhou
On Tue, Oct 11, 2016 at 2:32 AM, Ravindra Pesala 
wrote:
> Currently data is loading to carbon in 2 pass/jobs
>  1. Generating global dictionary using spark job.

Do we have local dictionaries? If not, what if the column has many
distinct values - will the big dictionary be loaded into memory?

Regards,
Qingqing


[GitHub] incubator-carbondata pull request #212: [CARBONDATA-285] Use path parameter ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/212#discussion_r83123943
  
--- Diff: 
integration/spark/src/main/scala/org/apache/spark/sql/CarbonDatasourceRelation.scala
 ---
@@ -55,18 +55,11 @@ class CarbonSource extends RelationProvider
   override def createRelation(
   sqlContext: SQLContext,
   parameters: Map[String, String]): BaseRelation = {
-// if path is provided we can directly create Hadoop relation. \
-// Otherwise create datasource relation
-parameters.get("path") match {
-  case Some(path) => CarbonDatasourceHadoopRelation(sqlContext, 
Array(path), parameters, None)
-  case _ =>
-val options = new CarbonOption(parameters)
-val tableIdentifier = options.tableIdentifier.split("""\.""").toSeq
-val identifier = tableIdentifier match {
-  case Seq(name) => TableIdentifier(name, None)
-  case Seq(db, name) => TableIdentifier(name, Some(db))
-}
-CarbonDatasourceRelation(identifier, None)(sqlContext)
+val options = new CarbonOption(parameters)
+if (sqlContext.isInstanceOf[CarbonContext]) {
--- End diff --

There is no `load` method on dataframe, only in the context class.




Re: Discussion(New feature) regarding single pass data loading solution.

2016-10-12 Thread Aniket Adnaik
Hi Ravi,

1. I agree with Jihong that creation of the global dictionary should be
optional, so that it can be disabled to improve load performance. The user
should be made aware that using a global dictionary may boost query
performance.
2. We should have a generic interface to manage the global dictionary when it
comes from external sources. In general, it is not a good idea to depend on
too many external tools.
3. Maybe we should allow the user to generate the global dictionary separately
through a SQL command or similar, something like a materialized view. This
means carbon should avoid using the local dictionary and do late
materialization when a global dictionary is present.
4. Maybe we should think of some ways to create the global dictionary lazily
as we serve SELECT queries. The implementation may not be that
straightforward; not sure if it's worth the effort.

Best Regards,
Aniket


On Tue, Oct 11, 2016 at 7:59 PM, Jihong Ma  wrote:

>
> A rather straightforward option is to allow the user to supply a global
> dictionary generated somewhere else, or to build a separate tool just for
> generating and updating the dictionary. Then the normal data loading process
> will encode columns with a local dictionary if none is supplied. This should
> cover the majority of cases for low-to-medium cardinality columns. For the
> cases where we have to incorporate online dictionary updates, a lock
> mechanism to sync up should serve the purpose.
>
> In other words, generating the global dictionary is an optional step, only
> triggered when needed, not a default step as it is currently.
>
> Jihong
>
> -Original Message-
> From: Ravindra Pesala [mailto:ravi.pes...@gmail.com]
> Sent: Tuesday, October 11, 2016 2:33 AM
> To: dev
> Subject: Discussion(New feature) regarding single pass data loading
> solution.
>
> Hi All,
>
> This discussion is regarding a single-pass data load solution.
>
> Currently data is loaded into carbon in 2 passes/jobs:
>  1. Generating the global dictionary using a Spark job.
>  2. Encoding the data with dictionary values and creating carbondata files.
> This 2-pass solution has many disadvantages: it needs to read the data
> twice in the case of CSV input, or execute the dataframe twice if data is
> loaded from a dataframe.
>
> In order to overcome these issues of 2-pass data loading, we can do
> single-pass data loading; the following are the alternative solutions.
>
> Use a local dictionary
>  Use a local dictionary for each carbondata file while loading data, but it
> may lead to query performance degradation and a larger memory footprint.
>
> Use a KV store/distributed map.
> *HBase/Cassandra cluster:*
>   Dictionary data would be stored in the KV store, which generates the
> dictionary value if it is not already present. We all know the pros/cons of
> HBase, but the following are a few.
>   Pros: These are Apache licensed.
>  Easy to implement storing/retrieving dictionary values.
>  Performance needs to be evaluated.
>
>   Cons: Need to maintain a separate cluster for the global dictionary.
>
> *Hazelcast distributed map:*
>   Dictionary data could be saved in Hazelcast's distributed concurrent hash
> map. It is an in-memory map, partitioned across the number of nodes. We can
> even maintain backups using its sync/async functionality to avoid data loss
> when an instance goes down. We do not need to maintain a separate cluster
> for it, as it can run inside the executor JVMs.
>   Pros: It is Apache licensed.
> No need to maintain a separate cluster, as instances can run in the
> executor JVMs.
> Easy to implement storing/retrieving dictionary values.
> It is a pure Java implementation.
> There is no master/slave concept and no single point of failure.
>
>   Cons: Performance needs to be evaluated.
>
> *Redis distributed map:*
> It is also an in-memory map, but it is written in C, so we need Java
> client libraries to interact with Redis. A separate cluster needs to be
> maintained for it. It can also partition the data.
>   Pros: More feature-rich than Hazelcast.
>  Easy to implement storing/retrieving dictionary values.
>   Cons: Need to maintain a separate cluster for the global dictionary.
>  May not be suitable for the big data stack.
>  It is BSD licensed (not sure whether we can use it or not).
>   Online performance figures say it is a little slower than Hazelcast.
>
> Please let me know which would best fit our loading solution, and please
> add any other suitable solution if I missed one.
> --
> Thanks & Regards,
> Ravi
>
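
Whichever store is chosen, the core operation is an atomic get-or-assign of a surrogate key per distinct value. A minimal sketch of that pattern, using a `ConcurrentHashMap` as a single-JVM stand-in for the distributed map (Hazelcast's `IMap` offers similar, but distributed, semantics):

```
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: atomic get-or-assign of dictionary surrogate keys.
// The ConcurrentHashMap stands in for the distributed map discussed above.
public final class DictionaryGenerator {

  private final ConcurrentMap<String, Integer> dictionary = new ConcurrentHashMap<>();
  private final AtomicInteger nextKey = new AtomicInteger(0);

  // Returns the existing surrogate key for `value`, or generates a new one.
  public int getOrGenerateKey(String value) {
    // computeIfAbsent guarantees exactly one key per distinct value,
    // even under concurrent loads.
    return dictionary.computeIfAbsent(value, v -> nextKey.incrementAndGet());
  }
}
```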


Re: [Discussion] Code generation in carbon result preparation

2016-10-12 Thread Jacky Li
Hi Vishal,

Which part of the preparation are you considering? The column stitching on the 
executor side?

Regards,
Jacky

> On 12 October 2016, at 21:24, Kumar Vishal wrote:
> 
> Hi All,
> Currently we are preparing the final result row-wise. When the number of
> columns in the projection list is high (around 80 columns), mainly measure
> columns or no-dictionary columns, a lot of CPU cache invalidation happens,
> and this slows down query performance.
> 
> *I can think of two solutions for this problem.*
> *Solution 1*. Fill column data vertically; currently it is filled
> horizontally (this may not solve all the problems).
> *Solution 2*. Use code generation for result preparation.
> 
> This is an initial idea.
> 
> -Regards
> Kumar Vishal
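
To make Solution 1 concrete, a sketch contrasting the fill order; `ColumnReader` and the array layout are illustrative stand-ins, not the actual CarbonData result-preparation code:

```
// Sketch: fill result data column by column (vertically) instead of row by
// row, so the inner loop walks one column's storage contiguously and stays
// friendly to the CPU cache.
static Object[][] fillVertically(int rowCount, ColumnReader[] columns) {
  Object[][] result = new Object[columns.length][rowCount];
  for (int col = 0; col < columns.length; col++) {
    for (int row = 0; row < rowCount; row++) {
      result[col][row] = columns[col].read(row);
    }
  }
  return result;
}

// Hypothetical stand-in for the per-column data source.
interface ColumnReader {
  Object read(int row);
}
```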





[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

2016-10-12 Thread lion-x
Github user lion-x commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/200#discussion_r83039531
  
--- Diff: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/dataload/TestDataLoadWithTrimOption.scala
 ---
@@ -0,0 +1,78 @@
+package org.apache.carbondata.spark.testsuite.dataload
+
+import java.io.File
+
+import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.util.CarbonProperties
+import org.apache.spark.sql.common.util.CarbonHiveContext._
+import org.apache.spark.sql.common.util.QueryTest
+import org.scalatest.BeforeAndAfterAll
+import org.apache.spark.sql.Row
+
+/**
+  * Created by x00381807 on 2016/9/26.
--- End diff --

Oh, my fault




[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83032371
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
--- End diff --

If there is an abstract class, it can have the child as its member variable; 
then this `initialize` function takes no child parameter as input.
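
A sketch of that suggestion; the class name is an assumption, and it presumes the interface's `initialize` signature is changed to drop the child parameter:

```
// Sketch: an abstract step keeps its child as a member, so initialize()
// no longer needs the child passed in.
public abstract class AbstractDataLoadProcessorStep implements DataLoadProcessorStep {

  protected final DataLoadProcessorStep child;
  protected CarbonDataLoadConfiguration configuration;

  protected AbstractDataLoadProcessorStep(DataLoadProcessorStep child) {
    this.child = child;
  }

  public void initialize(CarbonDataLoadConfiguration configuration)
      throws CarbonDataLoadingException {
    this.configuration = configuration;
  }
}
```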




[GitHub] incubator-carbondata pull request #205: [CARBONDATA-281]Vt enhancement for t...

2016-10-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-carbondata/pull/205




[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83031958
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
--- End diff --

I think each implementation of this interface has similar logic in the 
execute function; can we create an abstract class to implement the common 
logic? The common logic would be like:
```
Iterator<Object[]> execute() throws CarbonDataLoadingException {
  final Iterator<Object[]> childIter = child.execute();
  return new Iterator<Object[]>() {
    public boolean hasNext() {
      return childIter.hasNext();
    }
    public Object[] next() {
      // processInput is the abstract method in this class
      return processInput(childIter.next());
    }
  };
}
```




[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83028336
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java
 ---
@@ -0,0 +1,185 @@
+package org.apache.carbondata.processing.newflow;
--- End diff --

add license header




[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83033524
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow.iterators;
+
+import java.io.IOException;
+
+import org.apache.carbondata.common.CarbonIterator;
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+
+import org.apache.hadoop.mapred.RecordReader;
+
+/**
+ * This iterator iterates RecordReader.
+ */
+public class RecordReaderIterator extends CarbonIterator {
--- End diff --

what is it used for?




[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83033298
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow.iterators;
+
+import java.io.IOException;
+
+import org.apache.carbondata.common.CarbonIterator;
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+
+import org.apache.hadoop.mapred.RecordReader;
+
+/**
+ * This iterator iterates RecordReader.
+ */
+public class RecordReaderIterator extends CarbonIterator {
--- End diff --

Why is this in the carbon-processing module and not the carbon-hadoop module?




[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83032703
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
+  CarbonDataLoadingException;
+
+  /**
+   * Tranform the data as per the implemetation.
+   * @return Iterator of data
+   * @throws CarbonDataLoadingException
+   */
+  Iterator execute() throws CarbonDataLoadingException;
+
+  /**
+   * Any closing of resources after step execution can be done here.
+   */
+  void finish();
--- End diff --

This is called when the step successfully finished. But what about failure 
case, should there be a 
`void close();` 
interface for failure case?




[GitHub] incubator-carbondata pull request #230: [CARBONDATA-306]Add block size info ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/230#discussion_r83028164
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
 ---
@@ -252,6 +252,9 @@ private static long getMaxOfBlockAndFileSize(long 
blockSize, long fileSize) {
 if (remainder > 0) {
   maxSize = maxSize + HDFS_CHECKSUM_LENGTH - remainder;
 }
+LOGGER.info("The configured block size is " + blockSize + " byte, " +
--- End diff --

Suggest converting `blockSize` to a human-readable value before logging it; 
otherwise it is hard for a person to check this value.




[GitHub] incubator-carbondata pull request #230: [CARBONDATA-306]Add block size info ...

2016-10-12 Thread Zhangshunyu
Github user Zhangshunyu commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/230#discussion_r83027603
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
 ---
@@ -252,6 +252,9 @@ private static long getMaxOfBlockAndFileSize(long 
blockSize, long fileSize) {
 if (remainder > 0) {
   maxSize = maxSize + HDFS_CHECKSUM_LENGTH - remainder;
 }
+LOGGER.info("The configured block size is " + blockSize + " byte, " +
--- End diff --

@jackylk It is set in MB, but here it has already been converted to bytes.




[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/200#discussion_r82523962
  
--- Diff: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/dataload/TestDataLoadWithTrimOption.scala
 ---
@@ -0,0 +1,78 @@
+package org.apache.carbondata.spark.testsuite.dataload
+
+import java.io.File
+
+import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.util.CarbonProperties
+import org.apache.spark.sql.common.util.CarbonHiveContext._
+import org.apache.spark.sql.common.util.QueryTest
+import org.scalatest.BeforeAndAfterAll
+import org.apache.spark.sql.Row
+
+/**
+  * Created by x00381807 on 2016/9/26.
--- End diff --

please remove




[GitHub] incubator-carbondata pull request #230: [CARBONDATA-306]Add block size info ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/230#discussion_r83021340
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
 ---
@@ -252,6 +252,9 @@ private static long getMaxOfBlockAndFileSize(long 
blockSize, long fileSize) {
 if (remainder > 0) {
   maxSize = maxSize + HDFS_CHECKSUM_LENGTH - remainder;
 }
+LOGGER.info("The configured block size is " + blockSize + " byte, " +
--- End diff --

Is `blockSize` in bytes or MB?




[GitHub] incubator-carbondata pull request #204: [CARBONDATA-280]Fix the bug that whe...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/204#discussion_r83027008
  
--- Diff: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/deleteTable/TestDeleteTableNewDDL.scala
 ---
@@ -97,7 +97,7 @@ class TestDeleteTableNewDDL extends QueryTest with 
BeforeAndAfterAll {
   "CREATE table CaseInsensitiveTable (ID int, date String, country 
String, name " +
   "String," +
   "phonetype String, serialname String, salary int) stored by 
'org.apache.carbondata.format'" +
-  "TBLPROPERTIES('DICTIONARY_INCLUDE'='ID', 
'DICTIONARY_INCLUDE'='salary')"
+  "TBLPROPERTIES('DICTIONARY_INCLUDE'='ID,salary')"
--- End diff --

add space after `,`




[GitHub] incubator-carbondata pull request #207: [CARBONDATA-283] VT enhancement for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/207#discussion_r83025827
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/util/CarbonTableStatusUtil.java
 ---
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.carbondata.processing.util;
+
+import java.text.SimpleDateFormat;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Date;
+import java.util.List;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.CarbonCommonConstants;
+import org.apache.carbondata.core.load.LoadMetadataDetails;
+
+/**
+ * This class contains all table status file utilities
+ */
+public final class CarbonTableStatusUtil {
+  private static final LogService LOGGER =
+  
LogServiceFactory.getLogService(CarbonTableStatusUtil.class.getName());
+
+  private CarbonTableStatusUtil() {
+
+  }
+
+  /**
+   * updates table status details using latest metadata
+   *
+   * @param oldMetadata
+   * @param newMetadata
+   * @return
+   */
+
+  public static List<LoadMetadataDetails> updateLatestTableStatusDetails(
+      LoadMetadataDetails[] oldMetadata, LoadMetadataDetails[] newMetadata) {
+
+    List<LoadMetadataDetails> newListMetadata =
+        new ArrayList<LoadMetadataDetails>(Arrays.asList(newMetadata));
+for (LoadMetadataDetails oldSegment : oldMetadata) {
+  if 
(CarbonCommonConstants.MARKED_FOR_DELETE.equalsIgnoreCase(oldSegment.getLoadStatus()))
 {
+
updateSegmentMetadataDetails(newListMetadata.get(newListMetadata.indexOf(oldSegment)));
+  }
+}
+return newListMetadata;
+  }
+
+  /**
+   * returns current time
+   *
+   * @return
+   */
+  private static String readCurrentTime() {
+SimpleDateFormat sdf = new 
SimpleDateFormat(CarbonCommonConstants.CARBON_TIMESTAMP);
+String date = null;
+
+date = sdf.format(new Date());
+
+return date;
+  }
+
+  /**
+   * updates segment status and modificaton time details
+   *
+   * @param loadMetadata
+   */
+  public static void updateSegmentMetadataDetails(LoadMetadataDetails 
loadMetadata) {
--- End diff --

Can you improve the function name so that it depicts the behavior of this 
function?




[GitHub] incubator-carbondata pull request #223: [CARBONDATA-292] add infomation for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/223#discussion_r83022245
  
--- Diff: docs/DML-Operations-on-Carbon.md ---
@@ -104,8 +109,10 @@ Following are the options that can be used in load 
data:
  'MULTILINE'='true', 'ESCAPECHAR'='\', 
  'COMPLEX_DELIMITER_LEVEL_1'='$', 
  'COMPLEX_DELIMITER_LEVEL_2'=':',
- 
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary'
+ 
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary',
+ 
'COLUMNDICT'='empno:/dictFilePath/empnoDict.csv, 
empname:/dictFilePath/empnameDict.csv'
--- End diff --

Do not give this option here, since it cannot be used together with 
`ALL_DICTIONARY_PATH`.




[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83018479
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
+  CarbonDataLoadingException;
+
+  /**
+   * Tranform the data as per the implemetation.
+   * @return Iterator of data
+   * @throws CarbonDataLoadingException
+   */
+  Iterator execute() throws CarbonDataLoadingException;
--- End diff --

I think `execute()` is called for every parallel unit of the input, right? 
For example, when using Spark to load from a dataframe, `execute()` is called 
for every Spark partition (one task executes one partition). When loading from 
a CSV file on HDFS, `execute()` is called for every HDFS block. So I do not 
think returning an array of iterators is required.

In the carbon loading process on every executor, some of the steps can be 
parallelized, but the sort step needs to be synchronized (a potential 
bottleneck), since we need datanode-scope sorting. Am I correct?




[GitHub] incubator-carbondata pull request #218: [CARBONDATA-288] In hdfs bad record ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/218#discussion_r83014871
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/surrogatekeysgenerator/csvbased/BadRecordslogger.java
 ---
@@ -69,9 +68,13 @@
   private BufferedWriter bufferedCSVWriter;
   private DataOutputStream outCSVStream;
   /**
-   *
+   * bad record log file path
+   */
+  private String logFilePath;
+  /**
+   * csv file path
*/
-  private CarbonFile logFile;
+  private String csvFilePath;
--- End diff --

What is this CSV file? How does it differ from `logFilePath`?




[GitHub] incubator-carbondata pull request #218: [CARBONDATA-288] In hdfs bad record ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/218#discussion_r83015590
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/surrogatekeysgenerator/csvbased/CarbonCSVBasedSeqGenStep.java
 ---
@@ -458,9 +462,11 @@ public boolean processRow(StepMetaInterface smi, 
StepDataInterface sdi) throws K
   break;
 case REDIRECT:
   badRecordsLogRedirect = true;
+  badRecordConvertNullDisable= true;
--- End diff --

add space before `=`




[GitHub] incubator-carbondata pull request #218: [CARBONDATA-288] In hdfs bad record ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/218#discussion_r83014256
  
--- Diff: 
integration/spark/src/main/java/org/apache/carbondata/spark/load/CarbonLoadModel.java
 ---
@@ -117,9 +117,9 @@
   private String badRecordsLoggerEnable;
 
   /**
-   * defines the option to specify the bad record log redirect to raw csv
+   * defines the option to specify the bad record logger action
*/
-  private String badRecordsLoggerRedirect;
+  private String badRecordsLoggerAction;
--- End diff --

This action is not for the logger, right? Perhaps `badRecordsAction` is a 
better name?
And it should be an enum instead of a String.




[GitHub] incubator-carbondata pull request #227: [CARBONDATA-304] Fixed data loading ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/227#discussion_r83012391
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
 ---
@@ -197,8 +197,9 @@ public AbstractFactDataWriter(String storeLocation, int 
measureCount, int mdKeyL
 blockIndexInfoList = new ArrayList<>();
 // get max file size;
 CarbonProperties propInstance = CarbonProperties.getInstance();
-this.fileSizeInBytes = blocksize * 
CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR
-* CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR * 1L;
+// if blocksize=2048, then 2048*1024*1024 will beyond the range of Int
+this.fileSizeInBytes = 1L * blocksize * 
CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR
--- End diff --

Instead of multiplying by `1L`, you can just cast `blocksize` with 
`(long) blocksize` instead.




[GitHub] incubator-carbondata pull request #224: [CARBONDATA-239]Add scan_blocklet_nu...

2016-10-12 Thread sujith71955
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/224#discussion_r82983904
  
--- Diff: 
core/src/main/java/org/apache/carbondata/scan/scanner/impl/FilterScanner.java 
---
@@ -78,10 +80,11 @@ public FilterScanner(BlockExecutionInfo 
blockExecutionInfo) {
* @throws QueryExecutionException
* @throws FilterUnsupportedException
*/
-  @Override public AbstractScannedResult scanBlocklet(BlocksChunkHolder 
blocksChunkHolder)
+  @Override public AbstractScannedResult scanBlocklet(BlocksChunkHolder 
blocksChunkHolder,
+  QueryStatisticsModel 
queryStatisticsModel)
   throws QueryExecutionException {
 try {
-  fillScannedResult(blocksChunkHolder);
+  fillScannedResult(blocksChunkHolder, queryStatisticsModel);
--- End diff --

Pass the model in the constructor so that there is no need to change all the APIs.




[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

2016-10-12 Thread sujith71955
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/200#discussion_r82977592
  
--- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
@@ -102,8 +102,8 @@ public void initialize() throws IOException {
parserSettings.setMaxColumns(
getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns()));
 parserSettings.setNullValue("");
-parserSettings.setIgnoreLeadingWhitespaces(false);
-parserSettings.setIgnoreTrailingWhitespaces(false);
+parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
--- End diff --

The pro of this approach: suppose in one load the user loads dirty data and 
then realizes it should have been trimmed; in the next load he can simply 
enable the option and load the data. The downside is that the untrimmed values 
already loaded will increase the dictionary space, and the dictionary lookup 
overhead during queries will increase as well.




Re: Discussion regarding design of data load after kettle removal.

2016-10-12 Thread Ravindra Pesala
Hi Jacky,

1. Yes. It is better to keep all the sorting logic in one step so that other
types of sorts can be implemented easily. I will update the design.

2. EncoderProcessorStep can do dictionary encoding and convert no-dictionary
and complex types to a byte[] representation. The encoding interface is
flexible, so the user can supply a different encoding representation, but at
row level only. RLE, DELTA, and heavy compression are done only in
DataWriterProcessorStep, because these encodings/compressions happen at
blocklet level, not row level.

3. Yes, each step requires a schema definition; it is passed as DataField[]
through the configuration to the initial step, InputProcessorStep. The
remaining steps can call child.getOutput() to get the schema. Each DataField
represents one column.
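
To make point 3 concrete, here is a rough sketch of how a middle step could
chain to its child. The method shapes follow the proposed DataLoadProcessorStep
interface; the body is only an assumption, not the final design:

    public class EncoderProcessorStep implements DataLoadProcessorStep {
      private DataLoadProcessorStep child;

      // The output schema of this step; each DataField describes one column.
      @Override public DataField[] getOutput() {
        return child.getOutput();
      }

      @Override public void intialize(CarbonDataLoadConfiguration configuration,
          DataLoadProcessorStep child) throws CarbonDataLoadingException {
        this.child = child;
      }

      @Override public Iterator execute() throws CarbonDataLoadingException {
        // wrap the child's iterator and encode each row (dictionary lookup,
        // byte[] conversion for no-dictionary and complex columns)
        return child.execute();  // row-level encoding omitted in this sketch
      }

      @Override public void finish() {
        child.finish();
      }
    }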

Regards,
Ravi

On 12 October 2016 at 09:38, Jacky Li  wrote:

> Hi Ravindra,
>
> Regarding the design
> (https://drive.google.com/file/d/0B4TWTVbFSTnqTF85anlDOUQ5S1BqYzFpLWcwZnBLSVVqSWpj/view),
> I have following question:
>
> 1. In SortProcessorStep, I think it is better to include MergeSort in this
> step also, so that it contains all the logic for sorting. In that case, a
> developer can implement an external sort (spilling to files only if
> necessary), and the loading process becomes an on-line sort when memory is
> sufficient. I think it will improve loading performance a lot.
>
> 2. In EncoderProcessorStep, apart from the dictionary encoding, what other
> processing will it do? How about delta, RLE, etc.?
>
> 3. InputProcessorStep needs some schema definition to parse the input and
> convert it to rows, right? For example, how would it read from JSON or
> AVRO files?
>
> Regards,
> Jacky
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Discussion-
> regrading-design-of-data-load-after-kettle-removal-tp1672p1783.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

2016-10-12 Thread sujith71955
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/200#discussion_r82968468
  
--- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
@@ -102,8 +102,8 @@ public void initialize() throws IOException {
parserSettings.setMaxColumns(
getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns()));
 parserSettings.setNullValue("");
-parserSettings.setIgnoreLeadingWhitespaces(false);
-parserSettings.setIgnoreTrailingWhitespaces(false);
+parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
--- End diff --

Guys, I think that if we read from the configuration during data loading to 
decide whether to trim or not, we need to do the same while filtering, 
deciding based on the same configuration value.
For example: if I enabled the trim property while loading, the system will 
trim and load the data; then in a filter query, if the user provides 
whitespace, it needs to be trimmed while creating the filter model as well. 
This will make the system more consistent. Conversely, if the user does not 
enable trim, we won't trim it either while loading or while querying.
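
A small sketch of the symmetric behaviour suggested above; the univocity 
setters are real, but the filter-side helper is hypothetical:

    import com.univocity.parsers.csv.CsvParserSettings;

    final class TrimConsistency {
      // Load side: one trim flag drives the CSV parser settings.
      static void configureParser(CsvParserSettings settings, boolean trimEnabled) {
        settings.setIgnoreLeadingWhitespaces(trimEnabled);
        settings.setIgnoreTrailingWhitespaces(trimEnabled);
      }

      // Query side: if values were trimmed during load, trim the filter
      // literal too so dictionary lookups match.
      static String normalizeFilterLiteral(String literal, boolean trimEnabled) {
        return trimEnabled ? literal.trim() : literal;
      }
    }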




[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

2016-10-12 Thread lion-x
Github user lion-x commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/200#discussion_r82960653
  
--- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
@@ -102,8 +102,8 @@ public void initialize() throws IOException {
parserSettings.setMaxColumns(
getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns()));
 parserSettings.setNullValue("");
-parserSettings.setIgnoreLeadingWhitespaces(false);
-parserSettings.setIgnoreTrailingWhitespaces(false);
+parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
--- End diff --

Hi Sujith,
I agree with Eason: when the user queries with a filter containing spaces, it 
should be considered a forbidden action.




[GitHub] incubator-carbondata pull request #224: [CARBONDATA-239]Add scan_blocklet_nu...

2016-10-12 Thread Zhangshunyu
Github user Zhangshunyu commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/224#discussion_r82958230
  
--- Diff: core/src/main/java/org/apache/carbondata/scan/processor/AbstractDataBlockIterator.java ---
@@ -127,11 +133,15 @@ protected boolean updateScanner() {
 }
   }
 
-  private AbstractScannedResult getNextScannedResult() throws QueryExecutionException {
+  private AbstractScannedResult getNextScannedResult(QueryStatisticsRecorder recorder,
--- End diff --

@sujith71955 OK, I will use a statistics model, thanks!

