Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/229#discussion_r83147018
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+ /**
+ * The output meta for this step. The data returns from this step is as
per this meta.
+ * @return
+ */
+ DataField[] getOutput();
+
+ /**
+ * Intialization process for this step.
+ * @param configuration
+ * @param child
+ * @throws CarbonDataLoadingException
+ */
+ void intialize(CarbonDataLoadConfiguration configuration,
DataLoadProcessorStep child) throws
+ CarbonDataLoadingException;
+
+ /**
+ * Tranform the data as per the implemetation.
+ * @return Iterator of data
+ * @throws CarbonDataLoadingException
+ */
+ Iterator
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/212#discussion_r83146927
--- Diff:
integration/spark/src/main/scala/org/apache/spark/sql/CarbonDatasourceRelation.scala
---
@@ -55,18 +55,11 @@ class CarbonSource extends RelationProvider
override def createRelation(
sqlContext: SQLContext,
parameters: Map[String, String]): BaseRelation = {
-// if path is provided we can directly create Hadoop relation. \
-// Otherwise create datasource relation
-parameters.get("path") match {
- case Some(path) => CarbonDatasourceHadoopRelation(sqlContext,
Array(path), parameters, None)
- case _ =>
-val options = new CarbonOption(parameters)
-val tableIdentifier = options.tableIdentifier.split("""\.""").toSeq
-val identifier = tableIdentifier match {
- case Seq(name) => TableIdentifier(name, None)
- case Seq(db, name) => TableIdentifier(name, Some(db))
-}
-CarbonDatasourceRelation(identifier, None)(sqlContext)
+val options = new CarbonOption(parameters)
+if (sqlContext.isInstanceOf[CarbonContext]) {
--- End diff --
Sorry, yes, `carboncontext.load(path)` cannot work now, right?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
Github user Zhangshunyu commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/230#discussion_r83139950
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
---
@@ -252,6 +252,9 @@ private static long getMaxOfBlockAndFileSize(long
blockSize, long fileSize) {
if (remainder > 0) {
maxSize = maxSize + HDFS_CHECKSUM_LENGTH - remainder;
}
+LOGGER.info("The configured block size is " + blockSize + " byte, " +
--- End diff --
@Jay357089 I think it is a good idea to extract `ConvertByteToReadable` as
a method, since it can be used in many logs, especially for analyzing
performance.
---
---
Github user Jay357089 commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/230#discussion_r83139600
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
---
@@ -252,6 +252,9 @@ private static long getMaxOfBlockAndFileSize(long
blockSize, long fileSize) {
if (remainder > 0) {
maxSize = maxSize + HDFS_CHECKSUM_LENGTH - remainder;
}
+LOGGER.info("The configured block size is " + blockSize + " byte, " +
--- End diff --
@jackylk Maybe I should extract the if..else part into a method called
`ConvertByteToReadable`. What's your opinion?
---
---
GitHub user foryou2030 opened a pull request:
https://github.com/apache/incubator-carbondata/pull/232
[CARBONDATA-310]Fixed compilation failure when using spark 1.6.2
# Why raise this PR?
Compilation failed when using Spark 1.6.2 because of a class-not-found error:
AggregateExpression
# How to solve?
Removing the import "import
org.apache.spark.sql.catalyst.expressions.aggregate._" causes a compilation
failure when using Spark 1.6.2, in which AggregateExpression was moved to the
subpackage "aggregate". So we need to change it back.
Thanks for your reminder, @harperjiang
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/foryou2030/incubator-carbondata agg_ex
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-carbondata/pull/232.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #232
commit ee4f6832d893c6ac99e1694b607b6f2d38ec9231
Author: foryou2030
Date: 2016-10-13T03:17:38Z
fix compile on spark1.6.2
---
---
GitHub user Zhangshunyu opened a pull request:
https://github.com/apache/incubator-carbondata/pull/231
[CARBONDATA-311]Log the data size of blocklet during data load.
## Why raise this PR?
The blocklet size is an important parameter for analyzing data load and
query performance, so this info should be logged.
## How to test?
Pass all the existing test cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Zhangshunyu/incubator-carbondata logblocklet
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-carbondata/pull/231.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #231
commit a110504f58e688e42223e896f7a1cf729463cf9d
Author: Zhangshunyu
Date: 2016-10-13T03:17:21Z
Log the data size of each blocklet
---
---
zhangshunyu created CARBONDATA-311:
--
Summary: Log the data size of blocklet during data load.
Key: CARBONDATA-311
URL: https://issues.apache.org/jira/browse/CARBONDATA-311
Project: CarbonData
Issue Type: Improvement
Affects Versions: 0.1.1-incubating
Reporter: zhangshunyu
Assignee: zhangshunyu
Priority: Minor
Fix For: 0.2.0-incubating
Log the data size of blocklet during data load.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Jacky Li created CARBONDATA-312:
---
Summary: Unify two datasource: CarbonDatasourceHadoopRelation and
CarbonDatasourceRelation
Key: CARBONDATA-312
URL: https://issues.apache.org/jira/browse/CARBONDATA-312
Project: CarbonData
Issue Type: Sub-task
Reporter: Jacky Li
Take CarbonDatasourceHadoopRelation as the target datasource definition; after
that, CarbonContext can use the standard datasource strategy.
In this issue, the following changes are required:
1. Move the dictionary strategy out of CarbonTableScan and make a separate
strategy for it.
2. CarbonDatasourceHadoopRelation should use CarbonScanRDD in the buildScan
function.
--
One issue:
I modified the spark.version in pom.xml to use Spark 1.6.2, and then compilation
failed.
Root cause:
There was an "unused import statement" warning in the CarbonOptimizer class
before; we imported AggregateExpression like the following:
import org.apache.spark.sql.catalyst.expressions.aggregate._
import org.apache.spark.sql.catalyst.expressions._
But in Spark 1.6.2, AggregateExpression was moved to the subpackage "aggregate"
from "expressions" (where it lives in Spark 1.5.2).
So if we did not know about this change and removed the import "import
org.apache.spark.sql.catalyst.expressions.aggregate._", it would cause a
compilation failure when using Spark 1.6.2.
Question:
So maybe the CI should verify CarbonData against different Spark versions; that
would be more helpful for checking the correctness of commits. Shall we?
Jacky Li created CARBONDATA-309:
---
Summary: Support two types of ReadSupport in CarbonRecordReader
Key: CARBONDATA-309
URL: https://issues.apache.org/jira/browse/CARBONDATA-309
Project: CarbonData
Issue Type: Sub-task
Reporter: Jacky Li
CarbonRecordReader should support late decode based on the passed Configuration.
A config indicating late decode needs to be added in CarbonInputFormat for this
purpose.
--
Jacky Li created CARBONDATA-308:
---
Summary: Support multiple segment in CarbonHadoopFSRDD
Key: CARBONDATA-308
URL: https://issues.apache.org/jira/browse/CARBONDATA-308
Project: CarbonData
Issue Type: Sub-task
Reporter: Jacky Li
--
Jacky Li created CARBONDATA-307:
---
Summary: Support full functionality in CarbonInputFormat
Key: CARBONDATA-307
URL: https://issues.apache.org/jira/browse/CARBONDATA-307
Project: CarbonData
Issue Type: Improvement
Components: spark-integration
Affects Versions: 0.1.0-incubating
Reporter: Jacky Li
Fix For: 0.2.0-incubating
Currently, there are two read paths in the carbon-spark module:
1. CarbonContext => CarbonDatasourceRelation => CarbonScanRDD => QueryExecutor
In this case, CarbonScanRDD uses CarbonInputFormat to get the splits and uses
QueryExecutor for the scan.
2. SqlContext => CarbonDatasourceHadoopRelation => CarbonHadoopFSRDD =>
CarbonRecordReader
In this case, CarbonHadoopFSRDD uses CarbonInputFormat to do both split
computation and the scan.
This creates unnecessary duplicate code; the two paths need to be unified.
--
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/223#discussion_r83133189
--- Diff: docs/DML-Operations-on-Carbon.md ---
@@ -104,8 +109,10 @@ Following are the options that can be used in load
data:
'MULTILINE'='true', 'ESCAPECHAR'='\',
'COMPLEX_DELIMITER_LEVEL_1'='$',
'COMPLEX_DELIMITER_LEVEL_2'=':',
-
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary'
+
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary',
+
'COLUMNDICT'='empno:/dictFilePath/empnoDict.csv,
empname:/dictFilePath/empnameDict.csv'
--- End diff --
No, I mean just delete it from the Example section.
And that note should be added in the option explanation section.
---
---
Github user Jay357089 commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/223#discussion_r83132221
--- Diff: docs/DML-Operations-on-Carbon.md ---
@@ -104,8 +109,10 @@ Following are the options that can be used in load
data:
'MULTILINE'='true', 'ESCAPECHAR'='\',
'COMPLEX_DELIMITER_LEVEL_1'='$',
'COMPLEX_DELIMITER_LEVEL_2'=':',
-
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary'
+
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary',
+
'COLUMNDICT'='empno:/dictFilePath/empnoDict.csv,
empname:/dictFilePath/empnameDict.csv'
--- End diff --
I have given a Note below. If that is not enough, should I delete this
option or close this PR?
---
---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/229#discussion_r83130319
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+ /**
+ * The output meta for this step. The data returns from this step is as
per this meta.
+ * @return
+ */
+ DataField[] getOutput();
+
+ /**
+ * Intialization process for this step.
+ * @param configuration
+ * @param child
+ * @throws CarbonDataLoadingException
+ */
+ void intialize(CarbonDataLoadConfiguration configuration,
DataLoadProcessorStep child) throws
+ CarbonDataLoadingException;
+
+ /**
+ * Tranform the data as per the implemetation.
+ * @return Iterator of data
+ * @throws CarbonDataLoadingException
+ */
+ Iterator
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/229#discussion_r83130123
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
--- End diff --
ok
---
---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/229#discussion_r83129418
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java
---
@@ -0,0 +1,185 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import org.apache.carbondata.core.carbon.AbsoluteTableIdentifier;
+
+public class CarbonDataLoadConfiguration {
--- End diff --
OK. I will keep the configuration class, but I will keep only the important and
global configurations in it; the remaining configurations we can move to a `Map`
kept inside the configuration itself. What do you say?
---
---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/229#discussion_r83129008
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java
---
@@ -0,0 +1,185 @@
+package org.apache.carbondata.processing.newflow;
--- End diff --
ok
---
---
On Tue, Oct 11, 2016 at 2:32 AM, Ravindra Pesala
wrote:
> Currently data is loading to carbon in 2 pass/jobs
> 1. Generating global dictionary using spark job.
Do we have local dictionaries? If not, what if the column has many
distinct values? Will the big dictionary be loaded into memory?
Regards,
Qingqing
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/212#discussion_r83123943
--- Diff:
integration/spark/src/main/scala/org/apache/spark/sql/CarbonDatasourceRelation.scala
---
@@ -55,18 +55,11 @@ class CarbonSource extends RelationProvider
override def createRelation(
sqlContext: SQLContext,
parameters: Map[String, String]): BaseRelation = {
-// if path is provided we can directly create Hadoop relation. \
-// Otherwise create datasource relation
-parameters.get("path") match {
- case Some(path) => CarbonDatasourceHadoopRelation(sqlContext,
Array(path), parameters, None)
- case _ =>
-val options = new CarbonOption(parameters)
-val tableIdentifier = options.tableIdentifier.split("""\.""").toSeq
-val identifier = tableIdentifier match {
- case Seq(name) => TableIdentifier(name, None)
- case Seq(db, name) => TableIdentifier(name, Some(db))
-}
-CarbonDatasourceRelation(identifier, None)(sqlContext)
+val options = new CarbonOption(parameters)
+if (sqlContext.isInstanceOf[CarbonContext]) {
--- End diff --
There is no `load` method in DataFrame, only in the context class.
---
---
Hi Ravi,
1. I agree with Jihong that creation of the global dictionary should be
optional, so that it can be disabled to improve load performance. The user
should be made aware that using a global dictionary may boost query
performance.
2. We should have a generic interface to manage the global dictionary when it
comes from external sources. In general, it is not a good idea to depend on too
many external tools.
3. Maybe we should allow the user to generate the global dictionary separately
through a SQL command or similar, something like a materialized view. This
means carbon should avoid using a local dictionary and do late
materialization when a global dictionary is present.
4. Maybe we should think of some ways to create the global dictionary lazily
as we serve SELECT queries. The implementation may not be that
straightforward. Not sure if it's worth the effort.
Best Regards,
Aniket
On Tue, Oct 11, 2016 at 7:59 PM, Jihong Ma wrote:
>
> A rather straightforward option is to allow the user to supply a global
> dictionary generated somewhere else, or we build a separate tool just for
> generating as well as updating the dictionary. Then the normal data loading
> process will encode columns with a local dictionary if none is supplied. This
> should cover the majority of cases for low-to-medium cardinality columns. For
> the cases where we have to incorporate online dictionary updates, a lock
> mechanism to sync up should serve the purpose.
>
> In other words, generating the global dictionary is an optional step, only
> triggered when needed, not a default step as we do currently.
>
> Jihong
>
> -Original Message-
> From: Ravindra Pesala [mailto:ravi.pes...@gmail.com]
> Sent: Tuesday, October 11, 2016 2:33 AM
> To: dev
> Subject: Discussion(New feature) regarding single pass data loading
> solution.
>
> Hi All,
>
> This discussion is regarding the single-pass data load solution.
>
> Currently data is loaded into carbon in 2 passes/jobs:
> 1. Generating the global dictionary using a spark job.
> 2. Encoding the data with dictionary values and creating carbondata files.
> This 2-pass solution has many disadvantages: it needs to read the data
> twice in the case of CSV file input, or it needs to execute the dataframe
> twice if data is loaded from a dataframe.
>
> In order to overcome the above issues of 2-pass data loading, we can have
> single-pass data loading; the following are the alternate solutions.
>
> Use a local dictionary
> Use a local dictionary for each carbondata file while loading data, but it
> may lead to query performance degradation and a larger memory footprint.
>
> Use a KV store/distributed map.
> *HBase/Cassandra cluster : *
> Dictionary data would be stored in the KV store, which generates the
> dictionary value if it is not already present. We all know the pros/cons of
> HBase, but the following are a few.
> Pros : These are Apache licensed.
> Easy to implement storing/retrieving dictionary values.
> Performance needs to be evaluated.
>
> Cons : Need to maintain a separate cluster for the global
> dictionary.
>
> *Hazelcast distributed map : *
> Dictionary data could be saved in the distributed concurrent hash map of
> Hazelcast. It is an in-memory map, partitioned across the nodes. And
> we can even maintain backups using sync/async functionality to avoid
> data loss when an instance goes down. We do not need to maintain a separate
> cluster for it, as it can run on the executor JVMs themselves.
> Pros: It is Apache licensed.
> No need to maintain a separate cluster, as instances can run in
> executor JVMs.
> Easy to implement and store/retrieve dictionary values.
> It is a pure Java implementation.
> There is no master/slave concept and no single point of failure.
>
> Cons: Performance needs to be evaluated.
>
> *Redis distributed map : *
> It is also an in-memory map, but it is coded in C, so we need
> Java client libraries to interact with Redis. Need to maintain a
> separate cluster for it. It also can partition the data.
> Pros : More feature-rich than Hazelcast.
> Easy to implement and store/retrieve dictionary values.
> Cons : Need to maintain a separate cluster for the global
> dictionary.
> May not be suitable for the big data stack.
> It is BSD licensed (not sure whether we can use it or not).
> Online performance figures say it is a little slower than Hazelcast.
>
> Please let me know which would be the best fit for our loading solution. And
> please add any other suitable solution if I missed one.
> --
> Thanks & Regards,
> Ravi
>
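The distributed-map alternatives above all reduce to the same core operation: look up a column value and generate a dictionary value if it is not present yet. Below is a minimal sketch of that operation, using a plain `ConcurrentHashMap` as a local stand-in for the distributed map (a Hazelcast `IMap` or an HBase table in the actual proposals); the class and method names are illustrative only, not CarbonData code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in for the distributed dictionary map discussed above.
class GlobalDictionary {
  private final ConcurrentMap<String, Integer> dict =
      new ConcurrentHashMap<String, Integer>();
  private final AtomicInteger nextKey = new AtomicInteger(0);

  // Returns the surrogate key for a column value, assigning a new one if the
  // value has not been seen before. putIfAbsent resolves concurrent loaders
  // racing on the same value (a lost race wastes one key, which is harmless).
  int getOrAssign(String value) {
    Integer existing = dict.get(value);
    if (existing != null) {
      return existing;
    }
    Integer candidate = nextKey.incrementAndGet();
    Integer raced = dict.putIfAbsent(value, candidate);
    return raced != null ? raced : candidate;
  }
}
```

With a distributed map that implements `ConcurrentMap` (as Hazelcast's `IMap` does), the same `putIfAbsent` pattern would apply cluster-wide.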
Hi Vishal,
Which part of the preparation are you considering? The column stitching on the
executor side?
Regards,
Jacky
> On Oct 12, 2016, at 9:24 PM, Kumar Vishal wrote:
>
> Hi All,
> Currently we are preparing the final result row-wise. When the number of
> columns in the projection list is high (80 columns), mainly measure columns
> or no-dictionary columns, a lot of CPU cache invalidation happens, and
> this slows down query performance.
>
> *I can think of two solutions for this problem.*
> *Solution 1*. Fill column data vertically; currently it is filled
> horizontally (this may not solve all the problems).
> *Solution 2*. Use code generation for result preparation.
>
> This is an initial idea.
>
> -Regards
> Kumar Vishal
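Solution 1 above (filling column data vertically) can be sketched as follows; the names and shapes are illustrative only, not the actual CarbonData result-preparation code. The inner loop walks one decoded column array sequentially before moving to the next, so reads of each column stay cache-friendly even when the projection has many columns.

```java
// Illustrative sketch of vertical (column-by-column) result filling.
class VerticalResultFiller {
  // columns[c][r] holds the decoded value of column c for row r.
  static Object[][] fillVertically(Object[][] columns, int rowCount) {
    Object[][] result = new Object[rowCount][columns.length];
    for (int c = 0; c < columns.length; c++) {   // one column at a time
      Object[] col = columns[c];
      for (int r = 0; r < rowCount; r++) {       // sequential read of this column
        result[r][c] = col[r];
      }
    }
    return result;
  }
}
```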
Github user lion-x commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/200#discussion_r83039531
--- Diff:
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/dataload/TestDataLoadWithTrimOption.scala
---
@@ -0,0 +1,78 @@
+package org.apache.carbondata.spark.testsuite.dataload
+
+import java.io.File
+
+import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.util.CarbonProperties
+import org.apache.spark.sql.common.util.CarbonHiveContext._
+import org.apache.spark.sql.common.util.QueryTest
+import org.scalatest.BeforeAndAfterAll
+import org.apache.spark.sql.Row
+
+/**
+ * Created by x00381807 on 2016/9/26.
--- End diff --
Oh, my fault
---
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/229#discussion_r83032371
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+ /**
+ * The output meta for this step. The data returns from this step is as
per this meta.
+ * @return
+ */
+ DataField[] getOutput();
+
+ /**
+ * Intialization process for this step.
+ * @param configuration
+ * @param child
+ * @throws CarbonDataLoadingException
+ */
+ void intialize(CarbonDataLoadConfiguration configuration,
DataLoadProcessorStep child) throws
--- End diff --
If there is an abstract class, it can have the child as its member variable;
then this `initialize` function takes no parameters as input.
---
---
Github user asfgit closed the pull request at:
https://github.com/apache/incubator-carbondata/pull/205
---
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/229#discussion_r83031958
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
--- End diff --
I think each implementation of this interface has similar logic in the
execute function; can we create an abstract class to implement the common logic?
The common logic like:
```
Iterator
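The snippet above is truncated in the digest, but the shape of the proposed base class is clear enough to sketch. The following is a hypothetical illustration (names and generics assumed, not the actual CarbonData code): the child step is a member variable, and `execute()` carries the common pull-from-child loop, so each concrete step only supplies a per-row transformation.

```java
import java.util.Collections;
import java.util.Iterator;

// Hypothetical sketch of the proposed abstract base class.
abstract class AbstractDataLoadProcessorStep {
  protected final AbstractDataLoadProcessorStep child;

  protected AbstractDataLoadProcessorStep(AbstractDataLoadProcessorStep child) {
    this.child = child;
  }

  // Common execute logic: wrap the child's iterator and transform row by row.
  public Iterator<Object[]> execute() {
    final Iterator<Object[]> childIter = (child == null)
        ? Collections.<Object[]>emptyIterator()
        : child.execute();
    return new Iterator<Object[]>() {
      public boolean hasNext() { return childIter.hasNext(); }
      public Object[] next() { return processRow(childIter.next()); }
      public void remove() { throw new UnsupportedOperationException(); }
    };
  }

  // Each concrete step only implements its per-row transformation.
  protected abstract Object[] processRow(Object[] row);
}
```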
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/229#discussion_r83028336
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java
---
@@ -0,0 +1,185 @@
+package org.apache.carbondata.processing.newflow;
--- End diff --
add license header
---
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/229#discussion_r83033524
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java
---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow.iterators;
+
+import java.io.IOException;
+
+import org.apache.carbondata.common.CarbonIterator;
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+
+import org.apache.hadoop.mapred.RecordReader;
+
+/**
+ * This iterator iterates RecordReader.
+ */
+public class RecordReaderIterator extends CarbonIterator {
--- End diff --
what is it used for?
---
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/229#discussion_r83033298
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java
---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow.iterators;
+
+import java.io.IOException;
+
+import org.apache.carbondata.common.CarbonIterator;
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+
+import org.apache.hadoop.mapred.RecordReader;
+
+/**
+ * This iterator iterates RecordReader.
+ */
+public class RecordReaderIterator extends CarbonIterator {
--- End diff --
why this is carbon-processing but not carbon-hadoop module?
---
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/229#discussion_r83032703
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+ /**
+ * The output meta for this step. The data returns from this step is as
per this meta.
+ * @return
+ */
+ DataField[] getOutput();
+
+ /**
+ * Intialization process for this step.
+ * @param configuration
+ * @param child
+ * @throws CarbonDataLoadingException
+ */
+ void intialize(CarbonDataLoadConfiguration configuration,
DataLoadProcessorStep child) throws
+ CarbonDataLoadingException;
+
+ /**
+ * Tranform the data as per the implemetation.
+ * @return Iterator of data
+ * @throws CarbonDataLoadingException
+ */
+ Iterator execute() throws CarbonDataLoadingException;
+
+ /**
+ * Any closing of resources after step execution can be done here.
+ */
+ void finish();
--- End diff --
This is called when the step finishes successfully. But what about the failure
case? Should there be a
`void close();`
method for the failure case?
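A small hypothetical driver makes the distinction concrete: `finish()` runs only on the success path, while a separate `close()` releases resources when `execute()` throws. The interface here is a simplified stand-in for `DataLoadProcessorStep`, not the actual one.

```java
// Simplified stand-in for the step interface under discussion.
interface LoadStep {
  void execute() throws Exception;
  void finish();  // success path: flush and finalize
  void close();   // failure path: release resources without finalizing
}

class StepRunner {
  // Returns true on success; guarantees close() is called if execute() fails.
  static boolean run(LoadStep step) {
    try {
      step.execute();
      step.finish();
      return true;
    } catch (Exception e) {
      step.close();
      return false;
    }
  }
}
```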
---
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/230#discussion_r83028164
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
---
@@ -252,6 +252,9 @@ private static long getMaxOfBlockAndFileSize(long
blockSize, long fileSize) {
if (remainder > 0) {
maxSize = maxSize + HDFS_CHECKSUM_LENGTH - remainder;
}
+LOGGER.info("The configured block size is " + blockSize + " byte, " +
--- End diff --
Suggest converting `blockSize` to a human-readable number before logging it;
otherwise it is hard for a human to check this value.
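A small illustrative helper for this suggestion, converting a raw byte count into a readable string before logging. This is a hypothetical sketch, not part of CarbonData:

```java
import java.util.Locale;

class ByteFormat {
  // Formats a byte count with a binary-unit suffix for log readability.
  static String humanReadable(long bytes) {
    if (bytes < 1024) {
      return bytes + " B";
    }
    String[] units = {"KB", "MB", "GB", "TB"};
    double value = bytes;
    int unit = -1;
    while (value >= 1024 && unit < units.length - 1) {
      value /= 1024;
      unit++;
    }
    return String.format(Locale.ROOT, "%.2f %s", value, units[unit]);
  }
}
```

A log line could then read `"The configured block size is " + ByteFormat.humanReadable(blockSize)` instead of printing the raw byte count.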
---
Github user Zhangshunyu commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/230#discussion_r83027603
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
---
@@ -252,6 +252,9 @@ private static long getMaxOfBlockAndFileSize(long
blockSize, long fileSize) {
if (remainder > 0) {
maxSize = maxSize + HDFS_CHECKSUM_LENGTH - remainder;
}
+LOGGER.info("The configured block size is " + blockSize + " byte, " +
--- End diff --
@jackylk It is set in MB, but here it is already converted to bytes.
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/200#discussion_r82523962
--- Diff:
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/dataload/TestDataLoadWithTrimOption.scala
---
@@ -0,0 +1,78 @@
+package org.apache.carbondata.spark.testsuite.dataload
+
+import java.io.File
+
+import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.util.CarbonProperties
+import org.apache.spark.sql.common.util.CarbonHiveContext._
+import org.apache.spark.sql.common.util.QueryTest
+import org.scalatest.BeforeAndAfterAll
+import org.apache.spark.sql.Row
+
+/**
+ * Created by x00381807 on 2016/9/26.
--- End diff --
please remove
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/230#discussion_r83021340
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
---
@@ -252,6 +252,9 @@ private static long getMaxOfBlockAndFileSize(long
blockSize, long fileSize) {
if (remainder > 0) {
maxSize = maxSize + HDFS_CHECKSUM_LENGTH - remainder;
}
+LOGGER.info("The configured block size is " + blockSize + " byte, " +
--- End diff --
Is `blockSize` in bytes or MB?
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/204#discussion_r83027008
--- Diff:
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/deleteTable/TestDeleteTableNewDDL.scala
---
@@ -97,7 +97,7 @@ class TestDeleteTableNewDDL extends QueryTest with
BeforeAndAfterAll {
"CREATE table CaseInsensitiveTable (ID int, date String, country
String, name " +
"String," +
"phonetype String, serialname String, salary int) stored by
'org.apache.carbondata.format'" +
- "TBLPROPERTIES('DICTIONARY_INCLUDE'='ID',
'DICTIONARY_INCLUDE'='salary')"
+ "TBLPROPERTIES('DICTIONARY_INCLUDE'='ID,salary')"
--- End diff --
add space after `,`
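The reason the original TBLPROPERTIES was wrong can be shown with a map: properties are map-backed, so a repeated key silently overwrites the earlier value. An illustrative snippet:

```java
import java.util.HashMap;
import java.util.Map;

class DuplicateKeyDemo {
  // With a map-backed property list, the second put() replaces the first,
  // so only one of the duplicated DICTIONARY_INCLUDE values survives.
  static String resolve() {
    Map<String, String> props = new HashMap<>();
    props.put("DICTIONARY_INCLUDE", "ID");
    props.put("DICTIONARY_INCLUDE", "salary");
    return props.get("DICTIONARY_INCLUDE");
  }
}
```

Merging both columns into a single `'DICTIONARY_INCLUDE'='ID,salary'` entry, as the patch does, avoids the silent loss.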
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/207#discussion_r83025827
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/util/CarbonTableStatusUtil.java
---
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.carbondata.processing.util;
+
+import java.text.SimpleDateFormat;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Date;
+import java.util.List;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.CarbonCommonConstants;
+import org.apache.carbondata.core.load.LoadMetadataDetails;
+
+/**
+ * This class contains all table status file utilities
+ */
+public final class CarbonTableStatusUtil {
+ private static final LogService LOGGER =
+
LogServiceFactory.getLogService(CarbonTableStatusUtil.class.getName());
+
+ private CarbonTableStatusUtil() {
+
+ }
+
+ /**
+ * updates table status details using latest metadata
+ *
+ * @param oldMetadata
+ * @param newMetadata
+ * @return
+ */
+
+ public static List<LoadMetadataDetails> updateLatestTableStatusDetails(
+ LoadMetadataDetails[] oldMetadata, LoadMetadataDetails[] newMetadata) {
+
+List<LoadMetadataDetails> newListMetadata =
+new ArrayList<LoadMetadataDetails>(Arrays.asList(newMetadata));
+for (LoadMetadataDetails oldSegment : oldMetadata) {
+ if
(CarbonCommonConstants.MARKED_FOR_DELETE.equalsIgnoreCase(oldSegment.getLoadStatus()))
{
+
updateSegmentMetadataDetails(newListMetadata.get(newListMetadata.indexOf(oldSegment)));
+ }
+}
+return newListMetadata;
+ }
+
+ /**
+ * returns current time
+ *
+ * @return
+ */
+ private static String readCurrentTime() {
+SimpleDateFormat sdf = new
SimpleDateFormat(CarbonCommonConstants.CARBON_TIMESTAMP);
+String date = null;
+
+date = sdf.format(new Date());
+
+return date;
+ }
+
+ /**
+ * updates segment status and modification time details
+ *
+ * @param loadMetadata
+ */
+ public static void updateSegmentMetadataDetails(LoadMetadataDetails
loadMetadata) {
--- End diff --
Can you improve on the function name to depict the behavior of this
function?
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/223#discussion_r83022245
--- Diff: docs/DML-Operations-on-Carbon.md ---
@@ -104,8 +109,10 @@ Following are the options that can be used in load
data:
'MULTILINE'='true', 'ESCAPECHAR'='\',
'COMPLEX_DELIMITER_LEVEL_1'='$',
'COMPLEX_DELIMITER_LEVEL_2'=':',
-
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary'
+
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary',
+
'COLUMNDICT'='empno:/dictFilePath/empnoDict.csv,
empname:/dictFilePath/empnameDict.csv'
--- End diff --
do not give this option, since it cannot be used together with
`ALL_DICTIONARY_PATH`
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/229#discussion_r83018479
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This is the base interface for data loading. It can perform transformation
+ * jobs as per the implementation.
+ */
+public interface DataLoadProcessorStep {
+
+ /**
+ * The output meta for this step. The data returned from this step conforms
+ * to this meta.
+ * @return
+ */
+ DataField[] getOutput();
+
+ /**
+ * Initialization process for this step.
+ * @param configuration
+ * @param child
+ * @throws CarbonDataLoadingException
+ */
+ void initialize(CarbonDataLoadConfiguration configuration, DataLoadProcessorStep child)
+ throws CarbonDataLoadingException;
+
+ /**
+ * Transform the data as per the implementation.
+ * @return Iterator of data
+ * @throws CarbonDataLoadingException
+ */
+ Iterator execute() throws CarbonDataLoadingException;
--- End diff --
I think `execute()` is called for every parallel unit of the input, right?
For example, when using Spark to load from a dataframe, `execute()` is called
for every Spark partition (one task per partition). When loading from a CSV
file on HDFS, `execute()` is called for every HDFS block. So I do not think
returning an array of iterators is required.
In the loading process of carbon on every executor, some of the steps can be
parallelized, but the sort step needs to be synchronized (a potential
bottleneck), since we need datanode-scope sorting. Am I correct?
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/218#discussion_r83014871
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/surrogatekeysgenerator/csvbased/BadRecordslogger.java
---
@@ -69,9 +68,13 @@
private BufferedWriter bufferedCSVWriter;
private DataOutputStream outCSVStream;
/**
- *
+ * bad record log file path
+ */
+ private String logFilePath;
+ /**
+ * csv file path
*/
- private CarbonFile logFile;
+ private String csvFilePath;
--- End diff --
What is this csv file? What is the difference from logFilePath?
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/218#discussion_r83015590
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/surrogatekeysgenerator/csvbased/CarbonCSVBasedSeqGenStep.java
---
@@ -458,9 +462,11 @@ public boolean processRow(StepMetaInterface smi,
StepDataInterface sdi) throws K
break;
case REDIRECT:
badRecordsLogRedirect = true;
+ badRecordConvertNullDisable= true;
--- End diff --
add space before `=`
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/218#discussion_r83014256
--- Diff:
integration/spark/src/main/java/org/apache/carbondata/spark/load/CarbonLoadModel.java
---
@@ -117,9 +117,9 @@
private String badRecordsLoggerEnable;
/**
- * defines the option to specify the bad record log redirect to raw csv
+ * defines the option to specify the bad record logger action
*/
- private String badRecordsLoggerRedirect;
+ private String badRecordsLoggerAction;
--- End diff --
This action is not for Logger, right? Perhaps `badRecordsAction` is a
better name?
And it should be an enum instead of String
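A sketch of the suggested enum replacing the String field. `FORCE` and `REDIRECT` appear in the quoted diffs elsewhere in this thread; the other values and the parsing helper are assumptions:

```java
import java.util.Locale;

// Hypothetical enum for the bad-records action, replacing a free-form String.
enum BadRecordsAction {
  FORCE, REDIRECT, IGNORE, FAIL;

  // Parses a user-supplied option value case-insensitively.
  static BadRecordsAction from(String value) {
    return valueOf(value.trim().toUpperCase(Locale.ROOT));
  }
}
```

An enum makes invalid option values fail fast at parse time instead of being silently carried around as strings.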
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/227#discussion_r83012391
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
---
@@ -197,8 +197,9 @@ public AbstractFactDataWriter(String storeLocation, int
measureCount, int mdKeyL
blockIndexInfoList = new ArrayList<>();
// get max file size;
CarbonProperties propInstance = CarbonProperties.getInstance();
-this.fileSizeInBytes = blocksize *
CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR
-* CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR * 1L;
+// if blocksize=2048, then 2048*1024*1024 will beyond the range of Int
+this.fileSizeInBytes = 1L * blocksize *
CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR
--- End diff --
instead of multiplying by `1L`, you can just cast `blocksize` to
`(long) blocksize` instead
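The overflow being fixed here is easy to reproduce: with `blocksize = 2048`, the product `blocksize * 1024 * 1024` is evaluated in `int` arithmetic and wraps before it is widened to `long`:

```java
class BlockSizeOverflowDemo {
  // int arithmetic: 2048 * 1024 * 1024 == 2^31, which wraps to
  // Integer.MIN_VALUE before the result is widened to long.
  static long wrong(int blocksize) {
    return blocksize * 1024 * 1024;
  }

  // Widening the first operand forces the whole product into long arithmetic.
  static long right(int blocksize) {
    return (long) blocksize * 1024 * 1024;
  }
}
```

Multiplying by `1L` first, as the patch does, achieves the same widening; the explicit cast simply states the intent more directly.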
---
Github user sujith71955 commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/224#discussion_r82983904
--- Diff:
core/src/main/java/org/apache/carbondata/scan/scanner/impl/FilterScanner.java
---
@@ -78,10 +80,11 @@ public FilterScanner(BlockExecutionInfo
blockExecutionInfo) {
* @throws QueryExecutionException
* @throws FilterUnsupportedException
*/
- @Override public AbstractScannedResult scanBlocklet(BlocksChunkHolder
blocksChunkHolder)
+ @Override public AbstractScannedResult scanBlocklet(BlocksChunkHolder
blocksChunkHolder,
+ QueryStatisticsModel
queryStatisticsModel)
throws QueryExecutionException {
try {
- fillScannedResult(blocksChunkHolder);
+ fillScannedResult(blocksChunkHolder, queryStatisticsModel);
--- End diff --
Pass the model in the constructor so there is no need to change all the APIs
---
Github user sujith71955 commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/200#discussion_r82977592
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java
---
@@ -102,8 +102,8 @@ public void initialize() throws IOException {
parserSettings.setMaxColumns(
getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(),
csvParserVo.getMaxColumns()));
parserSettings.setNullValue("");
-parserSettings.setIgnoreLeadingWhitespaces(false);
-parserSettings.setIgnoreTrailingWhitespaces(false);
+parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
--- End diff --
The issue with this approach: suppose in one load the user loads dirty data,
then realizes the data should have been trimmed; in the next load he enables
the option and loads again. This will increase the dictionary space, and the
dictionary lookup overhead in queries will also increase.
---
Hi Jacky,
1. Yes. It is better to keep all sorting logic in one step so that other
types of sorts can be implemented easily. I will update the design.
2. EncoderProcessorStep can do dictionary encoding and convert no-dictionary
and complex types to their byte[] representation. Here the encoding interface
is flexible, so the user can supply a different encoding representation, but
at row level only. RLE, DELTA, and heavy compression are done in
DataWriterProcessorStep, because these encodings/compressions happen at
blocklet level, not row level.
3. Yes, each step requires a schema definition, which is passed as
DataField[] through the configuration to the initial step, InputProcessorStep.
The remaining steps can call child.getOutput() to get the schema. Here each
DataField represents one column.
Regards,
Ravi
On 12 October 2016 at 09:38, Jacky Li wrote:
> Hi Ravindra,
>
> Regarding the design
> (https://drive.google.com/file/d/0B4TWTVbFSTnqTF85anlDOUQ5S1BqY
> zFpLWcwZnBLSVVqSWpj/view),
> I have following question:
>
> 1. In SortProcessorStep, I think it is better to include MergeSort in this
> step also, so it includes all logic for sorting. In this case, developer
> can
> implement a external sort (spill to files only if necessary), then the
> loading process is a on-line sorting if memory is sufficient. I think it
> will improve loading performance a lot.
>
> 2. In EncoderProcessorStep, apart from the dictionary encoding, what other
> processing it will do? How about delta, RLE, etc.
>
> 3. In InputProcessorStep, it needs some schema definition to parse the
> input
> and convert to the row, right? For example, how to read from JSON, AVRO
> file?
>
> Regards,
> Jacky
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Discussion-
> regrading-design-of-data-load-after-kettle-removal-tp1672p1783.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>
--
Thanks & Regards,
Ravi
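The schema chaining described in point 3 can be sketched roughly as follows; the names are simplified stand-ins for the actual CarbonData classes:

```java
// Each step obtains its input schema from its child via getOutput();
// the initial step gets its schema from the configuration instead.
class Field {
  final String column;
  Field(String column) { this.column = column; }
}

interface LoadStep {
  Field[] getOutput();
}

class InputStep implements LoadStep {
  private final Field[] schema;
  InputStep(Field[] schema) { this.schema = schema; } // from configuration
  public Field[] getOutput() { return schema; }
}

class EncoderStep implements LoadStep {
  private final LoadStep child;
  EncoderStep(LoadStep child) { this.child = child; }
  // This sketch passes the child's schema through unchanged; a real encoder
  // step could rewrite the field metadata it transforms.
  public Field[] getOutput() { return child.getOutput(); }
}
```

Each wrapping step can thus derive its output meta from its child without any global schema registry.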
Github user sujith71955 commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/200#discussion_r82968468
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java
---
@@ -102,8 +102,8 @@ public void initialize() throws IOException {
parserSettings.setMaxColumns(
getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(),
csvParserVo.getMaxColumns()));
parserSettings.setNullValue("");
-parserSettings.setIgnoreLeadingWhitespaces(false);
-parserSettings.setIgnoreTrailingWhitespaces(false);
+parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
--- End diff --
Guys, I think if during data loading we read from the configuration whether
to trim or not, we need to make the same decision, based on the configuration
value, when building filters.
For example: if I enable the trim property while loading, the system will trim
and load the data; then in a filter query, if the user provides whitespace, it
needs to be trimmed while creating the filter model as well. This provides
better system consistency. Likewise, if the user disables trim, we won't trim
either while loading or while querying.
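The consistency argument can be sketched as a single policy applied on both paths. This is a hypothetical helper, not CarbonData code:

```java
// Applies the same trim decision at load time and at filter-build time,
// so a value written during load matches the literal used in a query.
class TrimPolicy {
  private final boolean trimEnabled;

  TrimPolicy(boolean trimEnabled) { this.trimEnabled = trimEnabled; }

  String normalizeForLoad(String raw) {
    return trimEnabled ? raw.trim() : raw;
  }

  String normalizeForFilter(String literal) {
    return trimEnabled ? literal.trim() : literal;
  }
}
```

Routing both code paths through one policy object is what prevents the load side and the query side from drifting apart.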
---
Github user lion-x commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/200#discussion_r82960653
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java
---
@@ -102,8 +102,8 @@ public void initialize() throws IOException {
parserSettings.setMaxColumns(
getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(),
csvParserVo.getMaxColumns()));
parserSettings.setNullValue("");
-parserSettings.setIgnoreLeadingWhitespaces(false);
-parserSettings.setIgnoreTrailingWhitespaces(false);
+parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
--- End diff --
Hi Sujith,
I agree with Eason: when the user queries with a whitespace filter, it should
be considered a forbidden action.
---
Github user Zhangshunyu commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/224#discussion_r82958230
--- Diff:
core/src/main/java/org/apache/carbondata/scan/processor/AbstractDataBlockIterator.java
---
@@ -127,11 +133,15 @@ protected boolean updateScanner() {
}
}
- private AbstractScannedResult getNextScannedResult() throws
QueryExecutionException {
+ private AbstractScannedResult
getNextScannedResult(QueryStatisticsRecorder recorder,
--- End diff --
@sujith71955 OK, I will use a statistics model, thanks!
---