Re: Question related to lazy decoding optimization

2017-03-08 Thread Ravindra Pesala
Hi Yong Zhang,

Thank you for analyzing CarbonData.
Yes, lazy decoding is only possible because the dictionaries are global.
At data load time CarbonData generates the global dictionary values.
There are two ways to generate the global dictionary values:
1. Launch a job that reads all the input data, finds the distinct values of
each column, and assigns dictionary values to them (a rough sketch of this idea
follows below). Then the actual loading job starts; it only encodes the data
with the already generated dictionary values and writes it out in the
CarbonData format.
2. Launch a dictionary server/client to generate the global dictionary during
the load job itself. The load consults the dictionary server to get the global
dictionary values for the fields.
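For illustration only, here is a hedged sketch of the idea behind approach 1
(not the actual CarbonData code path; the DataFrame `input`, the file path, and
the column name c3 are made up, and it assumes Spark 2.x):

import org.apache.spark.sql.SparkSession

// Sketch: scan the input once, collect the distinct values of a dictionary
// column, and assign each value a deterministic global surrogate key.
object GlobalDictSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("global-dict-sketch").getOrCreate()
    val input = spark.read.option("header", "true").csv("hdfs:///tmp/t1.csv") // hypothetical path

    val dictForC3: Map[String, Int] = input
      .select("c3")
      .distinct()
      .collect()
      .map(_.getString(0))
      .sorted                                            // deterministic assignment
      .zipWithIndex
      .map { case (value, idx) => value -> (idx + 1) }   // 1-based surrogate keys
      .toMap

    // The real load job would then encode every c3 value via dictForC3(value)
    // and write the encoded column in the CarbonData format.
    spark.stop()
  }
}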

Yes, compared to a local dictionary this is a little more expensive, but with
this approach we get better compression and better performance through lazy
decoding.
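To make the lazy-decoding point concrete, here is a hedged sketch using plain
Scala collections (illustration only, not the real CarbonData/Spark operators):
because the dictionary is global, the group-by and the sum run entirely on the
integer surrogates, and the dictionary lookup happens only once per distinct
group at the very end.

// Hedged illustration of lazy decoding with a GLOBAL dictionary.
// Rows arrive from many hosts already encoded; "US" is 1 on every host,
// so grouping on the surrogate key is safe across the cluster.
val dict: Map[Int, String] = Map(1 -> "US", 2 -> "CN", 3 -> "IN")   // hypothetical dictionary

// (encoded c3, c2) pairs as they would come out of the scans on different hosts
val rows: Seq[(Int, Long)] = Seq((1, 10L), (2, 5L), (1, 7L), (3, 2L))

// "select c3, sum(c2) from t1 group by c3" executed on the encoded values
val aggregated: Map[Int, Long] =
  rows.groupBy(_._1).mapValues(_.map(_._2).sum).toMap

// Decode lazily: one dictionary lookup per distinct group, not per row.
val result: Map[String, Long] = aggregated.map { case (code, s) => dict(code) -> s }
// result: Map(US -> 17, CN -> 5, IN -> 2)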



Regards,
Ravindra.

On 9 March 2017 at 00:01, Yong Zhang  wrote:

> Hi,
>
>
> I watched one session on "Apache CarbonData" at Spark Summit 2017. The video
> is here: https://www.youtube.com/watch?v=lhsAg2H_GXc ("Apache CarbonData: An
> Indexed Columnar File Format for Interactive Query" by Jacky Li / Jihong Ma).
>
> Starting at 23:10, the speaker talks about the lazy decoding optimization,
> and the example given in the talk is the following:
>
> "select c3, sum(c2) from t1 group by c3" -- the point being that c3 can be
> aggregated directly on its encoded value (an integer, say, if a String
> column c3 is dictionary-encoded as int). I assume this is in fact done
> inside the Spark execution engine, as the speaker described.
>
> But I am really not sure I understand how this is possible, especially in
> Spark. If CarbonData were the storage format for a framework on a single
> box, I could imagine it and understand the value it brings. But for a
> distributed execution engine like Spark, the data will come from multiple
> hosts, and Spark has to deserialize the data for grouping/aggregating (c3
> in this case). Even if Spark delegates this to the underlying storage
> engine somehow, how will CarbonData make sure that all the values are
> encoded the same way globally? Won't it just encode consistently per file?
> Encoding globally is just too expensive. But without it, I don't see how
> this lazy decoding can work.
>
> I have just started researching this project, so maybe there is something
> underlying it that I don't understand.
>
>
> Thanks
>
>
> Yong
>



-- 
Thanks & Regards,
Ravi


Re: question about dimColumnExecuterInfo.getFilterKeys()

2017-03-08 Thread Ravindra Pesala
Hi,

The filter values that come from the query are converted to their respective
surrogate (dictionary) values and sorted on those surrogate values before the
filter is applied.
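A minimal sketch of that idea (illustration only, with a hypothetical
dictionary; not the actual getFilterKeys() implementation): the literal filter
values are looked up in the dictionary, and the resulting surrogate keys are
sorted so the scan can probe them with a binary search.

import java.util.Arrays

// Hypothetical column dictionary: value -> surrogate key
val dictionary: Map[String, Int] = Map("apple" -> 3, "banana" -> 1, "cherry" -> 2)

// Filter literals coming from the query, e.g. WHERE fruit IN ('cherry', 'apple')
val filterLiterals = Seq("cherry", "apple")

// Convert to surrogate keys and sort, mirroring how the sorted filterValues
// arrays can be probed during the scan.
val filterKeys: Array[Int] = filterLiterals.flatMap(dictionary.get).sorted.toArray
// filterKeys: Array(2, 3)

// In CarbonData the keys are kept as byte[][]; the analogous check is a
// binary search over the sorted keys:
val matches = Arrays.binarySearch(filterKeys, 3) >= 0   // true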


Regards,
Ravindra

On 8 March 2017 at 09:55, 马云  wrote:

> Hi  Dev,
>
>
> When running a filter query, I can see a filter byte array.
> Is filterValues always ordered by the dictionary value?
> If not, in which case is it unordered? Thanks.
>
>
>
>  byte[][] filterValues = dimColumnExecuterInfo.getFilterKeys();
>
>
>
>
>


-- 
Thanks & Regards,
Ravi


Re: I loaded the data with the timestamp field unsuccessful

2017-03-08 Thread kex
This problem has been solved. Running the same SQL on another cluster works
fine, so the carbon.properties of the original cluster probably has a problem.






Question related to lazy decoding optimization

2017-03-08 Thread Yong Zhang
Hi,


I watched one session on "Apache CarbonData" at Spark Summit 2017. The video is
here: https://www.youtube.com/watch?v=lhsAg2H_GXc ("Apache CarbonData: An
Indexed Columnar File Format for Interactive Query" by Jacky Li / Jihong Ma).

Starting at 23:10, the speaker talks about the lazy decoding optimization, and
the example given in the talk is the following:

"select c3, sum(c2) from t1 group by c3" -- the point being that c3 can be
aggregated directly on its encoded value (an integer, say, if a String column
c3 is dictionary-encoded as int). I assume this is in fact done inside the
Spark execution engine, as the speaker described.

But I am really not sure I understand how this is possible, especially in
Spark. If CarbonData were the storage format for a framework on a single box, I
could imagine it and understand the value it brings. But for a distributed
execution engine like Spark, the data will come from multiple hosts, and Spark
has to deserialize the data for grouping/aggregating (c3 in this case). Even if
Spark delegates this to the underlying storage engine somehow, how will
CarbonData make sure that all the values are encoded the same way globally?
Won't it just encode consistently per file? Encoding globally is just too
expensive. But without it, I don't see how this lazy decoding can work.

I have just started researching this project, so maybe there is something
underlying it that I don't understand.


Thanks


Yong


Re: I loaded the data with the timestamp field unsuccessful

2017-03-08 Thread Liang Chen
Hi

Has the issue been fixed?
BTW, you don't need to add the date column to DICTIONARY_INCLUDE; CarbonData
indexes date/timestamp columns anyway.
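For reference, a minimal sketch of what the DDL could look like without
DICTIONARY_INCLUDE on the timestamp column (adapted from the DDL in the
original mail; the year pattern is spelled out explicitly and may need
adjusting to your raw data):

carbon.sql("""CREATE TABLE IF NOT EXISTS test1 (date timestamp, id string)
              STORED BY 'carbondata'
              TBLPROPERTIES ('DATEFORMAT'='date:yyyy/MM/dd')""")

carbon.sql("""LOAD DATA INPATH 'hdfs://myha/user/carbon/testdata/test4.csv'
              INTO TABLE test1 OPTIONS ('FILEHEADER'='date,id')""")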

Regards
Liang

kex wrote
> Loading data with a timestamp field fails, and the timestamp field comes out
> null.
> 
> my sql:
> carbon.sql("create TABLE IF NOT EXISTS test1 (date timestamp,id string)
> STORED BY 'carbondata' TBLPROPERTIES
> ('DICTIONARY_INCLUDE'='date','DATEFORMAT'='date:yyyy/MM/dd')")
> 
> carbon.sql("LOAD DATA inpath 'hdfs://myha/user/carbon/testdata/test4.csv'
> INTO TABLE test1 options('FILEHEADER'='date,id')")
> 
> my data test4.csv:
> 2017/3/23,2
> 2017/1/11,1
> 2017/9/17,3
> 
> when i select:
> ++---+
> |date| id|
> ++---+
> |null|  1|
> |null|  2|
> |null|  3|
> ++---+
> 
> I printed the time format and it is correct:
> println(CarbonProperties.getInstance().getProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT))
> yyyy/MM/dd
> 
> What could be the reason for it?
> thx.







Re: Apache CarbonData online meetup on 13th Mar,2017

2017-03-08 Thread Liang Chen
Hi phalodi

Sorry about that.
The Apache CarbonData community will organize a meetup in India soon.

Regards
Liang

phalodi wrote
> Hi, I also want to join this meetup, but when I register for the meetup and
> proceed to pay, it does not show Indian banks among the payment options.
> 
> On Tue, Mar 7, 2017 at 8:42 AM, ZhuWilliam <allwefantasy@...> wrote:
> 
>> CarbonData's inverted index, BTree implementation, and how CarbonData does
>> the Spark integration
>>
>>
>>







[jira] [Created] (CARBONDATA-753) Fix Date and Timestamp format issues

2017-03-08 Thread Liang Chen (JIRA)
Liang Chen created CARBONDATA-753:
-

 Summary: Fix Date and Timestamp format issues
 Key: CARBONDATA-753
 URL: https://issues.apache.org/jira/browse/CARBONDATA-753
 Project: CarbonData
  Issue Type: Bug
  Components: core, examples
Affects Versions: 1.0.0-incubating
Reporter: Liang Chen
Assignee: Liang Chen
Priority: Minor
 Fix For: 1.1.0-incubating, 1.0.1-incubating


Fix Date and Timestamp format issues:
1. Optimize the description of CARBON_TIMESTAMP_FORMAT and CARBON_DATE_FORMAT
in CarbonCommonConstants.java.
2. Correct the field definitions of Date and Timestamp in the examples.
3. Add an example of how to show a timestamp in the raw data's format.
Currently spark.sql(...).show() by default uses "yyyy-mm-dd hh:mm:ss.f" as the
Timestamp.toString() format, while users usually want the raw data format.
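As a possible illustration of item 3 (a hedged sketch, not necessarily the
example that will be added; `spark` is a SparkSession and `test1` a
hypothetical table):

import org.apache.spark.sql.functions.{col, date_format}

// Render the stored timestamp in the raw input format instead of the default
// Timestamp.toString() rendering used by show().
spark.sql("SELECT date, id FROM test1")
  .select(date_format(col("date"), "yyyy/MM/dd").as("date"), col("id"))
  .show()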




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-752) creating complex type gives exception

2017-03-08 Thread anubhav tarar (JIRA)
anubhav tarar created CARBONDATA-752:


 Summary: creating complex type gives exception
 Key: CARBONDATA-752
 URL: https://issues.apache.org/jira/browse/CARBONDATA-752
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Affects Versions: 1.0.0-incubating
 Environment: spark 2.1
Reporter: anubhav tarar
Assignee: anubhav tarar
Priority: Trivial


Using a complex type in CREATE TABLE gives an exception:

spark.sql(
  s"""
 | CREATE TABLE carbon_table(
 |shortField short,
 |intField int,
 |bigintField long,
 |doubleField double,
 |stringField string,
 |timestampField timestamp,
 |decimalField decimal(18,2),
 |dateField date,
 |charField char(5),
 |floatField float,
 |complexData array<string>
 | )
 | STORED BY 'CARBONDATA'
 | TBLPROPERTIES('DICTIONARY_INCLUDE'='dateField, charField')
   """.stripMargin)

It gives this exception:

Caused by: java.lang.RuntimeException: Unsupported data type: 
ArrayType(StringType,true)
at scala.sys.package$.error(package.scala:27)
at 
org.apache.carbondata.spark.util.DataTypeConverterUtil$.convertToCarbonTypeForSpark2(DataTypeConverterUtil.scala:61



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: please help for outofmemory issue in eclipse

2017-03-08 Thread 马云
Please ignore my issue.
I changed the JDK from 1.8 to 1.7 and added the JVM options below; it runs
successfully now.


-Xmx3550m -Xms3550m -XX:MaxPermSize=512m
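(A placement note, added as an assumption rather than something stated in the
original mail: in Eclipse these VM options would typically go on the
CarbonExample launch configuration, under Run Configurations > Arguments > VM
arguments. Note also that -XX:MaxPermSize only has an effect on JDK 7 and
below, since JDK 8 removed the permanent generation, which is consistent with
the switch back to JDK 1.7.)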










At 2017-03-08 17:20:58, "马云"  wrote:

Hi dev,


Today I started setting up CarbonData 1.0 in my local Eclipse.
I used "-X -DskipTests -Pspark-1.6 -Dspark.version=1.6.2 clean package" and the
Maven build in Eclipse succeeded.
But when I run the CarbonExample in Eclipse, it shows the issue below (refer to
the log).
Even if I configure -Xmx10g -Xms10g, it still shows the issue.


Can anyone help me? Thanks.



INFO  08-03 16:50:59,037 - Running Spark version 1.6.2

WARN  08-03 16:51:01,624 - Unable to load native-hadoop library for your 
platform... using builtin-java classes where applicable

INFO  08-03 16:51:01,752 - Changing view acls to: mayun

INFO  08-03 16:51:01,753 - Changing modify acls to: mayun

INFO  08-03 16:51:01,754 - SecurityManager: authentication disabled; ui acls 
disabled; users with view permissions: Set(mayun); users with modify 
permissions: Set(mayun)

INFO  08-03 16:51:02,274 - Successfully started service 'sparkDriver' on port 
51080.

INFO  08-03 16:51:02,609 - Slf4jLogger started

INFO  08-03 16:51:02,649 - Starting remoting

INFO  08-03 16:51:02,808 - Remoting started; listening on addresses 
:[akka.tcp://sparkDriverActorSystem@10.100.56.61:51081]

INFO  08-03 16:51:02,814 - Successfully started service 
'sparkDriverActorSystem' on port 51081.

INFO  08-03 16:51:02,824 - Registering MapOutputTracker

INFO  08-03 16:51:02,844 - Registering BlockManagerMaster

INFO  08-03 16:51:02,857 - Created local directory at 
/private/var/folders/qg/b6zvdz3n1cggqx66yzc6m_s4gn/T/blockmgr-85a89cab-9e48-4708-be89-cde6951285fe

INFO  08-03 16:51:02,870 - MemoryStore started with capacity 12.7 GB

INFO  08-03 16:51:02,926 - Registering OutputCommitCoordinator

INFO  08-03 16:51:03,072 - jetty-8.y.z-SNAPSHOT

INFO  08-03 16:51:03,118 - Started SelectChannelConnector@0.0.0.0:4040

INFO  08-03 16:51:03,118 - Successfully started service 'SparkUI' on port 4040.

INFO  08-03 16:51:03,121 - Started SparkUI at http://10.100.56.61:4040

INFO  08-03 16:51:03,212 - Starting executor ID driver on host localhost

INFO  08-03 16:51:03,228 - Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 51082.

INFO  08-03 16:51:03,229 - Server created on 51082

INFO  08-03 16:51:03,230 - Trying to register BlockManager

INFO  08-03 16:51:03,233 - Registering block manager localhost:51082 with 12.7 
GB RAM, BlockManagerId(driver, localhost, 51082)

INFO  08-03 16:51:03,234 - Registered BlockManager

Starting CarbonExample using spark version 1.6.2

Exception in thread "main" 

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler 
in thread "main"













 

please help for outofmemory issue in eclipse

2017-03-08 Thread 马云
Hi dev,


Today I started setting up CarbonData 1.0 in my local Eclipse.
I used "-X -DskipTests -Pspark-1.6 -Dspark.version=1.6.2 clean package" and the
Maven build in Eclipse succeeded.
But when I run the CarbonExample in Eclipse, it shows the issue below (refer to
the log).
Even if I configure -Xmx10g -Xms10g, it still shows the issue.


Can anyone help me? Thanks.



INFO  08-03 16:50:59,037 - Running Spark version 1.6.2

WARN  08-03 16:51:01,624 - Unable to load native-hadoop library for your 
platform... using builtin-java classes where applicable

INFO  08-03 16:51:01,752 - Changing view acls to: mayun

INFO  08-03 16:51:01,753 - Changing modify acls to: mayun

INFO  08-03 16:51:01,754 - SecurityManager: authentication disabled; ui acls 
disabled; users with view permissions: Set(mayun); users with modify 
permissions: Set(mayun)

INFO  08-03 16:51:02,274 - Successfully started service 'sparkDriver' on port 
51080.

INFO  08-03 16:51:02,609 - Slf4jLogger started

INFO  08-03 16:51:02,649 - Starting remoting

INFO  08-03 16:51:02,808 - Remoting started; listening on addresses 
:[akka.tcp://sparkDriverActorSystem@10.100.56.61:51081]

INFO  08-03 16:51:02,814 - Successfully started service 
'sparkDriverActorSystem' on port 51081.

INFO  08-03 16:51:02,824 - Registering MapOutputTracker

INFO  08-03 16:51:02,844 - Registering BlockManagerMaster

INFO  08-03 16:51:02,857 - Created local directory at 
/private/var/folders/qg/b6zvdz3n1cggqx66yzc6m_s4gn/T/blockmgr-85a89cab-9e48-4708-be89-cde6951285fe

INFO  08-03 16:51:02,870 - MemoryStore started with capacity 12.7 GB

INFO  08-03 16:51:02,926 - Registering OutputCommitCoordinator

INFO  08-03 16:51:03,072 - jetty-8.y.z-SNAPSHOT

INFO  08-03 16:51:03,118 - Started SelectChannelConnector@0.0.0.0:4040

INFO  08-03 16:51:03,118 - Successfully started service 'SparkUI' on port 4040.

INFO  08-03 16:51:03,121 - Started SparkUI at http://10.100.56.61:4040

INFO  08-03 16:51:03,212 - Starting executor ID driver on host localhost

INFO  08-03 16:51:03,228 - Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 51082.

INFO  08-03 16:51:03,229 - Server created on 51082

INFO  08-03 16:51:03,230 - Trying to register BlockManager

INFO  08-03 16:51:03,233 - Registering block manager localhost:51082 with 12.7 
GB RAM, BlockManagerId(driver, localhost, 51082)

INFO  08-03 16:51:03,234 - Registered BlockManager

Starting CarbonExample using spark version 1.6.2

Exception in thread "main" 

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler 
in thread "main"










Re: I loaded the data with the timestamp field unsuccessful

2017-03-08 Thread QiangCai
Try yyyy/M/dd.
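If it helps, one possible way to set the format programmatically before loading
(a sketch that assumes an addProperty counterpart to the
CarbonProperties.getInstance().getProperty(...) call shown in the original
mail):

import org.apache.carbondata.core.constants.CarbonCommonConstants
import org.apache.carbondata.core.util.CarbonProperties

// Make the timestamp format match the raw data, e.g. 2017/3/23.
CarbonProperties.getInstance()
  .addProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT, "yyyy/M/dd")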

Best Regards
David CaiQiang


