Re: [Discussion] Implement Partition Table Feature

2017-04-17 Thread Lu Cao
1. Carbon uses different SQL parsers in Spark 1.6 and 2.1, so CarbonSQLParser
needs to be changed for 1.6.
2. For interval range partitions, no fixed partition name is defined in the DDL,
but partition names still need to be kept in the schema and updated when a new
partition is added (see the sketch after this list).
3. One BTree per partition per segment on the driver side.
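
For item 2, a rough sketch of what such a DDL could look like (illustrative
only, not an agreed syntax), with the range boundaries carried in table
properties so partition names can be generated and appended to the schema as
new partitions are added. Here cc is a CarbonContext/session and the table and
column names are placeholders:

  // Illustrative sketch only -- proposed syntax, not the final DDL
  cc.sql("""
    CREATE TABLE IF NOT EXISTS sales (
      order_id STRING,
      amount DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED BY 'carbondata'
    TBLPROPERTIES ('PARTITION_TYPE' = 'RANGE',
                   'RANGE_INFO'     = '2016-01-01, 2017-01-01')
  """)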

On Mon, Apr 17, 2017 at 3:29 PM, QiangCai  wrote:

> sub-task list of Partition Table Feature:
>
> 1. Define PartitionInfo model
> Modify schema.thrift to define PartitionInfo, and add PartitionInfo to
> TableSchema.
>
> 2. Create Table with Partition
> CarbonSparkSqlParser parses the partition part to generate PartitionInfo and
> adds PartitionInfo to TableModel.
>
> CreateTable adds PartitionInfo to TableInfo and stores PartitionInfo in
> TableSchema.
>
> 3. Data loading of partition table
> Use PartitionInfo to generate a Partitioner (hash, list, range).
> Use the Partitioner to repartition the input data file, reusing the
> loadDataFrame flow.
> Use the partition id to replace the task no in the carbondata/index file names.
>
> 4. Detail filter query on partition column
> Support an equal filter to get the partition id, and use this partition id to
> filter the BTree.
> In the future, other filters (range, in, ...) will be supported.
>
> 5. Partition tables join on partition column
>
> 6. Alter table add/drop partition
>
> Any suggestion?
>
> Best Regards,
> David QiangCai
>
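
For sub-task 3 above ("use PartitionInfo to generate Partitioner"), a
conceptual hash-partitioner sketch in plain Spark terms, not the actual
CarbonData implementation (partitionColumnValue is a placeholder function):

  import org.apache.spark.Partitioner

  // Conceptual only: hash rows to partitions by the partition-column value.
  class CarbonHashPartitioner(override val numPartitions: Int) extends Partitioner {
    override def getPartition(key: Any): Int = {
      val hash = if (key == null) 0 else key.hashCode()
      ((hash % numPartitions) + numPartitions) % numPartitions  // keep non-negative
    }
  }

  // usage sketch:
  //   rdd.keyBy(row => partitionColumnValue(row))
  //      .partitionBy(new CarbonHashPartitioner(numPartitions))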


bucket table

2017-04-12 Thread Lu Cao
Hi Dev,
I created a bucket table and loaded 100 million rows of data, but there is
only one data file and one index file in the path.
It returned a "partially success" error when the load finished.

I debugged the program and it shows that the bad record is not null.
Can anyone explain why a bad record causes an incorrect bucket number, and
is there any prompt for users to fix the bad-record issue?
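
For context, the bucket-table DDL I am referring to looks roughly like the
sketch below, assuming the BUCKETNUMBER/BUCKETCOLUMNS table properties (table
and column names here are placeholders, not my actual schema):

  // Sketch only -- column names are placeholders
  cc.sql("""
    CREATE TABLE IF NOT EXISTS bucket_test (
      id STRING,
      value DOUBLE
    )
    STORED BY 'carbondata'
    TBLPROPERTIES ('BUCKETNUMBER' = '4', 'BUCKETCOLUMNS' = 'id')
  """)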

Thanks,
Lionel


Re: Getting Error in Cloudera Distribution

2017-04-07 Thread Lu Cao
You only need to add the assembly jar instead of the jars you listed, e.g.
assembly/target/scala-2.10/carbondata_2.10-1.1.0-incubating-SNAPSHOT-shade-hadoop2.2.0.jar
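
For example, something along these lines (the path is just a placeholder for
wherever your local build put the assembly jar):

  spark-shell --jars /<path-to-carbondata>/assembly/target/scala-2.10/carbondata_2.10-1.1.0-incubating-SNAPSHOT-shade-hadoop2.2.0.jar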

On Fri, Apr 7, 2017 at 4:28 PM, Srigopal Mohanty <srigopalmoha...@gmail.com>
wrote:

> Spark version is 1.6
>
> Yes, I am using spark-shell --jars 
>
> The jars present are -
>
>- carbondata-common-1.1.0-incubating-SNAPSHOT.jar
>- carbondata-core-1.1.0-incubating-SNAPSHOT.jar
>- carbondata-hadoop-1.1.0-incubating-SNAPSHOT.jar
>- carbondata-processing-1.1.0-incubating-SNAPSHOT.jar
>
>
> On 07-Apr-2017 1:36 pm, "Lu Cao" <whuca...@gmail.com> wrote:
>
> What's the spark version you're using?
> Did you add "--jars " when you start the spark shell?
>
> On Fri, Apr 7, 2017 at 3:38 PM, Srigopal Mohanty <
> srigopalmoha...@gmail.com>
> wrote:
>
> > Hi Lionel,
> >
> >   I followed the same link. I am using cloudera sandbox, which is
> > preconfigured.
> >
> CarbonData was cloned from the git URL specified, and the maven build was done
> as per the steps mentioned.
> >
> > Thanks,
> > Srigopal
> >
> > On 07-Apr-2017 1:05 pm, "Lu Cao" <whuca...@gmail.com> wrote:
> >
> > Hi Srigopal,
> > You can follow this:
> > https://github.com/apache/incubator-carbondata/blob/
> > master/docs/quick-start-guide.md
> > Make sure you have correctly configured carbon and spark.
> >
> > Thanks,
> > Lionel
> >
> > On Fri, Apr 7, 2017 at 3:16 PM, Srigopal Mohanty <
> > srigopalmoha...@gmail.com>
> > wrote:
> >
> > > Hi Team,
> > >
> > >   Getting an error after building the CarbonData git repo in the Cloudera
> > > distribution.
> > >
> > > in spark-shell console -
> > >
> > > :25: error: object CarbonContext is not a member of package
> > > org.apache.spark.sql
> > >  import org.apache.spark.sql.CarbonContext
> > >
> > >
> > > Any pointers.
> > >
> > > Thanks,
> > > Srigopal
> > >
> >
>


Re: Getting Error in Cloudera Distribution

2017-04-07 Thread Lu Cao
What's the spark version you're using?
Did you add "--jars " when you start the spark shell?

On Fri, Apr 7, 2017 at 3:38 PM, Srigopal Mohanty <srigopalmoha...@gmail.com>
wrote:

> Hi Lionel,
>
>   I followed the same link. I am using cloudera sandbox, which is
> preconfigured.
>
> CarbonData was cloned from the git URL specified, and the maven build was done
> as per the steps mentioned.
>
> Thanks,
> Srigopal
>
> On 07-Apr-2017 1:05 pm, "Lu Cao" <whuca...@gmail.com> wrote:
>
> Hi Srigopal,
> You can follow this:
> https://github.com/apache/incubator-carbondata/blob/
> master/docs/quick-start-guide.md
> Make sure you have correctly configured carbon and spark.
>
> Thanks,
> Lionel
>
> On Fri, Apr 7, 2017 at 3:16 PM, Srigopal Mohanty <
> srigopalmoha...@gmail.com>
> wrote:
>
> > Hi Team,
> >
> >   Getting an error after building the CarbonData git repo in the Cloudera
> > distribution.
> >
> > in spark-shell console -
> >
> > :25: error: object CarbonContext is not a member of package
> > org.apache.spark.sql
> >  import org.apache.spark.sql.CarbonContext
> >
> >
> > Any pointers.
> >
> > Thanks,
> > Srigopal
> >
>


Re: Getting Error in Cloudera Distribution

2017-04-07 Thread Lu Cao
Hi Srigopal,
You can follow this:
https://github.com/apache/incubator-carbondata/blob/master/docs/quick-start-guide.md
Make sure you have correctly configured carbon and spark.
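
A minimal spark-shell sketch of that flow, assuming the assembly jar is already
on the classpath (the store path is a placeholder):

  import org.apache.spark.sql.CarbonContext

  // sc is the SparkContext provided by spark-shell; the store path is a placeholder
  val cc = new CarbonContext(sc, "hdfs://localhost:9000/user/carbon/store")
  cc.sql("show tables").show()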

Thanks,
Lionel

On Fri, Apr 7, 2017 at 3:16 PM, Srigopal Mohanty 
wrote:

> Hi Team,
>
>   Getting an error after building the CarbonData git repo in the Cloudera
> distribution.
>
> in spark-shell console -
>
> :25: error: object CarbonContext is not a member of package
> org.apache.spark.sql
>  import org.apache.spark.sql.CarbonContext
>
>
> Any pointers.
>
> Thanks,
> Srigopal
>


Re: Re:Re: Re: Optimize Order By + Limit Query

2017-03-30 Thread Lu Cao
@Liang, yes, I'm currently working on the limit query optimization.
I get the limited dictionary values and convert them to a filter condition in
the CarbonOptimizer step.
It would definitely improve query performance in some scenarios.
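
Conceptually (plain Scala, not the actual CarbonOptimizer rule), the idea is to
take only the first `limit` dictionary values in sorted order and push them
down as an IN-style pre-filter so the scan can skip blocklets; the final
sort/limit still runs on the reduced set:

  // Conceptual sketch only -- not CarbonOptimizer code.
  // The dictionary maps surrogate keys to original values.
  val dictionary: Map[Int, String] = Map(1 -> "CN", 2 -> "IN", 3 -> "US")
  val limit = 2

  // First `limit` keys in original-value order become the pushed-down filter.
  val topKeys: Set[Int] =
    dictionary.toSeq.sortBy(_._2).take(limit).map(_._1).toSet

  // Rows are (dictionaryKey, measure); the pre-filter keeps only candidate rows.
  val rows = Seq((3, 20.0), (1, 10.0), (2, 5.0))
  val candidates = rows.filter { case (k, _) => topKeys.contains(k) }
  println(candidates)  // List((1,10.0), (2,5.0))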

On Thu, Mar 30, 2017 at 2:07 PM, Liang Chen  wrote:

> Hi
>
> +1 for simafengyun's optimization, it looks good to me.
>
> I propose to do the "limit" pushdown first, similar to the filter pushdown.
> What is your opinion? @simafengyun
>
> For the "order by" pushdown, let us work out an ideal solution that considers
> all aggregation pushdown cases. Ravindra's comment is reasonable: we need to
> consider decoupling Spark and CarbonData, otherwise the maintenance cost might
> be high if computing work is done on both sides, because we need to keep
> utilizing Spark's computing capability along with its version evolution.
>
> Regards
> Liang
>
>
> simafengyun wrote
> > Hi Ravindran,
> > Yes, Carbon does the sorting if the order by column is not the first
> > column. But its sorting is very fast, since the dimension data in a blocklet
> > is stored after sorting, so Carbon can use merge sort + top-N to get N
> > rows from each block. In addition, the biggest difference is that it can
> > reduce disk IO, since limit n can reduce the required blocklets. If you
> > only apply Spark's top N, I don't think you can reach the performance shown
> > below. That's impossible without reducing disk IO.
> >
> > [attached image: performance comparison screenshot]
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > At 2017-03-30 03:12:54, "Ravindra Pesala" ravi.pes...@gmail.com
> > wrote:
> >>Hi,
> >>
> >>You mean Carbon does the sorting if the order by column is not the first
> >>column and provides only the limit values to Spark. But Spark does the same
> >>job: it just sorts the partition and gets the top values out of it. You
> >>can reduce the table_blocksize to get better sort performance, as Spark
> >>tries to do the sorting in memory.
> >>
> >>I can see we can do some optimizations in the integration layer itself
> >>without pushing down any logic to Carbon, e.g. if the order by column is the
> >>first column then we can just get the limit values without sorting any data.
> >>
> >>Regards,
> >>Ravindra.
> >>
> >>On 29 March 2017 at 08:58, 马云 simafengyun1...@163.com wrote:
> >>
> >>> Hi Ravindran,
> >>> Thanks for your quick response. please see my answer as below
> >>> 
> >>>  What if the order by column is not the first column? It needs to scan
> >>> all
> >>> blocklets to get the data out of it if the order by column is not first
> >>> column of mdk
> >>> 
> >>> Answer: if step 2 doesn't filter any blocklet, you are right, it needs
> >>> to scan all blocklets to get the data out if the order by column is not
> >>> the first column of the MDK,
> >>> but it only scans all of the order by column's data; for
> >>> other columns' data it uses a lazy-load strategy and can reduce the scan
> >>> according to the limit value.
> >>> Hence you can see the performance is much better now
> >>> after my optimization. Currently the CarbonData order by + limit
> >>> performance is very bad since it scans all data.
> >>> In my test there are 20,000,000 rows; it takes more than
> >>> 10s. If the data is much larger, I think it is hard for users to stand
> >>> such bad performance when they do an order by + limit query.
> >>>
> >>>
> >>> 
> >>>  We used to have multiple push down optimizations from spark to carbon
> >>> like aggregation, limit, topn etc. But later it was removed because it
> >>> is
> >>> very hard to maintain for version to version. I feel it is better that
> >>> execution engine like spark can do these type of operations.
> >>> 
> >>> Answer: In my opinion, I don't think "hard to maintain for version to
> >>> version" is a good reason to give up the order by + limit optimization.
> >>> I think we can create a new class that extends the current one and try to
> >>> reduce the impact on the current code. Maybe that can make it easy to
> >>> maintain. Maybe I am wrong.
> >>>
> >>>
> >>>
> >>>
> >>> At 2017-03-29 02:21:58, "Ravindra Pesala" ravi.pes...@gmail.com
> 
> >>> wrote:
> >>>
> >>>
> >>> Hi Jarck Ma,
> >>>
> >>> It is great to try optimizing Carbondata.
> >>> I think this solution comes up with many limitations. What if the order
> >>> by
> >>> column is not the first column? It needs to scan all blocklets to get
> >>> the
> >>> data out of it if the order by column is not first column of mdk.
> >>>
> >>> We used to have multiple push down optimizations from spark to carbon
> >>> like
> >>> aggregation, limit, topn etc. But later it was removed because it is
> >>> very
> >>> 

A question about sort in carbon

2017-02-16 Thread Lu Cao
Hi dev team,
I have a question about the sort in carbon data.
When we have following query:
select country, area, name, salary from table_a order by country;
It seems Carbon decodes the country column from dictionary values to the
original values first, and then sorts by the original values.

My question: is the dictionary value order always the same as the original
value order?
Or, if we sorted the dictionary values first and then decoded them to the
original values, would that be a correct operation?

BTW: where can I see the algorithm of the dictionary encoding (class name or
file name)?
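
To make the question concrete, a small plain-Scala illustration (not Carbon
code): sorting by surrogate keys gives the same order as sorting by the
original values only if the dictionary assigns keys in value order:

  // Plain Scala illustration, not Carbon code.
  // An order-preserving dictionary: keys increase with the original values.
  val ordered   = Map("Brazil" -> 1, "China" -> 2, "India" -> 3)
  // A load-order dictionary: keys reflect arrival order, not value order.
  val unordered = Map("India" -> 1, "Brazil" -> 2, "China" -> 3)

  val rows = Seq("India", "Brazil", "China")
  def sortByKey(dict: Map[String, Int]): Seq[String] = rows.sortBy(dict)

  println(sortByKey(ordered))    // List(Brazil, China, India) -- matches value order
  println(sortByKey(unordered))  // List(India, Brazil, China) -- does not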

Thanks,
Lionel


Re: [jira] [Created] (CARBONDATA-559) Job failed at last step

2016-12-23 Thread Lu Cao
Hi team,
Could you help look into this issue?
I have attached the log in the Jira ticket.

Thanks & Best Regards,
Lionel

On Fri, Dec 23, 2016 at 5:47 PM, Cao, Lionel (JIRA)  wrote:

> Cao, Lionel created CARBONDATA-559:
> --
>
>  Summary: Job failed at last step
>  Key: CARBONDATA-559
>  URL: https://issues.apache.org/jira/browse/CARBONDATA-559
>  Project: CarbonData
>   Issue Type: Bug
>   Components: core
> Affects Versions: 0.2.0-incubating
>  Environment: carbon version: branch-0.2
> hadoop 2.4.0
> spark 1.6.0
> OS centOS
> Reporter: Cao, Lionel
>
>
> Hi team,
> My job always failed at the last step:
> It said the 'yarn' user doesn't have write access to the target data
> path (storeLocation).
> But I tested twice with 1 rows of data, and both succeeded. Could you help
> look into the log? Please refer to the attachment.
> Search for 'access=WRITE' to see the exception.
> Search for 'Exception' for other exceptions.
>
> thanks,
> Lionel
>
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>


Re: carbondata-0.2 load data failed in yarn mode

2016-12-23 Thread Lu Cao
Hi,
Thank you all. I found that the root cause is that the $SPARK_HOME/conf path
was reset by the CDM Spark service restart.
You are correct: putting the carbon.properties file on both the driver and the
executor side solves this problem.
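
For anyone hitting the same thing, a sketch of the kind of submit options
involved, assuming the carbon.properties.filepath property from the
installation guide is used (paths are placeholders):

  # Sketch only -- paths are placeholders
  spark-submit \
    --files /opt/carbon/conf/carbon.properties \
    --conf "spark.driver.extraJavaOptions=-Dcarbon.properties.filepath=/opt/carbon/conf/carbon.properties" \
    --conf "spark.executor.extraJavaOptions=-Dcarbon.properties.filepath=carbon.properties" \
    ...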

Thanks,
Lionel

On Fri, Dec 23, 2016 at 2:40 PM, manish gupta <tomanishgupt...@gmail.com>
wrote:

> Hi Lu Cao,
>
> The problem you are facing "Dictionary file is locked for updation" can
> also come when the path formation is incorrect for the dictionary files.
>
> You have to set carbon.properties file path both in driver and executor
> side. In spark application master executor logs you will find the path
> printed for dictionary files. Just validate that path with the one you have
> configured in carbon.properties file.
>
> Regards
> Manish Gupta
>
> On Fri, Dec 23, 2016 at 7:43 AM, Lu Cao <whuca...@gmail.com> wrote:
>
> > Hi team,
> > Looks like I've met the same problem where the dictionary file is locked.
> > Could you share what configuration changes you made?
> >
> > ERROR 23-12 09:55:26,222 - Executor task launch worker-0
> > java.lang.RuntimeException: Dictionary file vehsyspwrmod is locked for
> > updation. Please try after some time
> > at scala.sys.package$.error(package.scala:27)
> > at
> > org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerate
> > RDD$$anon$1.(CarbonGlobalDictionaryRDD.scala:353)
> > at
> > org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerate
> RDD.compute(
> > CarbonGlobalDictionaryRDD.scala:293)
> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> > at org.apache.spark.scheduler.Task.run(Task.scala:89)
> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(
> > ThreadPoolExecutor.java:1145)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > ThreadPoolExecutor.java:615)
> > at java.lang.Thread.run(Thread.java:745)
> > ERROR 23-12 09:55:26,223 - Executor task launch worker-7
> > java.lang.RuntimeException: Dictionary file vehindlightleft is locked for
> > updation. Please try after some time
> > at scala.sys.package$.error(package.scala:27)
> > at
> > org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerate
> > RDD$$anon$1.(CarbonGlobalDictionaryRDD.scala:353)
> > at
> > org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerate
> RDD.compute(
> > CarbonGlobalDictionaryRDD.scala:293)
> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> > at org.apache.spark.scheduler.Task.run(Task.scala:89)
> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(
> > ThreadPoolExecutor.java:1145)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > ThreadPoolExecutor.java:615)
> > at java.lang.Thread.run(Thread.java:745)
> > ERROR 23-12 09:55:26,223 - Executor task launch worker-4
> > java.lang.RuntimeException: Dictionary file vehwindowrearleft is locked
> for
> > updation. Please try after some time
> > at scala.sys.package$.error(package.scala:27)
> > at
> > org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerate
> > RDD$$anon$1.(CarbonGlobalDictionaryRDD.scala:353)
> > at
> > org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerate
> RDD.compute(
> > CarbonGlobalDictionaryRDD.scala:293)
> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> > at org.apache.spark.scheduler.Task.run(Task.scala:89)
> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(
> > ThreadPoolExecutor.java:1145)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > ThreadPoolExecutor.java:615)
> > at java.lang.Thread.run(Thread.java:745)
> > ERROR 23-12 09:55:26,226 - Exception in task 5.0 in stage 5.0 (TID 9096)
> > java.lang.RuntimeException: Dictionary file vehsyspwrmod is locked for
> > updation. Please try after some time
> > at scala.sys.package$.error(package.scala:27)
> > at
> > org.apach

Re: carbondata-0.2 load data failed in yarn mode

2016-12-22 Thread Lu Cao
Hi team,
Looks like I've met the same problem where the dictionary file is locked.
Could you share what configuration changes you made?

ERROR 23-12 09:55:26,222 - Executor task launch worker-0
java.lang.RuntimeException: Dictionary file vehsyspwrmod is locked for
updation. Please try after some time
at scala.sys.package$.error(package.scala:27)
at
org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD$$anon$1.(CarbonGlobalDictionaryRDD.scala:353)
at
org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD.compute(CarbonGlobalDictionaryRDD.scala:293)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
ERROR 23-12 09:55:26,223 - Executor task launch worker-7
java.lang.RuntimeException: Dictionary file vehindlightleft is locked for
updation. Please try after some time
at scala.sys.package$.error(package.scala:27)
at
org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD$$anon$1.(CarbonGlobalDictionaryRDD.scala:353)
at
org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD.compute(CarbonGlobalDictionaryRDD.scala:293)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
ERROR 23-12 09:55:26,223 - Executor task launch worker-4
java.lang.RuntimeException: Dictionary file vehwindowrearleft is locked for
updation. Please try after some time
at scala.sys.package$.error(package.scala:27)
at
org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD$$anon$1.(CarbonGlobalDictionaryRDD.scala:353)
at
org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD.compute(CarbonGlobalDictionaryRDD.scala:293)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
ERROR 23-12 09:55:26,226 - Exception in task 5.0 in stage 5.0 (TID 9096)
java.lang.RuntimeException: Dictionary file vehsyspwrmod is locked for
updation. Please try after some time
at scala.sys.package$.error(package.scala:27)
at
org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD$$anon$1.(CarbonGlobalDictionaryRDD.scala:353)
at
org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD.compute(CarbonGlobalDictionaryRDD.scala:293)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
ERROR 23-12 09:55:26,226 - Exception in task 13.0 in stage 5.0 (TID 9104)
java.lang.RuntimeException: Dictionary file vehwindowrearleft is locked for
updation. Please try after some time
at scala.sys.package$.error(package.scala:27)
at
org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD$$anon$1.(CarbonGlobalDictionaryRDD.scala:353)
at
org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD.compute(CarbonGlobalDictionaryRDD.scala:293)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at 

Re: [Carbondata-0.2.0-incubating][Issue Report] -- Select statement returns error when adding String column in where clause

2016-12-13 Thread Lu Cao
iwC$$iwC.(:51)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
at $iwC$$iwC$$iwC$$iwC.(:57)
at $iwC$$iwC$$iwC.(:59)
at $iwC$$iwC.(:61)
at $iwC.(:63)
at (:65)
at .(:69)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Exception occurred in query
execution.Please check logs.
at scala.sys.package$.error(package.scala:27)
at
org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.(CarbonScanRDD.scala:226)
at
org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

On Wed, Dec 14, 2016 at 10:18 AM, Lu Cao <whuca...@gmail.com> wrote:

> Hi,
> I just uploaded the data file to Baidu:
> 链接: https://pan.baidu.com/s/1slERWL3
> 密码: m7kj
>
> Thanks,
> Lionel
>
> On Wed, Dec 14, 2016 at 10:12 AM, Lu Cao <whuca...@gmail.com> wrote:
>
>> Hi Dev team,
>> As discussed this afternoon, I've changed back to 0.2.0 version for the
>> testing. Please ignore the former email about "error when save DF to
>> carbondata file", that's on master branch.
>>
>> 

Re: [Carbondata-0.2.0-incubating][Issue Report] -- Select statement returns error when adding String column in where clause

2016-12-13 Thread Lu Cao
Hi,
I just uploaded the data file to Baidu:
链接: https://pan.baidu.com/s/1slERWL3
密码: m7kj

Thanks,
Lionel

On Wed, Dec 14, 2016 at 10:12 AM, Lu Cao <whuca...@gmail.com> wrote:

> Hi Dev team,
> As discussed this afternoon, I've changed back to 0.2.0 version for the
> testing. Please ignore the former email about "error when save DF to
> carbondata file", that's on master branch.
>
> Spark version: 1.6.0
> System: Mac OS X El Capitan (10.11.6)
>
> [lucao]$ spark-shell --master local[*] --total-executor-cores 2
> --executor-memory 1g --num-executors 2 --jars ~/MyDev/hive-1.1.1/lib/mysql-c
> onnector-java-5.1.40-bin.jar
>
> In 0.2.0, I can successfully create table and load data into carbondata
> table
>
> scala> cc.sql("create table if not exists default.mycarbon_1(vin
> String, data_date String, work_model Double) stored by 'carbondata'")
>
> scala> cc.sql("load data inpath'test2.csv' into table
> default.mycarbon_1")
>
> I can successfully run below query:
>
>scala> cc.sql("select vin, count(*) from default.mycarbon_1 group
> by vin").show
>
> INFO  13-12 17:13:42,215 - Job 5 finished: show at :42, took
> 0.732793 s
>
> +-+---+
>
> |  vin|_c1|
>
> +-+---+
>
> |LSJW26760ES065247|464|
>
> |LSJW26760GS018559|135|
>
> |LSJW26761ES064611|104|
>
> |LSJW26761FS090787| 45|
>
> |LSJW26762ES051513| 40|
>
> |LSJW26762FS075036|434|
>
> |LSJW26763ES052363| 32|
>
> |LSJW26763FS088491|305|
>
> |LSJW26764ES064859|186|
>
> |LSJW26764FS078696| 40|
>
> |LSJW26765ES058651|171|
>
> |LSJW26765FS072633|191|
>
> |LSJW26765GS056837|467|
>
> |LSJW26766FS070308| 79|
>
> |LSJW26766GS050853|300|
>
> |LSJW26767FS069913|  8|
>
> |LSJW26767GS053454|286|
>
> |LSJW26768FS062811| 16|
>
> |LSJW26768GS051146| 97|
>
> |LSJW26769FS062722|424|
>
> +-+---+
>
> only showing top 20 rows
>
> The error occurred when I added the "vin" column to the where clause:
>
> scala> cc.sql("select vin, count(*) from default.mycarbon_1 where
> vin='LSJW26760ES065247' group by vin")
>
> +-+---+
>
> |  vin|_c1|
>
> +-+---+
>
> |LSJW26760ES065247|464|
>
> +-+---+
>
> >>> This one is OK... Actually, as I tested, the *first two values* in the
> top 20 rows usually succeeded, but most of the others returned an error.
>
> For example :
>
> scala> cc.sql("select vin, count(*) from default.mycarbon_1 where
> vin='LSJW26765GS056837' group by vin").show
>
> >>>Log is coming:
>
> 
>
>
> It is the same error I met on Dec. 6th. As I said in the WeChat group
> before:
>
>    When the data set is 1000 rows, the above error did not occur.
>
>    When the data set is 1M rows, some queries returned the error and some didn't.
>
>    When the data set is 1.9 billion rows, all tests returned the error.
>
>
> *### Attached the sample data set (1M rows) for your reference.*
>
> <<I sent this email yesterday afternoon but it was rejected by the Apache
> mail server because it was larger than 100 bytes, so I removed the sample
> data file from the attachment; if you need it, please reply with your
> personal email address.>>
>
> Looking forward to your response.
>
>
> Thanks & Best Regards,
>
> Lionel
>


[Carbondata-0.2.0-incubating][Issue Report] -- Select statement returns error when adding String column in where clause

2016-12-13 Thread Lu Cao
Hi Dev team,
As discussed this afternoon, I've changed back to 0.2.0 version for the
testing. Please ignore the former email about "error when save DF to
carbondata file", that's on master branch.

Spark version: 1.6.0
System: Mac OS X El Capitan (10.11.6)

[lucao]$ spark-shell --master local[*] --total-executor-cores 2
--executor-memory 1g --num-executors 2 --jars ~/MyDev/hive-1.1.1/lib/mysql-c
onnector-java-5.1.40-bin.jar

In 0.2.0, I can successfully create table and load data into carbondata
table

scala> cc.sql("create table if not exists default.mycarbon_1(vin
String, data_date String, work_model Double) stored by 'carbondata'")

scala> cc.sql("load data inpath'test2.csv' into table
default.mycarbon_1")

I can successfully run below query:

   scala> cc.sql("select vin, count(*) from default.mycarbon_1 group by
vin").show

INFO  13-12 17:13:42,215 - Job 5 finished: show at :42, took
0.732793 s

+-+---+

|  vin|_c1|

+-+---+

|LSJW26760ES065247|464|

|LSJW26760GS018559|135|

|LSJW26761ES064611|104|

|LSJW26761FS090787| 45|

|LSJW26762ES051513| 40|

|LSJW26762FS075036|434|

|LSJW26763ES052363| 32|

|LSJW26763FS088491|305|

|LSJW26764ES064859|186|

|LSJW26764FS078696| 40|

|LSJW26765ES058651|171|

|LSJW26765FS072633|191|

|LSJW26765GS056837|467|

|LSJW26766FS070308| 79|

|LSJW26766GS050853|300|

|LSJW26767FS069913|  8|

|LSJW26767GS053454|286|

|LSJW26768FS062811| 16|

|LSJW26768GS051146| 97|

|LSJW26769FS062722|424|

+-+---+

only showing top 20 rows

The error occurred when I added the "vin" column to the where clause:

scala> cc.sql("select vin, count(*) from default.mycarbon_1 where
vin='LSJW26760ES065247' group by vin")

+-+---+

|  vin|_c1|

+-+---+

|LSJW26760ES065247|464|

+-+---+

>>> This one is OK... Actually, as I tested, the *first two values* in the
top 20 rows usually succeeded, but most of the others returned an error.

For example :

scala> cc.sql("select vin, count(*) from default.mycarbon_1 where
vin='LSJW26765GS056837' group by vin").show

>>>Log is coming:




It is the same error I met on Dec. 6th. As I said in the WeChat group
before:

   When the data set is 1000 rows, the above error did not occur.

   When the data set is 1M rows, some queries returned the error and some didn't.

   When the data set is 1.9 billion rows, all tests returned the error.


*### Attached the sample data set (1M rows) for your reference.*

<>

Looking forward to your response.


Thanks & Best Regards,

Lionel


error when save DF to carbondata file

Hi Dev team,
I ran spark-shell in my local Spark standalone mode. It returned the error

 java.io.IOException: No input paths specified in job

when I was trying to save the DF to a carbondata file. Am I missing any
settings for the path?



==

scala> df.write.format("carbondata").option("tableName",
"MyCarbon1").option("compress", "true").option("useKettle",
"false").mode(SaveMode.Overwrite).save()

INFO  13-12 13:58:12,899 - main Query [

  CREATE TABLE IF NOT EXISTS DEFAULT.MYCARBON1

  (VIN STRING, DATA_DATE STRING, WORK_MODEL DOUBLE)

  STORED BY 'ORG.APACHE.CARBONDATA.FORMAT'

  ]

INFO  13-12 13:58:13,060 - Removed broadcast_0_piece0 on
localhost:56692 in memory (size: 19.5 KB, free: 143.2 MB)

INFO  13-12 13:58:13,081 - Parsing command:

  CREATE TABLE IF NOT EXISTS default.MyCarbon1

  (vin STRING, data_date STRING, work_model DOUBLE)

  STORED BY 'org.apache.carbondata.format'



INFO  13-12 13:58:14,008 - Parse Completed

AUDIT 13-12 13:58:14,326 - [lumac.local][lucao][Thread-1]Creating
Table with Database name [default] and Table name [mycarbon1]

INFO  13-12 13:58:14,335 - 0: get_tables: db=default pat=.*

INFO  13-12 13:58:14,335 - ugi=lucao ip=unknown-ip-addr
cmd=get_tables: db=default pat=.*

INFO  13-12 13:58:14,342 - main Table block size not specified for
default_mycarbon1. Therefore considering the default value 1024 MB

INFO  13-12 13:58:14,434 - Table mycarbon1 for Database default
created successfully.

INFO  13-12 13:58:14,434 - main Table mycarbon1 for Database default
created successfully.

INFO  13-12 13:58:14,440 - main Query [CREATE TABLE DEFAULT.MYCARBON1
USING CARBONDATA OPTIONS (TABLENAME "DEFAULT.MYCARBON1", TABLEPATH
"HDFS://LOCALHOST:9000/USER/LUCAO/DEFAULT/MYCARBON1") ]

INFO  13-12 13:58:14,452 - 0: get_table : db=default tbl=mycarbon1

INFO  13-12 13:58:14,452 - ugi=lucao ip=unknown-ip-addr cmd=get_table
: db=default tbl=mycarbon1

WARN  13-12 13:58:14,463 - Couldn't find corresponding Hive SerDe for
data source provider carbondata. Persisting data source relation
`default`.`mycarbon1` into Hive metastore in Spark SQL specific
format, which is NOT compatible with Hive.

INFO  13-12 13:58:14,588 - 0: create_table: Table(tableName:mycarbon1,
dbName:default, owner:lucao, createTime:1481608694, lastAccessTime:0,
retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col,
type:array, comment:from deserializer)], location:null,
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,
parameters:{tablePath=hdfs://localhost:9000/user/lucao/default/mycarbon1,
serialization.format=1, tableName=default.mycarbon1}), bucketCols:[],
sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[],
skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[],
parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=carbondata},
viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE,
privileges:PrincipalPrivilegeSet(userPrivileges:{},
groupPrivileges:null, rolePrivileges:null))

INFO  13-12 13:58:14,588 - ugi=lucao ip=unknown-ip-addr
cmd=create_table: Table(tableName:mycarbon1, dbName:default,
owner:lucao, createTime:1481608694, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array,
comment:from deserializer)], location:null,
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,
parameters:{tablePath=hdfs://localhost:9000/user/lucao/default/mycarbon1,
serialization.format=1, tableName=default.mycarbon1}), bucketCols:[],
sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[],
skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[],
parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=carbondata},
viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE,
privileges:PrincipalPrivilegeSet(userPrivileges:{},
groupPrivileges:null, rolePrivileges:null))

INFO  13-12 13:58:14,598 - Creating directory if it doesn't exist:
hdfs://localhost:9000/user/hive/warehouse/mycarbon1

AUDIT 13-12 13:58:14,717 - [lumac.local][lucao][Thread-1]Table created
with Database name [default] and Table name [mycarbon1]

INFO  13-12 13:58:14,767 - mapred.output.compress is deprecated.
Instead, use mapreduce.output.fileoutputformat.compress

INFO  13-12 13:58:14,767 - mapred.output.compression.codec is
deprecated. Instead, use
mapreduce.output.fileoutputformat.compress.codec

INFO  13-12 13:58:14,767 - 

Re: select returns error when filtering string column in where clause

k.scheduler.Task.run(Task.scala:89)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

On Tue, Dec 6, 2016 at 9:35 AM, Lu Cao <whuca...@gmail.com> wrote:
> Hi Dev team,
> I have loaded some data into a carbondata table. But when I put the id
> column (String type) in the where clause, it always returns an error as below:
>
> cc.sql("select to_date(data_date),count(*) from default.carbontest_001
> where id='LSJW26762FS044062' group by to_date(data_date)").show
>
>
>
> ===
> WARN  06-12 09:02:13,763 - Lost task 5.0 in stage 44.0 (TID 687,
> .com): java.lang.RuntimeException: Exception occurred in query
> execution.Please check logs.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.(CarbonScanRDD.scala:226)
> at 
> org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> ERROR 06-12 09:02:14,091 - Task 1 in stage 44.0 failed 4 times; aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 1 in stage 44.0 failed 4 times, most recent failure: Lost task
> 1.3 in stage 44.0 (TID 694, scsp00258.saicdt.com):
> java.lang.RuntimeException: Exception occurred in query
> execution.Please check logs.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.(CarbonScanRDD.scala:226)
> at 
> org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(

select returns error when filtering string column in where clause

Hi Dev team,
I have loaded some data into a carbondata table. But when I put the id
column (String type) in the where clause, it always returns an error as below:

cc.sql("select to_date(data_date),count(*) from default.carbontest_001
where id='LSJW26762FS044062' group by to_date(data_date)").show



===
WARN  06-12 09:02:13,763 - Lost task 5.0 in stage 44.0 (TID 687,
.com): java.lang.RuntimeException: Exception occurred in query
execution.Please check logs.
at scala.sys.package$.error(package.scala:27)
at 
org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.(CarbonScanRDD.scala:226)
at 
org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

ERROR 06-12 09:02:14,091 - Task 1 in stage 44.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 44.0 failed 4 times, most recent failure: Lost task
1.3 in stage 44.0 (TID 694, scsp00258.saicdt.com):
java.lang.RuntimeException: Exception occurred in query
execution.Please check logs.
at scala.sys.package$.error(package.scala:27)
at 
org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.(CarbonScanRDD.scala:226)
at 
org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 

carbondata loading

Hi dev team,
I'm loading data from a parquet file into a carbondata file (the DF reads the
parquet, saves it to CSV, and then loads it into the carbondata file). The job
is blocked at "collect at CarbonDataRDDFactory.scala:963":



Job Id  Description                                 Submitted            Duration  Stages: Succeeded/Total  Tasks (for all stages): Succeeded/Total
6       collect at CarbonDataRDDFactory.scala:963   2016/12/01 13:56:43  3.1 h     0/1                      0/2

Completed Jobs (6)

Job Id  Description                                 Submitted            Duration  Stages: Succeeded/Total  Tasks (for all stages): Succeeded/Total
5       collect at GlobalDictionaryUtil.scala:800   2016/12/01 13:34:25  22 min    2/2                      422/422
4       take at CarbonCsvRelation.scala:181         2016/12/01 13:34:25  0.1 s     1/1                      1/1
3       saveAsTextFile at package.scala:169         2016/12/01 13:11:02  23 min    1/1                      50/50
2       count at SaicSparkConvert.scala:40          2016/12/01 13:10:31  31 s      2/2                      51/51
1       parquet at SaicSparkConvert.scala:35        2016/12/01 13:10:28  1 s       1/1                      2/2
0       parquet at SaicSparkConvert.scala:35        2016/12/01 13:10:26  2 s       1/1                      2/2


I looked into the stdout; the logs are all the same warning:


WARN  01-12 13:56:46,096 - [pool-25-thread-5][partitionID:carbontest]
Cannot convert : null to Numeric type value. Value considered as null.

WARN  01-12 13:56:46,096 - [pool-25-thread-4][partitionID:carbontest]
Cannot convert : null to Numeric type value. Value considered as null.

... (the same warning repeats for the other pool threads) ...

My configuration is:

--master yarn-cluster
--driver-memory 8g
--executor-memory 120g
--num-executors 3


Any idea for this? Is it caused by data type?


Thanks,

Lionel


Re: carbon data

Thank you for the response, Liang. I think I have followed the example but it
still returns an error:
   Data loading failed. table not found: default.carbontest
I attached my code below: I read data from a Hive table with HiveContext,
convert it to a CarbonContext, then generate the DF and save it to HDFS. I'm
not sure whether it's correct to generate the dataframe with
sc.parallelize(sc.Files, 25). Do you have any other method we can use to
generate the DF?

object SparkConvert {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CarbonTest")
    val sc = new SparkContext(conf)
    val path = "hdfs:///user/appuser/lucao/CarbonTest_001.carbon"
    val hqlContext = new HiveContext(sc)
    val df = hqlContext.sql("select * from default.test_data_all")
    println("the count is:" + df.count())
    val cc = createCarbonContext(df.sqlContext.sparkContext, path)
    writeDataFrame(cc, "CarbonTest", SaveMode.Append)
  }

  def createCarbonContext(sc: SparkContext, storePath: String): CarbonContext = {
    val cc = new CarbonContext(sc, storePath)
    cc
  }

  def writeDataFrame(cc: CarbonContext, tableName: String, mode: SaveMode): Unit = {
    import cc.implicits._
    val sc = cc.sparkContext
    val df = sc.parallelize(sc.files, 25).toDF("col1", "col2", "col3", ..., "coln")
    df.write
      .format("carbondata")
      .option("tableName", tableName)
      .option("compress", "true")
      .mode(mode)
      .save()
  }

}
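
For comparison, a minimal sketch that skips the parallelize step and writes the
queried DataFrame directly, assuming the source Hive table is visible to the
CarbonContext (it builds on HiveContext); the store path is the same one used
above:

  // Sketch only: read the source table through the CarbonContext and write it out directly.
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.{CarbonContext, SaveMode}

  val sc = new SparkContext(new SparkConf().setAppName("CarbonTest"))
  val cc = new CarbonContext(sc, "hdfs:///user/appuser/lucao/CarbonTest_001.carbon")
  val df = cc.sql("select * from default.test_data_all")
  df.write
    .format("carbondata")
    .option("tableName", "CarbonTest")
    .option("compress", "true")
    .mode(SaveMode.Append)
    .save()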


carbon data

Hi team,
I'm trying to save a Spark DataFrame to a carbondata file. I see in the example
in your wiki:
option("tableName", "carbontable"). Does that mean I have to create a
carbondata table first and then save data into the table? Can I save it
directly without creating the carbondata table?

The code is:
df.write.format("carbondata").mode(SaveMode.Append).save("hdfs:///user//data.carbon")

BTW, do you have a formal API doc?

Thanks,
Lionel