[jira] [Created] (CARBONDATA-839) Table lock file is not getting deleted after table rename is successful

2017-03-30 Thread Kunal Kapoor (JIRA)
Kunal Kapoor created CARBONDATA-839:
---

 Summary: Table lock file is not getting deleted after table rename 
is successful
 Key: CARBONDATA-839
 URL: https://issues.apache.org/jira/browse/CARBONDATA-839
 Project: CarbonData
  Issue Type: Bug
Reporter: Kunal Kapoor
Assignee: Kunal Kapoor
Priority: Minor


Table lock file is not getting deleted after table rename is successful



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-838) Alter table add decimal column with default precision and scale is failing in parser.

2017-03-30 Thread Naresh P R (JIRA)
Naresh P R created CARBONDATA-838:
-

 Summary: Alter table add decimal column with default precision and 
scale is failing in parser.
 Key: CARBONDATA-838
 URL: https://issues.apache.org/jira/browse/CARBONDATA-838
 Project: CarbonData
  Issue Type: Bug
Reporter: Naresh P R
Assignee: Naresh P R
Priority: Minor


When we add a new decimal column without specifying scale and precision, the alter 
table command fails in the parser.

e.g., alter table test1 add columns(dcmlcol decimal)
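
A hedged sketch of a possible workaround (specifying precision and scale
explicitly; the exact values are illustrative, assuming the parser accepts the
explicit form):

scala> cc.sql("alter table test1 add columns(dcmlcol decimal(10,0))")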



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Need help in configuring dataload.properties

2017-03-30 Thread Liang Chen
Hi

Please refer to :
https://github.com/apache/incubator-carbondata/blob/master/docs/installation-guide.md
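
For the delimiter question specifically, a minimal load sketch (hypothetical HDFS
path and table name; DELIMITER and QUOTECHAR are load options described in
dml-operation-on-carbondata.md):

scala> cc.sql("LOAD DATA INPATH 'hdfs:///user/test/sample.csv' INTO TABLE test_table " +
         "OPTIONS('DELIMITER'=',', 'QUOTECHAR'='\"')")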

Regards
Liang

2017-03-30 19:19 GMT+05:30 Srinath Thota :

> Hi Team,
>
>
> I have configured Carbon in spark standalone mode as per the documents and
> available web references.
>
> I tried to create and load a table using the reference "https://github.com/
> apache/incubator-carbondata/blob/master/docs/quick-start-guide.md". The
> data did not get split on ",". Data got loaded into the table as shown below.
>
>
>
>
> Later, from the code, I found that I need to configure dataload.properties.
>
> Need help in setting it up.
>
>
>
>
> Thanks &  Regards,
>
> Srinath Thota
>
> L Technology Services
>
> Mobile# +91-91601-61716
> E-mail: srinath.th...@outlook.com
>
>


-- 
Regards
Liang


Need help in configuring dataload.properties

2017-03-30 Thread Srinath Thota
Hi Team,


I have configured Carbon in spark standalone mode as per the documents and 
available web references.

I tried to create and load a table using the reference 
"https://github.com/apache/incubator-carbondata/blob/master/docs/quick-start-guide.md".
The data did not get split on ",". Data got loaded into the table as shown below.


[inline screenshot not included in the archive]


Later, from the code, I found that I need to configure dataload.properties.

Need help in setting it up.




Thanks &  Regards,

Srinath Thota

L Technology Services

Mobile# +91-91601-61716

E-mail: srinath.th...@outlook.com


Re:Re:Re: Load data into carbondata executors distributed unevenly

2017-03-30 Thread a
Yes, it is. Babu, thanks for your help!


Best regards!
At 2017-03-30 17:32:07, "babu lal jangir"  wrote:
>Hi
>Please refer to the JIRA id below. I guess your issue is the same.
>CARBONDATA-830
>(Data loading scheduling has some issues)
>
>Thanks
>Babu
>On Mar 30, 2017 12:26, "a"  wrote:
>
>> add attachments
>>
>>
>> At 2017-03-30 10:38:08, "Ravindra Pesala"  wrote:
>> >Hi,
>> >
>> >It seems attachments are missing. Can you attach them again?
>> >
>> >Regards,
>> >Ravindra.
>> >
>> >On 30 March 2017 at 08:02, a  wrote:
>> >
>> >> Hello!
>> >>
>> >> *Test result:*
>> >> When I load csv data into the carbondata table 3 times, the executors
>> >> are distributed unevenly. My purpose is one node one task, but the result
>> >> is that some nodes have 2 tasks and some nodes have no task.
>> >> See load data 1.png, data 2.png and data 3.png.
>> >> The carbondata data.PNG is the data structure in hadoop.
>> >>
>> >> I load 4   records into the carbondata table and it takes 2629 seconds,
>> >> which is too long.
>> >>
>> >> *Question:*
>> >> How can I make the executors distributed evenly?
>> >>
>> >> The environment:
>> >> spark2.1 + carbondata1.1, there are 7 datanodes.
>> >>
>> >> ./bin/spark-shell \
>> >>   --master yarn \
>> >>   --deploy-mode client \
>> >>   --num-executors n \   (the first time 7 (result in load data 1.png), the
>> >>                          second time 6 (result in load data 2.png), the third
>> >>                          time 8 (result in load data 3.png))
>> >>   --executor-cores 10 \
>> >>   --executor-memory 40G \
>> >>   --driver-memory 8G
>> >>
>> >> carbon.properties
>> >>  DataLoading Configuration 
>> >> carbon.sort.file.buffer.size=20
>> >> carbon.graph.rowset.size=1
>> >> carbon.number.of.cores.while.loading=10
>> >> carbon.sort.size=5
>> >> carbon.number.of.cores.while.compacting=10
>> >> carbon.number.of.cores=10
>> >>
>> >> Best regards!
>> >
>> >--
>> >Thanks & Regards,
>> >Ravi
>>
>>


[jira] [Created] (CARBONDATA-837) Unable to delete records from carbondata table

2017-03-30 Thread Sanoj MG (JIRA)
Sanoj MG created CARBONDATA-837:
---

 Summary: Unable to delete records from carbondata table
 Key: CARBONDATA-837
 URL: https://issues.apache.org/jira/browse/CARBONDATA-837
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Affects Versions: 1.1.0-incubating
 Environment: HDP 2.5, Spark 1.6.2
Reporter: Sanoj MG
Priority: Minor
 Fix For: NONE


As per the document below, I am trying to delete entries from the table:
https://github.com/apache/incubator-carbondata/blob/master/docs/dml-operation-on-carbondata.md


scala> cc.sql("select * from accountentity").count
res10: Long = 391351

scala> cc.sql("delete from accountentity")
INFO  30-03 09:03:03,099 - main Query [DELETE FROM ACCOUNTENTITY]
INFO  30-03 09:03:03,104 - Parsing command: select tupleId from accountentity
INFO  30-03 09:03:03,104 - Parse Completed
INFO  30-03 09:03:03,105 - Parsing command: select tupleId from accountentity
INFO  30-03 09:03:03,105 - Parse Completed
res11: org.apache.spark.sql.DataFrame = []

scala> cc.sql("select * from accountentity").count
res12: Long = 391351

The records get deleted only when an action such as show() is applied.

scala> cc.sql("delete from accountentity").show
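
A minimal follow-up check in the same spirit (the zero count afterwards is an
assumption, not output captured from the reporter's session):

scala> cc.sql("delete from accountentity").show()  // the action triggers the actual delete
scala> cc.sql("select * from accountentity").count // expected to drop to 0 once the delete runs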





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Deleting records from carbondata table

2017-03-30 Thread sounak
Hi Sanoj,

Yes, you can open a JIRA for this.

Thanks
Sounak

On Thu, Mar 30, 2017 at 3:54 PM, Sanoj MG 
wrote:

> Hi Sounak,
>
> It is getting deleted when I add an action, such as show() as you
> suggested.
>
> +-----------------+-----------------+--------------------+--------------------+
> |SegmentSequenceId|           Status|     Load Start Time|       Load End Time|
> +-----------------+-----------------+--------------------+--------------------+
> |                2|Marked for Delete|2017-03-30 10:26:...|2017-03-30 10:26:...|
> |                1|Marked for Delete|2017-03-30 09:01:...|2017-03-30 09:01:...|
> +-----------------+-----------------+--------------------+--------------------+
>
>
> It looks like an issue to me; if not, the documentation may need an update
> to indicate this. Can I raise a JIRA?
>
>
> Thanks
>
>
>
>
>
>
>
> On Thu, Mar 30, 2017 at 1:29 PM, sounak  wrote:
>
> > Hi Sanoj,
> >
> > Can you please try
> >
> > sql("delete from accountentity").show()
> >
> > Thanks
> > Sounak
> >
> > On Thu, Mar 30, 2017 at 2:31 PM, Sanoj MG 
> > wrote:
> >
> > > Hi All,
> > >
> > > Is deleting records from a carbondata table supported?
> > >
> > >
> > > As per the doc below, I am trying to delete entries from the table:
> > > https://github.com/apache/incubator-carbondata/blob/
> > > master/docs/dml-operation-on-carbondata.md
> > >
> > >
> > > scala> cc.sql("select * from accountentity").count
> > > res10: Long = 391351
> > >
> > > scala> cc.sql("delete from accountentity")
> > > INFO  30-03 09:03:03,099 - main Query [DELETE FROM ACCOUNTENTITY]
> > > INFO  30-03 09:03:03,104 - Parsing command: select tupleId from
> > > accountentity
> > > INFO  30-03 09:03:03,104 - Parse Completed
> > > INFO  30-03 09:03:03,105 - Parsing command: select tupleId from
> > > accountentity
> > > INFO  30-03 09:03:03,105 - Parse Completed
> > > res11: org.apache.spark.sql.DataFrame = []
> > >
> > > scala> cc.sql("select * from accountentity").count
> > > res12: Long = 391351
> > >
> > > Is deletion some sort of lazy operation?
> > >
> >
> >
> >
> > --
> > Thanks
> > Sounak
> >
>



-- 
Thanks
Sounak


Re: Deleting records from carbondata table

2017-03-30 Thread Sanoj MG
Hi Sounak,

It is getting deleted when I add an action, such as show() as you
suggested.

+-----------------+-----------------+--------------------+--------------------+
|SegmentSequenceId|           Status|     Load Start Time|       Load End Time|
+-----------------+-----------------+--------------------+--------------------+
|                2|Marked for Delete|2017-03-30 10:26:...|2017-03-30 10:26:...|
|                1|Marked for Delete|2017-03-30 09:01:...|2017-03-30 09:01:...|
+-----------------+-----------------+--------------------+--------------------+
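
For reference, a hedged sketch of how a segment-status listing like the one above
can be produced (SHOW SEGMENTS is CarbonData DDL; the table name is taken from
this thread):

scala> cc.sql("show segments for table accountentity").show()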


It looks like an issue to me; if not, the documentation may need an update
to indicate this. Can I raise a JIRA?


Thanks







On Thu, Mar 30, 2017 at 1:29 PM, sounak  wrote:

> Hi Sanoj,
>
> Can you please try
>
> sql("delete from accountentity").show()
>
> Thanks
> Sounak
>
> On Thu, Mar 30, 2017 at 2:31 PM, Sanoj MG 
> wrote:
>
> > Hi All,
> >
> > Is deleting records from a carbondata table supported?
> >
> >
> > As per the doc below, I am trying to delete entries from the table:
> > https://github.com/apache/incubator-carbondata/blob/
> > master/docs/dml-operation-on-carbondata.md
> >
> >
> > scala> cc.sql("select * from accountentity").count
> > res10: Long = 391351
> >
> > scala> cc.sql("delete from accountentity")
> > INFO  30-03 09:03:03,099 - main Query [DELETE FROM ACCOUNTENTITY]
> > INFO  30-03 09:03:03,104 - Parsing command: select tupleId from
> > accountentity
> > INFO  30-03 09:03:03,104 - Parse Completed
> > INFO  30-03 09:03:03,105 - Parsing command: select tupleId from
> > accountentity
> > INFO  30-03 09:03:03,105 - Parse Completed
> > res11: org.apache.spark.sql.DataFrame = []
> >
> > scala> cc.sql("select * from accountentity").count
> > res12: Long = 391351
> >
> > Is deletion some sort of lazy operation?
> >
>
>
>
> --
> Thanks
> Sounak
>


Re: Re:Re: Re: Optimize Order By + Limit Query

2017-03-30 Thread Lu Cao
@Liang, Yes, actually I'm currently working on the limit query optimization.
I get the limited dictionary values and convert them into a filter condition in
the CarbonOptimizer step.
It would definitely improve query performance in some scenarios.

On Thu, Mar 30, 2017 at 2:07 PM, Liang Chen  wrote:

> Hi
>
> +1 for simafengyun's optimization, it looks good to me.
>
> I propose to do "limit" pushdown first, similar to filter pushdown. What
> is your opinion? @simafengyun
>
> For "order by" pushdown, let us work out an ideal solution that considers all
> aggregation push down cases. Ravindra's comment is reasonable: we need to
> consider decoupling spark and carbondata, otherwise maintenance cost might
> be high if we do computing work on both sides, because we need to keep
> utilizing Spark's computing capability along with its version evolution.
>
> Regards
> Liang
>
>
> simafengyun wrote
> > Hi Ravindran,
> > Yes, carbon does the sorting if the order by column is not the first
> > column. But its sorting efficiency is very high, since the dimension data
> > in a blocklet is stored already sorted. So carbon can use merge sort +
> > top-N to get N rows from each block. In addition, the biggest difference
> > is that it can reduce disk IO, since limit n can reduce the required
> > blocklets. If you only apply spark's top N, I don't think you can reach
> > such performance as shown below. That's impossible without reducing disk IO.
> > [inline performance screenshot not included in the archive]
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > At 2017-03-30 03:12:54, "Ravindra Pesala" ravi.pes...@gmail.com
> > wrote:
> >>Hi,
> >>
> >>You mean Carbon does the sorting if the order by column is not the first
> >>column and provides only the limit values to spark. But spark is also doing
> >>the same job; it just sorts the partition and gets the top values out of it.
> >>You can reduce the table_blocksize to get better sort performance, as spark
> >>tries to do the sorting in memory.
> >>
> >>I can see we can do some optimizations in the integration layer itself without
> >>pushing down any logic to carbon, like: if the order by column is the first
> >>column, then we can just get the limit values without sorting any data.
> >>
> >>Regards,
> >>Ravindra.
> >>
> >>On 29 March 2017 at 08:58, 马云 simafengyun1...@163.com wrote:
> >>
> >>> Hi Ravindran,
> >>> Thanks for your quick response. please see my answer as below
> >>> 
> >>>  What if the order by column is not the first column? It needs to scan
> >>> all
> >>> blocklets to get the data out of it if the order by column is not first
> >>> column of mdk
> >>> 
> >>> Answer : if step2 doesn't filter any blocklet, you are right, it needs to
> >>> scan all blocklets to get the data out of them if the order by column is
> >>> not the first column of the mdk,
> >>> but it just scans all of the order by column's data; for the
> >>> other columns' data, it uses the lazy-load strategy and can reduce the scan
> >>> according to the limit value.
> >>> Hence you can see the performance is much better now
> >>> after my optimization. Currently the carbondata order by + limit
> >>> performance is very bad since it scans all data.
> >>> In my test there are 20,000,000 rows and it takes more than
> >>> 10s; if the data is much larger, I think it is hard for users to stand
> >>> such bad performance when they do an order by + limit query.
> >>>
> >>>
> >>> 
> >>>  We used to have multiple push down optimizations from spark to carbon,
> >>> like aggregation, limit, topn etc. But later they were removed because they
> >>> are very hard to maintain from version to version. I feel it is better that
> >>> an execution engine like spark does these types of operations.
> >>> 
> >>> Answer : In my opinion, I don't think "hard to maintain from version to
> >>> version" is a good reason to give up the order by + limit optimization.
> >>> I think we can create a new class that extends the current one and try to
> >>> reduce the impact on the current code. Maybe that can make it easy to
> >>> maintain. Maybe I am wrong.
> >>>
> >>>
> >>>
> >>>
> >>> At 2017-03-29 02:21:58, "Ravindra Pesala" ravi.pes...@gmail.com
> 
> >>> wrote:
> >>>
> >>>
> >>> Hi Jarck Ma,
> >>>
> >>> It is great to try optimizing Carbondata.
> >>> I think this solution comes up with many limitations. What if the order
> >>> by
> >>> column is not the first column? It needs to scan all blocklets to get
> >>> the
> >>> data out of it if the order by column is not first column of mdk.
> >>>
> >>> We used to have multiple push down optimizations from spark to carbon
> >>> like
> >>> aggregation, limit, topn etc. But later it was removed because it is
> >>> very
> >>> 

Re:Re: Load data into carbondata executors distributed unevenly

2017-03-30 Thread babu lal jangir
Hi
Please refer to the JIRA id below. I guess your issue is the same.
CARBONDATA-830
(Data loading scheduling has some issues)

Thanks
Babu
On Mar 30, 2017 12:26, "a"  wrote:

> add attachments
>
>
> At 2017-03-30 10:38:08, "Ravindra Pesala"  wrote:
> >Hi,
> >
> >It seems attachments are missing. Can you attach them again?
> >
> >Regards,
> >Ravindra.
> >
> >On 30 March 2017 at 08:02, a  wrote:
> >
> >> Hello!
> >>
> >> *Test result:*
> >> When I load csv data into the carbondata table 3 times, the executors
> >> are distributed unevenly. My purpose is one node one task, but the result
> >> is that some nodes have 2 tasks and some nodes have no task.
> >> See load data 1.png, data 2.png and data 3.png.
> >> The carbondata data.PNG is the data structure in hadoop.
> >>
> >> I load 4   records into the carbondata table and it takes 2629 seconds,
> >> which is too long.
> >>
> >> *Question:*
> >> How can I make the executors distributed evenly?
> >>
> >> The environment:
> >> spark2.1 + carbondata1.1, there are 7 datanodes.
> >>
> >> ./bin/spark-shell \
> >>   --master yarn \
> >>   --deploy-mode client \
> >>   --num-executors n \   (the first time 7 (result in load data 1.png), the
> >>                          second time 6 (result in load data 2.png), the third
> >>                          time 8 (result in load data 3.png))
> >>   --executor-cores 10 \
> >>   --executor-memory 40G \
> >>   --driver-memory 8G
> >>
> >> carbon.properties
> >>  DataLoading Configuration 
> >> carbon.sort.file.buffer.size=20
> >> carbon.graph.rowset.size=1
> >> carbon.number.of.cores.while.loading=10
> >> carbon.sort.size=5
> >> carbon.number.of.cores.while.compacting=10
> >> carbon.number.of.cores=10
> >>
> >> Best regards!
> >
> >--
> >Thanks & Regards,
> >Ravi
>
>


Re: Deleting records from carbondata table

2017-03-30 Thread sounak
Hi Sanoj,

Can you please try

sql("delete from accountentity").show()

Thanks
Sounak

On Thu, Mar 30, 2017 at 2:31 PM, Sanoj MG 
wrote:

> Hi All,
>
> Is deleting records from a carbondata table supported?
>
>
> As per the doc below, I am trying to delete entries from the table:
> https://github.com/apache/incubator-carbondata/blob/
> master/docs/dml-operation-on-carbondata.md
>
>
> scala> cc.sql("select * from accountentity").count
> res10: Long = 391351
>
> scala> cc.sql("delete from accountentity")
> INFO  30-03 09:03:03,099 - main Query [DELETE FROM ACCOUNTENTITY]
> INFO  30-03 09:03:03,104 - Parsing command: select tupleId from
> accountentity
> INFO  30-03 09:03:03,104 - Parse Completed
> INFO  30-03 09:03:03,105 - Parsing command: select tupleId from
> accountentity
> INFO  30-03 09:03:03,105 - Parse Completed
> res11: org.apache.spark.sql.DataFrame = []
>
> scala> cc.sql("select * from accountentity").count
> res12: Long = 391351
>
> Is deletion some sort of lazy operation?
>



-- 
Thanks
Sounak


Deleting records from carbondata table

2017-03-30 Thread Sanoj MG
Hi All,

Is deleting records from a carbondata table supported?


As per the doc below, I am trying to delete entries from the table:
https://github.com/apache/incubator-carbondata/blob/master/docs/dml-operation-on-carbondata.md


scala> cc.sql("select * from accountentity").count
res10: Long = 391351

scala> cc.sql("delete from accountentity")
INFO  30-03 09:03:03,099 - main Query [DELETE FROM ACCOUNTENTITY]
INFO  30-03 09:03:03,104 - Parsing command: select tupleId from
accountentity
INFO  30-03 09:03:03,104 - Parse Completed
INFO  30-03 09:03:03,105 - Parsing command: select tupleId from
accountentity
INFO  30-03 09:03:03,105 - Parse Completed
res11: org.apache.spark.sql.DataFrame = []

scala> cc.sql("select * from accountentity").count
res12: Long = 391351

Is deletion some sort of lazy operation?


[jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

2017-03-30 Thread Sanoj MG (JIRA)
Sanoj MG created CARBONDATA-836:
---

 Summary: Error in load using dataframe  - columns containing comma
 Key: CARBONDATA-836
 URL: https://issues.apache.org/jira/browse/CARBONDATA-836
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Affects Versions: 1.1.0-incubating
 Environment: HDP sandbox 2.5, Spark 1.6.2
Reporter: Sanoj MG
Priority: Minor
 Fix For: NONE


While trying to load data into a CarbonData table using a dataframe, columns 
containing commas are not properly loaded.

Eg: 
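
(For reproduction, a sketch of how the dataframe shown below could be constructed;
the column types and the use of spark-shell implicits are assumptions:)

scala> import sqlContext.implicits._
scala> val df = Seq((2, 1, "Main Branch", ", Dubai, UAE", "UHO", 256)).toDF("Country", "Branch", "Name", "Address", "ShortName", "Status")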
scala> df.show(false)
+-------+------+-----------+------------+---------+------+
|Country|Branch|Name       |Address     |ShortName|Status|
+-------+------+-----------+------------+---------+------+
|2      |1     |Main Branch|, Dubai, UAE|UHO      |256   |
+-------+------+-----------+------------+---------+------+


scala>  df.write.format("carbondata").option("tableName", 
"Branch1").option("compress", "true").mode(SaveMode.Overwrite).save()


scala> cc.sql("select * from branch1").show(false)

+-------+------+-----------+-------+---------+------+
|country|branch|name       |address|shortname|status|
+-------+------+-----------+-------+---------+------+
|2      |1     |Main Branch|       | Dubai   |null  |
+-------+------+-----------+-------+---------+------+






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Re: Optimize Order By + Limit Query

2017-03-30 Thread simafengyun
add images for my last post


1. About how this works for dictionary columns and guarantees a correct sort
result
*Answer:*  As far as I know, dictionary allocation is in sorted order at the block
level, even at the segment level, just not at the global level.
But that's enough; please see the example below about how to get the correct
sort result.

 

2. About how to reduce IO


 



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Optimize-Order-By-Limit-Query-tp9764p9863.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.


Re:Re: Re: Re: Optimize Order By + Limit Query

2017-03-30 Thread 马云
Hi, 
Please see my answer.

1. It cannot work for dictionary columns. As there is no guarantee that 
dictionary allocation is in sorted order. 


Answer:  As far as I know, dictionary allocation is in sorted order at the block 
level, even at the segment level, just not at the global level.
But that's enough; please see the example below about how to get the correct 
sort result.

2. It cannot work for no inverted index columns. 


Answer: One question: suppose that, by default, the user did not configure a 
dimension column with no inverted index when creating the table, but after data 
loading it has no inverted index indeed. In this case, is the data still in 
sorted order at the blocklet (or DataChunk2) level?
If yes, its physical row id should equal the logical row id, and of course it is 
not necessary to create an inverted index.
Hence, by default, the solution should work for this kind of case.

3. It cannot work for measures. 

Answer: Yes, you are right; measure columns cannot be supported since they are not 
stored in sorted order.

Moreover, you mentioned that it can reduce IO, but I don't think we can 
reduce any IO since we need to read all blocklets to do the merge sort. And I 
am not sure how we can keep all the data in memory until we do the merge sort. 


Answer: We don't need to load all selected columns' data into memory;
we only need to load the order-by column's data into memory. Other columns can 
use lazy loading.
We can also set an orderby-optimization flag; if users think they do not have 
enough memory, they can switch it off.
About reducing IO, there are 2 ways:
1. Sort blocklets according to the order-by dimensions' min (or max) values,
then use the limit value to get only the required blocklets for the scan.
2. Reduce the IO for the non-order-by columns' data, since they 


I still believe that this work belongs to the execution engine, not the file 
format. This type of specific improvement may give good performance in 
some specific types of queries, but it will give long term complications 
in maintainability. 








At 2017-03-30 11:58:06, "Ravindra Pesala"  wrote:
>Hi,
>
>It comes up with many limitations
>1. It cannot work for dictionary columns. As there is no guarantee that
>dictionary allocation is in sorted order.
>2. It cannot work for no inverted index columns.
>3. It cannot work for measures.
>
>Moreover, you mentioned that it can reduce IO, but I don't think we can
>reduce any IO since we need to read all blocklets to do the merge sort. And I
>am not sure how we can keep all the data in memory until we do the merge sort.
>I still believe that this work belongs to the execution engine, not the file
>format. This type of specific improvement may give good performance in
>some specific types of queries, but it will give long term complications
>in maintainability.
>
>
>Regards,
>Ravindra.
>
>On 30 March 2017 at 08:23, 马云  wrote:
>
>> Hi Ravindran,
>>
>> Yes, carbon does the sorting if the order by column is not the first column.
>>
>> But its sorting efficiency is very high, since the dimension data in a blocklet 
>> is stored already sorted.
>>
>> So carbon can use merge sort + top-N to get N rows from each block.
>>
>> In addition, the biggest difference is that it can reduce disk IO, since limit n 
>> can reduce the required blocklets.
>>
>> If you only apply spark's top N, I don't think you can reach such performance 
>> as shown below.
>>
>> That's impossible without reducing disk IO.
>>
>>
>>
>>
>>
>>
>>
>>
>> At 2017-03-30 03:12:54, "Ravindra Pesala"  wrote:
>> >Hi,
>> >
>> >You mean Carbon does the sorting if the order by column is not the first
>> >column and provides only the limit values to spark. But spark is also doing
>> >the same job; it just sorts the partition and gets the top values out of it.
>> >You can reduce the table_blocksize to get better sort performance, as spark
>> >tries to do the sorting in memory.
>> >
>> >I can see we can do some optimizations in the integration layer itself without
>> >pushing down any logic to carbon, like: if the order by column is the first
>> >column, then we can just get the limit values without sorting any data.
>> >
>> >Regards,
>> >Ravindra.
>> >
>> >On 29 March 2017 at 08:58, 马云  wrote:
>> >
>> >> Hi Ravindran,
>> >> Thanks for your quick response. please see my answer as below
>> >> 
>> >>  What if the order by column is not the first column? It needs to scan all
>> >> blocklets to get the data out of it if the order by column is not first
>> >> column of mdk
>> >> 
>> >> Answer : if step2 doesn't filter any blocklet, you are right, it needs to
>> >> scan all blocklets to get the data out of them if the order by column is not
>> >> the first column of the mdk,
>> >> but it just scans all of the order by column's data; for the
>> >> other columns' data, it uses the lazy-load strategy and can reduce the scan
>> >> according to the limit value.
>> >>

[jira] [Created] (CARBONDATA-835) Null values in carbon table gives a NullPointerException when querying from Presto

2017-03-30 Thread Bhavya Aggarwal (JIRA)
Bhavya Aggarwal created CARBONDATA-835:
--

 Summary: Null values in carbon table gives a NullPointerException 
when querying from Presto
 Key: CARBONDATA-835
 URL: https://issues.apache.org/jira/browse/CARBONDATA-835
 Project: CarbonData
  Issue Type: Bug
  Components: presto-integration
 Environment: Presto
Reporter: Bhavya Aggarwal
Assignee: Bhavya Aggarwal
Priority: Minor


Null values in carbon table gives a NullPointerException when querying from 
Presto



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-834) Describe Table in Presto gives incorrect order of columns

2017-03-30 Thread Bhavya Aggarwal (JIRA)
Bhavya Aggarwal created CARBONDATA-834:
--

 Summary: Describe Table in Presto gives incorrect order of columns
 Key: CARBONDATA-834
 URL: https://issues.apache.org/jira/browse/CARBONDATA-834
 Project: CarbonData
  Issue Type: Bug
  Components: presto-integration
 Environment: Presto
Reporter: Bhavya Aggarwal
Assignee: Bhavya Aggarwal
Priority: Minor


Describe Table in Presto gives incorrect order of columns



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: [DISCUSSION]: (New Feature) Streaming Ingestion into CarbonData

2017-03-30 Thread ZhuWilliam
The design is too complex, and we may spend too much time and too many people
developing it. Can we simplify it and just support streaming first?



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSSION-New-Feature-Streaming-Ingestion-into-CarbonData-tp9724p9855.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.


Re:Re: Load data into carbondata executors distributed unevenly

2017-03-30 Thread a
add attachments 

At 2017-03-30 10:38:08, "Ravindra Pesala"  wrote:
>Hi,
>
>It seems attachments are missing. Can you attach them again?
>
>Regards,
>Ravindra.
>
>On 30 March 2017 at 08:02, a  wrote:
>
>> Hello!
>>
>> *Test result:*
>> When I load csv data into the carbondata table 3 times, the executors
>> are distributed unevenly. My purpose is one node one task, but the result
>> is that some nodes have 2 tasks and some nodes have no task.
>> See load data 1.png, data 2.png and data 3.png.
>> The carbondata data.PNG is the data structure in hadoop.
>>
>> I load 4   records into the carbondata table and it takes 2629 seconds,
>> which is too long.
>>
>> *Question:*
>> How can I make the executors distributed evenly?
>>
>> The environment:
>> spark2.1 + carbondata1.1, there are 7 datanodes.
>>
>> ./bin/spark-shell \
>>   --master yarn \
>>   --deploy-mode client \
>>   --num-executors n \   (the first time 7 (result in load data 1.png), the
>>                          second time 6 (result in load data 2.png), the third
>>                          time 8 (result in load data 3.png))
>>   --executor-cores 10 \
>>   --executor-memory 40G \
>>   --driver-memory 8G
>>
>> carbon.properties
>>  DataLoading Configuration 
>> carbon.sort.file.buffer.size=20
>> carbon.graph.rowset.size=1
>> carbon.number.of.cores.while.loading=10
>> carbon.sort.size=5
>> carbon.number.of.cores.while.compacting=10
>> carbon.number.of.cores=10
>>
>> Best regards!
>
>--
>Thanks & Regards,
>Ravi

[jira] [Created] (CARBONDATA-833) load data from dataframe: generated data row may be wrong when delimiterLevel1 or delimiterLevel2 is a special character

2017-03-30 Thread tianli (JIRA)
tianli created CARBONDATA-833:
-

 Summary: load data from dataframe: generated data row may be wrong 
when delimiterLevel1 or delimiterLevel2 is a special character
 Key: CARBONDATA-833
 URL: https://issues.apache.org/jira/browse/CARBONDATA-833
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Affects Versions: 1.0.0-incubating, 1.1.0-incubating
Reporter: tianli
Assignee: tianli
 Fix For: 1.1.0-incubating, 1.0.0-incubating


When loading data from a dataframe, the generated data rows may be wrong when 
delimiterLevel1 or delimiterLevel2 is a special character, because delimiterLevel1 
and delimiterLevel2 in the carbonLoadModel are created by 
CarbonUtil.delimiterConverter(), while CarbonScalaUtil.getString directly uses 
carbonLoadModel.getComplexDelimiterLevel1 and 
carbonLoadModel.getComplexDelimiterLevel2:

    val delimiter = if (level == 1) {
      delimiterLevel1
    } else {
      delimiterLevel2
    }
    val builder = new StringBuilder()
    s.foreach { x =>
      builder.append(getString(x, serializationNullFormat, delimiterLevel1,
        delimiterLevel2, timeStampFormat, dateFormat, level + 1)).append(delimiter)
    }

This adds an extra '\' character to primitive data when the data type is complex.
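
A standalone illustration of the symptom (plain Scala; that delimiterConverter
escapes special characters for later regex-based splits is an assumption based on
the description above):

  val delimiter = "$"                                  // complex-type delimiter chosen by the user
  val escapedForSplit = "\\" + delimiter               // form a regex-oriented converter would hold
  val wrong = Seq("a", "b").mkString(escapedForSplit)  // "a\$b" -> stray '\' written into the data
  val right = Seq("a", "b").mkString(delimiter)        // "a$b"  -> what should be written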




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Re:Re: Re: Optimize Order By + Limit Query

2017-03-30 Thread Liang Chen
Hi

+1 for simafengyun's optimization, it looks good to me.

I propose to do "limit" pushdown first, similar to filter pushdown. What
is your opinion? @simafengyun

For "order by" pushdown, let us work out an ideal solution that considers all
aggregation push down cases. Ravindra's comment is reasonable: we need to
consider decoupling spark and carbondata, otherwise maintenance cost might
be high if we do computing work on both sides, because we need to keep
utilizing Spark's computing capability along with its version evolution.

Regards
Liang


simafengyun wrote
> Hi Ravindran,
> Yes, carbon does the sorting if the order by column is not the first
> column. But its sorting efficiency is very high, since the dimension data
> in a blocklet is stored already sorted. So carbon can use merge sort +
> top-N to get N rows from each block. In addition, the biggest difference
> is that it can reduce disk IO, since limit n can reduce the required
> blocklets. If you only apply spark's top N, I don't think you can reach
> such performance as shown below. That's impossible without reducing disk IO.
> 

 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> At 2017-03-30 03:12:54, "Ravindra Pesala" ravi.pes...@gmail.com
> wrote:
>>Hi,
>>
>>You mean Carbon does the sorting if the order by column is not the first
>>column and provides only the limit values to spark. But spark is also doing
>>the same job; it just sorts the partition and gets the top values out of it.
>>You can reduce the table_blocksize to get better sort performance, as spark
>>tries to do the sorting in memory.
>>
>>I can see we can do some optimizations in the integration layer itself without
>>pushing down any logic to carbon, like: if the order by column is the first
>>column, then we can just get the limit values without sorting any data.
>>
>>Regards,
>>Ravindra.
>>
>>On 29 March 2017 at 08:58, 马云 simafengyun1...@163.com wrote:
>>
>>> Hi Ravindran,
>>> Thanks for your quick response. please see my answer as below
>>> 
>>>  What if the order by column is not the first column? It needs to scan
>>> all
>>> blocklets to get the data out of it if the order by column is not first
>>> column of mdk
>>> 
>>> Answer : if step2 doesn't filter any blocklet, you are right, it needs to
>>> scan all blocklets to get the data out of them if the order by column is
>>> not the first column of the mdk,
>>> but it just scans all of the order by column's data; for the
>>> other columns' data, it uses the lazy-load strategy and can reduce the scan
>>> according to the limit value.
>>> Hence you can see the performance is much better now
>>> after my optimization. Currently the carbondata order by + limit
>>> performance is very bad since it scans all data.
>>> In my test there are 20,000,000 rows and it takes more than
>>> 10s; if the data is much larger, I think it is hard for users to stand
>>> such bad performance when they do an order by + limit query.
>>>
>>>
>>> 
>>>  We used to have multiple push down optimizations from spark to carbon,
>>> like aggregation, limit, topn etc. But later they were removed because they
>>> are very hard to maintain from version to version. I feel it is better that
>>> an execution engine like spark does these types of operations.
>>> 
>>> Answer : In my opinion, I don't think "hard to maintain from version to
>>> version" is a good reason to give up the order by + limit optimization.
>>> I think we can create a new class that extends the current one and try to
>>> reduce the impact on the current code. Maybe that can make it easy to
>>> maintain. Maybe I am wrong.
>>>
>>>
>>>
>>>
>>> At 2017-03-29 02:21:58, "Ravindra Pesala" ravi.pes...@gmail.com
>>> wrote:
>>>
>>>
>>> Hi Jarck Ma,
>>>
>>> It is great to try optimizing Carbondata.
>>> I think this solution comes up with many limitations. What if the order
>>> by
>>> column is not the first column? It needs to scan all blocklets to get
>>> the
>>> data out of it if the order by column is not first column of mdk.
>>>
>>> We used to have multiple push down optimizations from spark to carbon, like
>>> aggregation, limit, topn etc. But later they were removed because they are
>>> very hard to maintain from version to version. I feel it is better that an
>>> execution engine like spark does these types of operations.
>>>
>>>
>>> Regards,
>>> Ravindra.
>>>
>>>
>>>
>>> On Tue, Mar 28, 2017, 14:28 马云 simafengyun1...@163.com wrote:
>>>
>>>
>>> Hi Carbon Dev,
>>>
>>> Currently I have done the optimization for ordering by 1 dimension.
>>>
>>> My local performance test results are below. Please give your suggestions.
>>>
>>>
>>>
>>>
>>> | data count | test sql | limit value in sql | performance(ms) |
>>> | optimized code | original code |
>>> |