[GitHub] incubator-carbondata-site issue #2: Home Page changes

2016-12-19 Thread sgururajshetty
Github user sgururajshetty commented on the issue:

https://github.com/apache/incubator-carbondata-site/pull/2
  
@chenliang613 please review the changes.




[jira] [Created] (CARBONDATA-545) Carbon Query GC Problem

2016-12-19 Thread kumar vishal (JIRA)
kumar vishal created CARBONDATA-545:
---

 Summary: Carbon Query GC Problem
 Key: CARBONDATA-545
 URL: https://issues.apache.org/jira/browse/CARBONDATA-545
 Project: CarbonData
  Issue Type: Improvement
  Components: data-query
Reporter: kumar vishal
Assignee: kumar vishal
 Fix For: 1.0.0-incubating


Problem
There is a lot of GC when Carbon processes a large number of records during a
query, which hurts carbon query performance. This GC problem appears when the
query output is very large or when many records are processed, and the
solution below aims to address it.

Solution
Currently we store all the data read from the carbon data file during a query
in the heap; when the query output is large, this causes heavy GC. Instead of
storing it in the heap, we can store this data off-heap and clear it once
scanning is finished for that query.
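
A minimal sketch of the idea, assuming a direct ByteBuffer as the off-heap
store (the class and method names below are illustrative, not actual
CarbonData APIs):

    import java.nio.ByteBuffer

    // Illustrative only: hold scanned rows in a direct (off-heap) buffer
    // and release it when the scan finishes, so the rows never occupy the
    // Java heap and never add GC pressure.
    class OffHeapScanBuffer(capacityBytes: Int) {
      private val buf: ByteBuffer = ByteBuffer.allocateDirect(capacityBytes)

      // Append one encoded row; direct buffers keep bytes outside the heap.
      def put(row: Array[Byte]): Unit = buf.put(row)

      // Prepare the accumulated rows for reading back.
      def flipForRead(): Unit = buf.flip()

      // Called when scanning is finished for the query; the off-heap memory
      // is reused or freed without promoting anything to the old generation.
      def clear(): Unit = buf.clear()
    }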





Re: [Improvement] Carbon query gc problem

2016-12-19 Thread An Lan
+1, I have suffered from this gc problem. As I understand it, the BatchResult
is cached and kept in memory for a fairly long time, which causes a lot of
data to be promoted from the young generation to the old. It is better to
move it off-heap.

2016-12-20 11:57 GMT+08:00 ZhuWilliam :

> +1   Heap should not store data ,it should be used to store runtime temp
> data.
>


Re: [Improvement] Carbon query gc problem

2016-12-19 Thread ZhuWilliam
+1. The heap should not store data; it should be used for runtime temporary
data.





Re: [Improvement] Carbon query gc problem

2016-12-19 Thread Liang Chen
Hi, +1. Storing data off-heap to avoid the gc problem, the solution will help
performance a lot.
Kumar Vishal wrote
> There are lots of gc when carbon is processing more number of records
> during query, which is impacting carbon query performance. To solve this
> gc problem happening when query output is too huge or when more number of
> records are processed, I would like to propose below solution. Currently
> we are storing all the data which is read during query from carbon data
> file in heap, when number of query output is huge it is causing more gc.
> Instead of storing in heap we can store this data in offheap and will
> clear when scanning is finished for that query. Please vote and comment
> for above proposal.
> -Regards
> Kumar Vishal






Re: InvalidInputException when loading data to table

2016-12-19 Thread 251469031
OK, thx~


It's a local path. Well, in the error log it shows that dataFilePath is set
to /home/hadoop/carbondata/sample.csv, which is where my test file is
located. @see the log:


Input path does not exist: /home/hadoop/carbondata/sample.csv


In the following command, is the class File from the package java.io?
scala>val dataFilePath = new File("../carbondata/sample.csv").getCanonicalPath
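
A hedged aside on the error above: when Spark's default filesystem is HDFS, a
bare path such as /home/hadoop/carbondata/sample.csv is looked up on HDFS
rather than on the local disk, which would explain "Input path does not
exist" even though the file exists locally. A minimal sketch of the
workaround, assuming an HDFS-backed cluster:

    import java.io.File

    // Prefix the file:// scheme so the path is resolved on the local
    // filesystem of the driver node instead of on HDFS.
    val dataFilePath = "file://" +
      new File("../carbondata/sample.csv").getCanonicalPath
    cc.sql(s"load data inpath '$dataFilePath' into table test_table")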






------------------ Original message ------------------
From: "Liang Chen";
Date: 2016-12-20 (Tue) 8:35
To: "dev";

Subject: Re: InvalidInputException when loading data to table



Hi

1. Is your input path on hadoop or local? Please double-check that your
input path is correct.
2. As a new starter, I suggest you use IntelliJ IDEA to open CarbonData and
run all the examples.

Regards
Liang
--
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
does not exist: /home/hadoop/carbondata/sample.csv
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:113)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)

2016-12-19 20:24 GMT+08:00 251469031 <251469...@qq.com>:

> Hi all,
>
> I'm now learning how to get started with carbondata according to the
> tutorial:
> https://cwiki.apache.org/confluence/display/CARBONDATA/Quick+Start.
>
>
> I created a file named sample.csv under the path
> /home/hadoop/carbondata on the master node, and when I run the script:
>
>
> scala>val dataFilePath = new File("../carbondata/sample.csv").getCanonicalPath
> scala>cc.sql(s"load data inpath '$dataFilePath' into table test_table")
>
>
> it throws an InvalidInputException while the file actually exists;
> here are the scripts and logs:
>
>
> scala> val dataFilePath = new File("../carbondata/sample.csv").getCanonicalPath
> dataFilePath: String = /home/hadoop/carbondata/sample.csv
>
>
> scala> cc.sql(s"load data inpath '$dataFilePath' into table test_table")
> INFO  19-12 20:18:22,991 - main Query [LOAD DATA INPATH
> '/HOME/HADOOP/CARBONDATA/SAMPLE.CSV' INTO TABLE TEST_TABLE]
> INFO  19-12 20:18:23,271 - Successfully able to get the table metadata file lock
> INFO  19-12 20:18:23,276 - main Initiating Direct Load for the Table : (default.test_table)
> INFO  19-12 20:18:23,279 - main Generate global dictionary from source data files!
> INFO  19-12 20:18:23,296 - main [Block Distribution]
> INFO  19-12 20:18:23,297 - main totalInputSpaceConsumed: 74 , defaultParallelism: 28
> INFO  19-12 20:18:23,297 - main mapreduce.input.fileinputformat.split.maxsize: 16777216
> INFO  19-12 20:18:23,380 - Block broadcast_0 stored as values in memory (estimated size 137.1 KB, free 137.1 KB)
> INFO  19-12 20:18:23,397 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 15.0 KB, free 152.1 KB)
> INFO  19-12 20:18:23,398 - Added broadcast_0_piece0 in memory on 172.17.195.12:46335 (size: 15.0 KB, free: 511.1 MB)
> INFO  19-12 20:18:23,399 - Created broadcast 0 from NewHadoopRDD at CarbonTextFile.scala:73
> ERROR 19-12 20:18:23,431 - main generate global dictionary failed
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
> does not exist: /home/hadoop/carbondata/sample.csv
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
> at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:113)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> ...
>
>
> If any of you have met the same problem, would you tell me why this
> happens? Looking forward to your reply, thx~




-- 
Regards
Liang

Re: [DISCUSSION] CarbonData loading solution discussion

2016-12-19 Thread Kumar Vishal
+1
Now the user will have the flexibility to choose the output format, and will
get a performance benefit if the dictionary files are already generated.

-Regards
Kumar Vishal


On Fri, Dec 16, 2016 at 10:19 AM, Ravindra Pesala wrote:

> +1 to having separate output formats; now the user has the flexibility to
> choose as per the scenario.
>
> On Fri, Dec 16, 2016, 2:47 AM Jihong Ma  wrote:
>
> >
> > It is a great idea to have a separate OutputFormat for regular Carbon
> > data files, index files, and metadata files (for instance: dictionary
> > file, schema file, global index file, etc.) for writing Carbon-generated
> > files laid out on HDFS, and it is orthogonal to the actual data load
> > process.
> >
> > Regards.
> >
> > Jihong
> >
> > -Original Message-
> > From: Jacky Li [mailto:jacky.li...@qq.com]
> > Sent: Thursday, December 15, 2016 12:55 AM
> > To: dev@carbondata.incubator.apache.org
> > Subject: [DISCUSSION] CarbonData loading solution discussion
> >
> >
> > Hi community,
> >
> > Since CarbonData has a global dictionary feature, loading data into
> > CarbonData currently requires two scans of the input data. The first
> > scan generates the dictionary; the second scan does the actual data
> > encoding and writes the carbon files. Obviously, this approach is
> > simple, but it has at least two problems:
> > 1. it involves unnecessary IO reads;
> > 2. a MapReduce application needs two jobs to write carbon files.
> >
> > To solve this, we need single-pass data loading solution, as discussed
> > earlier, and now community is developing it (CARBONDATA-401, PR310).
> >
> > In this post, I want to discuss the OutputFormat part; I think there
> > will be two OutputFormats for CarbonData:
> > 1. DictionaryOutputFormat, which is used for the global dictionary
> > generation. (This should be extracted from CarbonColumnDictGeneratRDD.)
> > 2. TableOutputFormat, which is used for writing CarbonData files.
> >
> > When carbon has these output formats, it is easier to integrate with
> > compute frameworks like Spark, Hive, and MapReduce. And in order to make
> > data loading faster, the user can choose a different solution based on
> > the scenario, as follows.
> >
> > Scenario 1: the first load is small (cannot cover most of the dictionary)
> > - in the first few loads, run two jobs that use DictionaryOutputFormat
> > and TableOutputFormat accordingly
> > - after some loads, it becomes like Scenario 2: run one job that uses
> > TableOutputFormat with single-pass
> >
> > Scenario 2: the first load is big (can cover most of the dictionary)
> > - for the first load: if the biggest column cardinality > 10K, run two
> > jobs using the two output formats; otherwise, run one job that uses
> > TableOutputFormat with single-pass
> > - for subsequent loads, run one job that uses TableOutputFormat with
> > single-pass
> >
> > What do you think of this idea?
> >
> > Regards,
> > Jacky
> >
>
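
A minimal sketch of the two formats discussed above (the class shapes and
type parameters are assumptions for illustration, not the actual
CARBONDATA-401 / PR310 code):

    import org.apache.hadoop.mapreduce.OutputFormat

    // Hypothetical skeleton for job 1 (optional): consumes
    // (column name, distinct value) pairs and writes/updates the global
    // dictionary file of each column.
    abstract class DictionaryOutputFormat extends OutputFormat[String, String]

    // Hypothetical skeleton for job 2: consumes rows, encodes them against
    // the dictionary (pre-built, or built on the fly in single-pass mode),
    // and writes carbon data and index files.
    abstract class TableOutputFormat extends OutputFormat[Void, Array[AnyRef]]

Under such a split, the scenario logic above reduces to a per-load choice of
which job or jobs to submit.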


Re: How to compile the latest source code of carbondata

2016-12-19 Thread 251469031
Thx Liang.


I solved the problem.
In the file carbon-spark-shell, FWDIR is set to $SPARK_HOME. I have
configured $SPARK_HOME in /etc/profile, and the output of the command
"echo $SPARK_HOME" is correct, which means $SPARK_HOME has been set.


But if I don't run the command "export SPARK_HOME=" before running the
command "./bin/carbon-spark-shell", the variable FWDIR can't be set. I
wonder why that is.




------------------ Original message ------------------
From: <251469...@qq.com>;
Date: 2016-12-19 (Mon) 4:02
To: "dev";

Subject: Re: How to compile the latest source code of carbondata



I can visit the Spark web UI at http://master:8080/.
Are there any other environment settings I should configure?




------------------ Original message ------------------
From: "Liang Chen";
Date: 2016-12-19 (Mon) 3:40
To: "dev";

Subject: Re: How to compile the latest source code of carbondata



Hi

Please check whether your Spark environment is ready.



2016-12-19 15:34 GMT+08:00 251469031 <251469...@qq.com>:

> the permissions of the folder "carbondata" are:
>
>
> drwxr-xr-x 18 hadoop hadoop  4096 Dec 19 14:56 carbondata
>
>
> and hadoop is the user who run maven.
>
>
> well, after running the mvn command, I get the following info from the
> console:
>
>
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Apache CarbonData :: Parent ................ SUCCESS [  1.012 s]
> [INFO] Apache CarbonData :: Common ................ SUCCESS [  2.066 s]
> [INFO] Apache CarbonData :: Core .................. SUCCESS [  5.512 s]
> [INFO] Apache CarbonData :: Processing ............ SUCCESS [  1.892 s]
> [INFO] Apache CarbonData :: Hadoop ................ SUCCESS [  0.789 s]
> [INFO] Apache CarbonData :: Spark Common .......... SUCCESS [ 17.121 s]
> [INFO] Apache CarbonData :: Spark ................. SUCCESS [ 33.269 s]
> [INFO] Apache CarbonData :: Assembly .............. SUCCESS [ 17.700 s]
> [INFO] Apache CarbonData :: Spark Examples ........ SUCCESS [  7.741 s]
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD SUCCESS
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 01:27 min
> [INFO] Finished at: 2016-12-19T14:57:26+08:00
> [INFO] Final Memory: 83M/1623M
> [INFO] ------------------------------------------------------------------------
>
>
>
> but I didn't find a file named spark-submit under the path carbondata/bin/:
>
>
> [hadoop@master ~]$ cd carbondata/bin/
> [hadoop@master bin]$ ll
> total 8
> -rwxrwxr-x 1 hadoop hadoop 3879 Dec 19 14:54 carbon-spark-shell
> -rwxrwxr-x 1 hadoop hadoop 2820 Dec 19 14:54 carbon-spark-sql
>
>
>
> is this phenomenon normal?
>
>
>
>
>
> ------------------ Original message ------------------
> From: "Liang Chen";
> Date: 2016-12-19 (Mon) 3:19
> To: "dev";
>
> Subject: Re: How to compile the latest source code of carbondata
>
>
>
> Hi
>
> Please check that you have sufficient permissions on the folder
> "carbondata".
> 
> ---
> For spark 1.5, the compile process has no issue, but carbon-spark-shell
> cannot run correctly:
> step 1: git clone https://github.com/apache/incubator-carbondata.git
>  carbondata
> step 2: mvn clean package -DskipTests -Pspark-1.5
> step 3: ./bin/carbon-spark-shell, and it turns out:
>
>
> [hadoop@master carbondata]$ ./bin/carbon-spark-shell
> ./bin/carbon-spark-shell: line 78: /bin/spark-submit: No such file or
> directory
>
> 2016-12-19 15:05 GMT+08:00 251469031 <251469...@qq.com>:
>
> > thx liang.
> >
> >
> > I've tried spark 2.0.0 and spark 1.5.0, my step & script is:
> >
> >
> > For spark 2.0, the compile process has no issue, but carbon-spark-shell
> > cannot run correctly:
> >
> >
> > step 1: git clone https://github.com/apache/incubator-carbondata.git
> > carbondata
> > step 2: mvn clean package -DskipTests -Pspark-2.0
> > step 3: ./bin/carbon-spark-shell, and it turns out:
> >
> >
> > [hadoop@master carbondata]$ ./bin/carbon-spark-shell
> > ls: cannot access /home/hadoop/carbondata/assembly/target/scala-2.10: No
> > such file or directory
> > ls: cannot access /home/hadoop/carbondata/assembly/target/scala-2.10: No
> > such file or directory
> > ./bin/carbon-spark-shell: line 78: /bin/spark-submit: No such file or
> > directory
> >
> >
> >
> > For spark 1.5, the compile process has no issue, but carbon-spark-shell
> > cannot run correctly:
> > step 1: git clone https://github.com/apache/incubator-carbondata.git
> > carbondata
> > step 2: mvn clean package -DskipTests -Pspark-1.5
> > step 3: ./bin/carbon-spark-shell, and it turns out:
> >
> >
> > [hadoop@master carbondata]$ ./bin/carbon-spark-shell
> >