[GitHub] incubator-carbondata-site issue #2: Home Page changes
Github user sgururajshetty commented on the issue: https://github.com/apache/incubator-carbondata-site/pull/2 @chenliang613 please review the changes.
[jira] [Created] (CARBONDATA-545) Carbon Query GC Problem
kumar vishal created CARBONDATA-545:
---
Summary: Carbon Query GC Problem
Key: CARBONDATA-545
URL: https://issues.apache.org/jira/browse/CARBONDATA-545
Project: CarbonData
Issue Type: Improvement
Components: data-query
Reporter: kumar vishal
Assignee: kumar vishal
Fix For: 1.0.0-incubating

Problem: There is a lot of GC activity when Carbon processes a large number of records during a query, which impacts query performance. The GC problem appears when the query output is very large or when many records are processed.

Solution: Currently, all the data read from the carbon data file during a query is stored in the heap; when the query output is large, this causes more GC. Instead of storing it in the heap, we can store this data off-heap and clear it when scanning finishes for that query.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
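The proposed approach can be illustrated with a minimal, hypothetical sketch (this is not CarbonData's actual implementation, and the class name `OffHeapRowStore` is made up): `ByteBuffer.allocateDirect` reserves memory outside the Java heap, so large scan results do not add GC pressure and never get promoted from the young to the old generation; the buffer simply becomes unreachable when the scan finishes.

```scala
import java.nio.ByteBuffer

// Hypothetical sketch of the off-heap idea from CARBONDATA-545 (not
// CarbonData code): scanned row bytes live in a direct buffer outside the
// Java heap, so a huge query result does not churn the garbage collector.
class OffHeapRowStore(capacityBytes: Int) {
  // allocateDirect reserves memory outside the on-heap object space
  private val buffer = ByteBuffer.allocateDirect(capacityBytes)

  // Append a row's bytes at the current write position.
  def append(row: Array[Byte]): Unit = buffer.put(row)

  // Read `length` bytes starting at `offset` without moving the write position.
  def readRow(offset: Int, length: Int): Array[Byte] = {
    val out = new Array[Byte](length)
    val view = buffer.duplicate()
    view.position(offset)
    view.get(out)
    out
  }
}

val store = new OffHeapRowStore(1024)
store.append(Array[Byte](1, 2, 3))
val firstRow = store.readRow(0, 3)
```

The key trade-off this sketch shows: direct memory is invisible to the collector, so the store must be sized and released explicitly (here, by dropping the reference after the scan), which matches the "clear when scanning is finished" part of the proposal.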
Re: [Improvement] Carbon query gc problem
+1, I have suffered from the GC problem. As I understand it, the BatchResult is cached and kept in memory for a fairly long time, which causes a lot of data to be moved from the Young to the Old generation. It is better to move it off-heap.

2016-12-20 11:57 GMT+08:00 ZhuWilliam:
> +1. The heap should not store data; it should be used to store runtime temp data.
>
> --
> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Improvement-Carbon-query-gc-problem-tp4322p4718.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
Re: [Improvement] Carbon query gc problem
+1. The heap should not store data; it should be used to store runtime temp data.

--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Improvement-Carbon-query-gc-problem-tp4322p4718.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
Re: [Improvement] Carbon query gc problem
Hi, +1. Storing data off-heap to avoid the GC problem will help performance even more.

Kumar Vishal wrote
> There is a lot of GC when carbon processes a large number of records during a query, which impacts carbon query performance. To solve this GC problem, which happens when the query output is very large or when many records are processed, I would like to propose the solution below. Currently we store all the data read from the carbon data file during a query in the heap; when the query output is large, this causes more GC. Instead of storing it in the heap, we can store this data off-heap and clear it when scanning is finished for that query. Please vote and comment on the above proposal. -Regards Kumar Vishal

--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Improvement-Carbon-query-gc-problem-tp4322p4717.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
Re: InvalidInputException when loading data to table
OK, thx~ It's a local path. In the error log it shows that dataFilePath is set to /home/hadoop/carbondata/sample.csv, which is where my test file is located. @see the log: Input path does not exist: /home/hadoop/carbondata/sample.csv

In the following command, is the class File from the package java.io?

scala>val dataFilePath = new File("../carbondata/sample.csv").getCanonicalPath

------ Original message ------
From: "Liang Chen"
Date: 2016-12-20 8:35
To: "dev"
Subject: Re: InvalidInputException when loading data to table

Hi

1. Is your input path on Hadoop or local? Please double-check that your input path is correct.
2. As a new starter, I suggest you use IntelliJ IDEA to open CarbonData and run all the examples.

Regards
Liang

--

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: /home/hadoop/carbondata/sample.csv
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:113)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)

2016-12-19 20:24 GMT+08:00 251469031 <251469...@qq.com>:
> Hi all,
>
> I'm now learning how to get started with carbondata according to the tutorial: https://cwiki.apache.org/confluence/display/CARBONDATA/Quick+Start.
>
> I created a file named sample.csv under the path /home/hadoop/carbondata at the master node, and when I run the script:
>
> scala>val dataFilePath = new File("../carbondata/sample.
> csv").getCanonicalPath
> scala>cc.sql(s"load data inpath '$dataFilePath' into table test_table")
>
> it turns out an "InvalidInputException" while the file actually exists; here are the scripts and logs:
>
> scala> val dataFilePath = new File("../carbondata/sample.csv").getCanonicalPath
> dataFilePath: String = /home/hadoop/carbondata/sample.csv
>
> scala> cc.sql(s"load data inpath '$dataFilePath' into table test_table")
> INFO 19-12 20:18:22,991 - main Query [LOAD DATA INPATH '/HOME/HADOOP/CARBONDATA/SAMPLE.CSV' INTO TABLE TEST_TABLE]
> INFO 19-12 20:18:23,271 - Successfully able to get the table metadata file lock
> INFO 19-12 20:18:23,276 - main Initiating Direct Load for the Table : (default.test_table)
> INFO 19-12 20:18:23,279 - main Generate global dictionary from source data files!
> INFO 19-12 20:18:23,296 - main [Block Distribution]
> INFO 19-12 20:18:23,297 - main totalInputSpaceConsumed: 74 , defaultParallelism: 28
> INFO 19-12 20:18:23,297 - main mapreduce.input.fileinputformat.split.maxsize: 16777216
> INFO 19-12 20:18:23,380 - Block broadcast_0 stored as values in memory (estimated size 137.1 KB, free 137.1 KB)
> INFO 19-12 20:18:23,397 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 15.0 KB, free 152.1 KB)
> INFO 19-12 20:18:23,398 - Added broadcast_0_piece0 in memory on 172.17.195.12:46335 (size: 15.0 KB, free: 511.1 MB)
> INFO 19-12 20:18:23,399 - Created broadcast 0 from NewHadoopRDD at CarbonTextFile.scala:73
> ERROR 19-12 20:18:23,431 - main generate global dictionary failed
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: /home/hadoop/carbondata/sample.csv
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
> getSplits(FileInputFormat.java:340)
> at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:113)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> ...
>
> If any of you have met the same problem, could you tell me why this happens? Looking forward to your reply, thx~

-- Regards Liang
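One plausible explanation for the stack trace above, offered here as an assumption rather than a confirmed diagnosis: `getCanonicalPath` resolves a relative path against the JVM's current working directory and returns a bare path string with no filesystem scheme. If Spark's default filesystem is HDFS, a bare path such as /home/hadoop/carbondata/sample.csv is then looked up on HDFS, where the file does not exist even though it exists on the local disk. A small illustration (plain java.io, not CarbonData code):

```scala
import java.io.File

// Hypothetical illustration: the relative path is resolved against the
// driver's working directory, and the result carries no scheme, so Hadoop's
// FileInputFormat will interpret it against the default filesystem.
val relative = new File("../carbondata/sample.csv")
val canonical = relative.getCanonicalPath // absolute, ".." resolved, no scheme

// An explicit scheme removes the ambiguity when the file is local:
val localUri = "file://" + canonical
```

If this is the cause, passing a `file://`-prefixed path in the LOAD DATA statement (or putting the file on HDFS at the same path) would resolve the mismatch.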
Re: [DISCUSSION] CarbonData loading solution discussion
+1. Now users will have the flexibility to choose the output format, and will get a performance benefit if dictionary files are already generated. -Regards Kumar Vishal

On Fri, Dec 16, 2016 at 10:19 AM, Ravindra Pesala wrote:
> +1 to having separate output formats; now users have the flexibility to choose as per their scenario.
>
> On Fri, Dec 16, 2016, 2:47 AM Jihong Ma wrote:
> >
> > It is a great idea to have a separate OutputFormat for regular Carbon data files, index files, as well as metadata files (for instance: dictionary file, schema file, global index file, etc.) for writing Carbon-generated files laid out on HDFS, and it is orthogonal to the actual data load process.
> >
> > Regards.
> >
> > Jihong
> >
> > -----Original Message-----
> > From: Jacky Li [mailto:jacky.li...@qq.com]
> > Sent: Thursday, December 15, 2016 12:55 AM
> > To: dev@carbondata.incubator.apache.org
> > Subject: [DISCUSSION] CarbonData loading solution discussion
> >
> > Hi community,
> >
> > Since CarbonData has the global dictionary feature, loading data into CarbonData currently requires two scans of the input data. The first scan generates the dictionary; the second scan does the actual data encoding and writes the carbon files. This approach is simple, but it has at least two problems:
> > 1. It involves unnecessary IO reads.
> > 2. A MapReduce application needs two jobs to write carbon files.
> >
> > To solve this, we need a single-pass data loading solution, as discussed earlier, and the community is now developing it (CARBONDATA-401, PR310).
> >
> > In this post, I want to discuss the OutputFormat part. I think there will be two OutputFormats for CarbonData:
> > 1. DictionaryOutputFormat, which is used for global dictionary generation. (This should be extracted from CarbonColumnDictGeneratRDD)
> > 2. TableOutputFormat, which is used for writing CarbonData files.
> >
> > When carbon has these output formats, it is easier to integrate with compute frameworks like Spark, Hive, and MapReduce.
> > To make data loading faster, users can choose a different solution based on their scenario, as follows:
> >
> > Scenario 1: The first load is small (cannot cover most of the dictionary)
> > In the first few loads, run two jobs that use DictionaryOutputFormat and TableOutputFormat accordingly.
> > After some loads, it becomes like Scenario 2: run one job that uses TableOutputFormat with single-pass.
> >
> > Scenario 2: The first load is big (can cover most of the dictionary)
> > For the first load: if the biggest column cardinality > 10K, run two jobs using the two output formats; otherwise, run one job that uses TableOutputFormat with single-pass.
> > For subsequent loads, run one job that uses TableOutputFormat with single-pass.
> >
> > What do you think of this idea?
> >
> > Regards,
> > Jacky
> >
>
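The scenario rules Jacky describes can be sketched as a small decision function. This is purely illustrative: the names `LoadPlan`, `TwoJobs`, `SinglePass`, and `chooseLoadPlan` are hypothetical, and only the 10K cardinality threshold comes from the mail.

```scala
// Hypothetical sketch of the loading-strategy choice described above.
sealed trait LoadPlan
case object TwoJobs    extends LoadPlan // DictionaryOutputFormat job, then TableOutputFormat job
case object SinglePass extends LoadPlan // one TableOutputFormat job with single-pass dictionary

// Scenario 2's rule: only a first load with a very-high-cardinality column
// justifies a separate dictionary-generation job; everything else can be
// done in a single pass.
def chooseLoadPlan(firstLoad: Boolean, maxColumnCardinality: Long): LoadPlan =
  if (firstLoad && maxColumnCardinality > 10000L) TwoJobs
  else SinglePass

// First big load with a high-cardinality column: two jobs.
val bigFirstLoad = chooseLoadPlan(firstLoad = true, maxColumnCardinality = 50000L)
// Subsequent loads reuse the dictionary: single pass.
val laterLoad = chooseLoadPlan(firstLoad = false, maxColumnCardinality = 50000L)
```

Scenario 1 (a small first load) would keep returning the two-job plan for the first few loads until the dictionary coverage is good enough; how to measure that coverage is left open in the discussion.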
Re: How to compile the latest source code of carbondata
Thx liang. I solved the problem. In the file carbon-spark-shell, FWDIR is set to $SPARK_HOME. I have configured SPARK_HOME in /etc/profile, and the output of the command "echo $SPARK_HOME" is correct, which means SPARK_HOME has been set. But if I don't run "export SPARK_HOME=..." before running "./bin/carbon-spark-shell", the variable FWDIR can't be set. I wonder why that is.

------ Original message ------
From: <251469...@qq.com>
Date: 2016-12-19 4:02
To: "dev"
Subject: Re: How to compile the latest source code of carbondata

I can visit the Spark web UI at http://master:8080/. Are there any other environment settings that I should configure?

------ Original message ------
From: "Liang Chen"
Date: 2016-12-19 3:40
To: "dev"
Subject: Re: How to compile the latest source code of carbondata

Hi

Please check whether your Spark environment is ready.

2016-12-19 15:34 GMT+08:00 251469031 <251469...@qq.com>:
> the privileges of the folder "carbondata" are:
>
> drwxr-xr-x 18 hadoop hadoop 4096 Dec 19 14:56 carbondata
>
> and hadoop is the user who runs maven.
>
> well, after running the mvn command, I get the following info from the console:
>
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Apache CarbonData :: Parent ................. SUCCESS [  1.012 s]
> [INFO] Apache CarbonData :: Common ................. SUCCESS [  2.066 s]
> [INFO] Apache CarbonData :: Core ................... SUCCESS [  5.512 s]
> [INFO] Apache CarbonData :: Processing ............. SUCCESS [  1.892 s]
> [INFO] Apache CarbonData :: Hadoop ................. SUCCESS [  0.789 s]
> [INFO] Apache CarbonData :: Spark Common ........... SUCCESS [ 17.121 s]
> [INFO] Apache CarbonData :: Spark .................. SUCCESS [ 33.269 s]
> [INFO] Apache CarbonData :: Assembly ...............
SUCCESS [ 17.700 s]
> [INFO] Apache CarbonData :: Spark Examples ......... SUCCESS [  7.741 s]
> [INFO]
> [INFO] BUILD SUCCESS
> [INFO]
> [INFO] Total time: 01:27 min
> [INFO] Finished at: 2016-12-19T14:57:26+08:00
> [INFO] Final Memory: 83M/1623M
> [INFO]
>
> but I didn't find a file named spark-submit under the path carbondata/bin/:
>
> [hadoop@master ~]$ cd carbondata/bin/
> [hadoop@master bin]$ ll
> total 8
> -rwxrwxr-x 1 hadoop hadoop 3879 Dec 19 14:54 carbon-spark-shell
> -rwxrwxr-x 1 hadoop hadoop 2820 Dec 19 14:54 carbon-spark-sql
>
> is this phenomenon normal?
>
> ------ Original message ------
> From: "Liang Chen"
> Date: 2016-12-19 3:19
> To: "dev"
> Subject: Re: How to compile the latest source code of carbondata
>
> Hi
>
> Please check whether you have sufficient rights on the folder "carbondata".
>
> ---
> For spark 1.5, the compile process has no issue, but carbon-spark-shell cannot run correctly:
> step 1: git clone https://github.com/apache/incubator-carbondata.git carbondata
> step 2: mvn clean package -DskipTests -Pspark-1.5
> step 3: ./bin/carbon-spark-shell, and it turns out:
>
> [hadoop@master carbondata]$ ./bin/carbon-spark-shell
> ./bin/carbon-spark-shell: line 78: /bin/spark-submit: No such file or directory
>
> 2016-12-19 15:05 GMT+08:00 251469031 <251469...@qq.com>:
>
> > thx liang.
> >
> > I've tried spark 2.0.0 and spark 1.5.0; my steps & script are:
> >
> > For spark 2.0, the compile process has no issue, but carbon-spark-shell cannot run correctly:
> > step 1: git clone https://github.com/apache/incubator-carbondata.git carbondata
> > step 2: mvn clean package -DskipTests -Pspark-2.0
> > step 3: ./bin/carbon-spark-shell, and it turns out:
> >
> > [hadoop@master carbondata]$ ./bin/carbon-spark-shell
> > ls: cannot access /home/hadoop/carbondata/assembly/target/scala-2.10: No such file or directory
> > ls: cannot access /home/hadoop/carbondata/assembly/target/scala-2.10: No such file or directory
> > ./bin/carbon-spark-shell: line 78: /bin/spark-submit: No such file or directory
> >
> > For spark 1.5, the compile process has no issue, but carbon-spark-shell cannot run correctly:
> > step 1: git clone https://github.com/apache/incubator-carbondata.git carbondata
> > step 2: mvn clean package -DskipTests -Pspark-1.5
> > step 3: ./bin/carbon-spark-shell, and it turns out:
> >
> > [hadoop@master carbondata]$ ./bin/carbon-spark-shell
> >
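The puzzle described earlier in this thread (echo $SPARK_HOME prints the right value interactively, yet carbon-spark-shell cannot see it until an explicit export) is consistent with SPARK_HOME being assigned in /etc/profile without `export`: an unexported shell variable is visible in the current shell but is not placed in the environment, so child processes such as the script's spark-submit invocation never inherit it. A minimal demonstration (the path /opt/spark is made up for illustration):

```shell
# A variable assigned without `export` is visible in the current shell
# but NOT inherited by child processes.
unset SPARK_HOME
SPARK_HOME=/opt/spark
echo "parent sees: $SPARK_HOME"

child_before=$(sh -c 'echo "$SPARK_HOME"')   # empty: child did not inherit it
export SPARK_HOME                            # mark the variable for export
child_after=$(sh -c 'echo "$SPARK_HOME"')    # now the child sees /opt/spark
echo "child before export: '$child_before', after export: '$child_after'"
```

So writing `export SPARK_HOME=/path/to/spark` in /etc/profile (rather than a plain assignment) should make carbon-spark-shell work without the manual export step.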