Re: The Processing loading of Spark streaming on YARN is not in balance

2015-04-30 Thread Saisai Shao
From the chart you pasted, I guess you only have one receiver with storage level two copies, so mostly your tasks are located on two executors. You could use repartition to redistribute the data more evenly across the executors. Adding more receivers is another solution. 2015-04-30 14:38
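Below is a minimal sketch of both suggestions, assuming the receiver-based Kafka API and a spark-shell style setup; zkQuorum, groupId and the topic map are placeholders, not values from the original thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("balanced-streaming"), Seconds(5))
    val zkQuorum = "zk1:2181"; val groupId = "my-group"; val topicMap = Map("my_topic" -> 1)

    // Option 1: several receivers, unioned into a single stream
    val streams = (1 to 3).map(_ => KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap))
    val unioned = ssc.union(streams)

    // Option 2: repartition each batch so records spread across all executors
    val balanced = unioned.repartition(ssc.sparkContext.defaultParallelism)
    balanced.count().print()

    ssc.start()
    ssc.awaitTermination()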

Re: default number of reducers

2015-04-30 Thread Akhil Das
This is the Spark mailing list :/ Yes, you can configure the following in the mapred-site.xml for that: <property><name>mapred.tasktracker.map.tasks.maximum</name><value>4</value></property> Thanks Best Regards On Tue, Apr 28, 2015 at 11:00 PM, Shushant Arora shushantaror...@gmail.com wrote: In

Re: spark with standalone HBase

2015-04-30 Thread Saurabh Gupta
I am using hbase-0.94.8. On Wed, Apr 29, 2015 at 11:56 PM, Ted Yu yuzhih...@gmail.com wrote: Can you enable HBase DEBUG logging in log4j.properties so that we can have more clue? What hbase release are you using? Cheers On Wed, Apr 29, 2015 at 4:27 AM, Saurabh Gupta

Re: 1.3.1: Persisting RDD in parquet - Conflicting partition column names

2015-04-30 Thread Cheng Lian
Did you have a directory layout like this?

base/
|  data-file-2.parquet
|  batch_id=1/
|  |  data-file-1.parquet

Cheng On 4/28/15 11:20 AM, sranga wrote: Hi I am getting the following error when persisting an RDD in parquet format to an S3 location. This is code that was working in the 1.2

Best strategy for Pandas - Spark

2015-04-30 Thread Olivier Girardot
Hi everyone, Let's assume I have a complex workflow of more than 10 datasources as input - 20 computations (some creating intermediary datasets and some merging everything for the final computation) - some taking on average 1 minute to complete and some taking more than 30 minutes. What would be

SparkContext.getCallSite is in the top of profiler by memory allocation

2015-04-30 Thread Igor Petrov
Hello, we send a lot of small jobs to Spark (up to 500 in a second). When profiling, I see Throwable.getStackTrace() at the top of the memory profiler, which is caused by SparkContext.getCallSite - this is memory consuming. We use the Java API; I tried to call SparkContext.setCallSite(-) before

Re: Performance advantage by loading data from local node over S3.

2015-04-30 Thread Akhil Das
If the data is too huge and is in S3, that'll be a lot of network traffic, instead, if the data is available in HDFS (with proper replication available) then it will be faster as most of the time, data will be available as PROCESS_LOCAL/NODE_LOCAL to the executor. Thanks Best Regards On Wed, Apr

Re: The Processing loading of Spark streaming on YARN is not in balance

2015-04-30 Thread Lin Hao Xu
It seems that the data size is only 2.9MB, far less than the default rdd size. How about putting more data into Kafka? And what about the number of topic partitions from Kafka? Best regards, Lin Hao XU IBM Research China Email: xulin...@cn.ibm.com My Flickr:

Creating StructType with DataFrame.withColumn

2015-04-30 Thread Justin Yip
Hello, I would like to add a StructType column to a DataFrame. What would be the best way to do it? Not sure if it is possible using withColumn. A possible way is to convert the dataframe into an RDD[Row], add the struct and then convert it back to a dataframe. But that seems like overkill. I guess I may have
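For reference, a small sketch of the withColumn route, assuming a newer release (the struct function appeared after 1.3, in Spark 1.4) and an illustrative two-column DataFrame; it avoids the RDD[Row] round trip:

    import org.apache.spark.sql.functions.struct

    // hypothetical DataFrame; sqlContext is the usual spark-shell SQLContext
    val df = sqlContext.createDataFrame(Seq((1, "x"), (2, "y"))).toDF("a", "b")
    val withStruct = df.withColumn("ab_struct", struct(df("a"), df("b")))
    withStruct.printSchema()   // ab_struct shows up as a struct<a:int,b:string> column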

Re: rdd.count with 100 elements taking 1 second to run

2015-04-30 Thread Anshul Singhle
Hi Akhil, I discovered the reason for this problem. There was an issue with my deployment (Google Cloud Platform): my Spark master was in a different region than the slaves. This resulted in huge scheduler delays. Thanks, Anshul On Thu, Apr 30, 2015 at 1:39 PM, Akhil Das

Re: Spark - Timeout Issues - OutOfMemoryError

2015-04-30 Thread Akhil Das
You could try increasing your heap space explicitly, e.g. export _JAVA_OPTIONS=-Xmx10g; it's not the correct approach, but try it. Thanks Best Regards On Tue, Apr 28, 2015 at 10:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have a SparkApp that completes in 45 mins for 5 files (5*750MB
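If it helps, a sketch of the more conventional knobs (values illustrative); note spark.driver.memory only takes effect when set before the driver JVM starts, e.g. via spark-submit:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("my-app")
      .set("spark.executor.memory", "10g")   // executor JVM heap
      .set("spark.driver.memory", "10g")     // driver heap; prefer --driver-memory on spark-submit
    val sc = new org.apache.spark.SparkContext(conf)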

Re: How to install spark in spark on yarn mode

2015-04-30 Thread xiaohe lan
Hi Madhvi, If I only install spark on one node, and use spark-submit to run an application, which are the worker nodes? And where are the executors? Thanks, Xiaohe On Thu, Apr 30, 2015 at 12:52 PM, madhvi madhvi.gu...@orkash.com wrote: Hi, Follow the instructions to install on the following

Re: rdd.count with 100 elements taking 1 second to run

2015-04-30 Thread Akhil Das
Does this speed up? val rdd = sc.parallelize(1 to 100, 30); rdd.count Thanks Best Regards On Wed, Apr 29, 2015 at 1:47 AM, Anshul Singhle ans...@betaglide.com wrote: Hi, I'm running the following code in my cluster (standalone mode) via spark shell - val rdd = sc.parallelize(1 to

Re: How to install spark in spark on yarn mode

2015-04-30 Thread madhvi
Hi, you have to specify the worker nodes of the spark cluster at the time of configurations of the cluster. Thanks Madhvi On Thursday 30 April 2015 01:30 PM, xiaohe lan wrote: Hi Madhvi, If I only install spark on one node, and use spark-submit to run an application, which are the Worker

The Processing loading of Spark streaming on YARN is not in balance

2015-04-30 Thread Kyle Lin
Hi all My environment info Hadoop release version: HDP 2.1 Kafka: 0.8.1.2.1.4.0 Spark: 1.1.0 My question: I ran Spark streaming program on YARN. My Spark streaming program will read data from Kafka and do some processing. But, I found there is always only ONE executor under processing. As

Re: The Processing loading of Spark streaming on YARN is not in balance

2015-04-30 Thread Kyle Lin
Hello Lin Hao Thanks for your reply. I will try to produce more data into Kafka. I run three Kafka brokers. Following is my topic info. Topic:kyle_test_topic PartitionCount:3 ReplicationFactor:2 Configs: Topic: kyle_test_topic Partition: 0 Leader: 3 Replicas: 3,4 Isr: 3,4 Topic: kyle_test_topic

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-30 Thread Akhil Das
Have a look at KafkaRDD https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kafka/KafkaRDD.html Thanks Best Regards On Wed, Apr 29, 2015 at 10:04 AM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, I'm wondering about the use-case where you're not doing continuous,
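A rough sketch of the batch-style read that KafkaRDD enables (via KafkaUtils.createRDD): consume a fixed offset range once and let the job finish. Broker, topic, offsets and the output path are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val offsetRanges = Array(OffsetRange("my_topic", 0, fromOffset = 0L, untilOffset = 100000L))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    rdd.map(_._2).saveAsTextFile("hdfs:///tmp/kafka-dump")   // process once, then terminate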

RE: HOw can I merge multiple DataFrame and remove duplicated key

2015-04-30 Thread ayan guha
1. Do a group by and get Max. In your example: select id, Max(DT) from t group by id. Name this j. 2. Join t, j on id and DT=mxdt. This is how we used to query RDBMS before window functions showed up. As I understand from SQL, group by allows you to do sum(), average(), max(), min(). But how do I

Re: How to run customized Spark on EC2?

2015-04-30 Thread Akhil Das
This is how i used to do it: - Login to the ec2 cluster (master) - Make changes to the spark, and build it. - Stop the old installation of spark (sbin/stop-all.sh) - Copy old installation conf/* to modified version's conf/ - Rsync modified version to all slaves - do sbin/start-all.sh from the

Re: Starting httpd: http: Syntax error on line 154

2015-04-30 Thread Nelson Batalha
Same here (using vanilla spark-ec2), I believe this started when I moved from 1.3.0 to 1.3.1. Only thing unusual with the setup is a few extra security parameters on EC2 but was the same as in 1.3.0. Did you solve this problem? On Thu, Apr 2, 2015 at 8:02 AM, Ganon Pierce ganon.pie...@me.com

Re: How to run self-build spark on EC2?

2015-04-30 Thread Akhil Das
You can replace your clusters(on master and workers) assembly jar with your custom build assembly jar. Thanks Best Regards On Tue, Apr 28, 2015 at 9:45 PM, Bo Fu b...@uchicago.edu wrote: Hi all, I have an issue. I added some timestamps in Spark source code and built it using: mvn package

Re: External Application Run Status

2015-04-30 Thread Akhil Das
One way you could try would be, Inside the map, you can have a synchronized thread and you can block the map till the thread finishes up processing. Thanks Best Regards On Wed, Apr 29, 2015 at 9:38 AM, Nastooh Avessta (navesta) nave...@cisco.com wrote: Hi In a multi-node setup, I am

Enabling Event Log

2015-04-30 Thread James King
I'm unclear why I'm getting this exception. It seems to have realized that I want to enable Event Logging but is ignoring where I want it to log to, i.e. file:/opt/cb/tmp/spark-events, which does exist. spark-default.conf # Example: spark.master spark://master1:7077,master2:7077

Re: Spark pre-built for Hadoop 2.6

2015-04-30 Thread Sean Owen
Yes, there is now such a profile, though it is essentially redundant and doesn't configure things differently from 2.4, besides the Hadoop version of course. That is why it hadn't existed before, since the 2.4 profile covers 2.4+. People just kept filing bugs to add it, but the docs are correct: you don't

RE: Is SQLContext thread-safe?

2015-04-30 Thread Haopu Wang
Hi, in a test on SparkSQL 1.3.0, multiple threads are doing select on a same SQLContext instance, but below exception is thrown, so it looks like SQLContext is NOT thread safe? I think this is not the desired behavior. == java.lang.RuntimeException: [1.1] failure: ``insert'' expected but

Re: local directories for spark running on yarn

2015-04-30 Thread Christophe Préaud
No, you should read: if spark.local.dir is specified, spark.local.dir will be ignored. This has been reworded (hopefully for the best) in 1.3.1: https://spark.apache.org/docs/1.3.1/running-on-yarn.html Christophe. On 17/04/2015 18:18, shenyanls wrote: According to the documentation: The

Re: spark with standalone HBase

2015-04-30 Thread Saurabh Gupta
Now able to solve the issue by setting SparkConf sconf = new SparkConf().setAppName("App").setMaster("local") and conf.set("zookeeper.znode.parent", "/hbase-unsecure") Standalone hbase has a table 'test' hbase(main):001:0> scan 'test' ROW COLUMN+CELL row1

Re: The Processing loading of Spark streaming on YARN is not in balance

2015-04-30 Thread Kyle Lin
Hi all Producing more data into Kafka is not effective in my situation, because the speed of reading Kafka is consistent. I will adopt Saisai's suggestion to add more receivers. Kyle 2015-04-30 14:49 GMT+08:00 Saisai Shao sai.sai.s...@gmail.com: From the chart you pasted, I guess you only

Re: Is SQLContext thread-safe?

2015-04-30 Thread Wangfei (X)
Actually this is a SQL parse exception, are you sure your SQL is right? Sent from my iPhone. On Apr 30, 2015, at 18:50, Haopu Wang hw...@qilinsoft.com wrote: Hi, in a test on SparkSQL 1.3.0, multiple threads are doing select on a same SQLContext instance, but below exception is thrown, so it looks like

Expert advise needed. (POC is at crossroads)

2015-04-30 Thread ๏̯͡๏
I am at crossroads now and expert advice would help me decide what the next course of the project is going to be. Background: At our company we process tons of data to help build an experimentation platform. We fire more than 300 M/R jobs over petabytes of data; it takes 24 hours and does lots of joins. Its

Re: Compute pairwise distance

2015-04-30 Thread Driesprong, Fokko
Thank you guys for the input. Ayan, I am not sure how this can be done using reduceByKey, as far as I can see (but I am not so advanced with Spark), this requires a groupByKey which can be very costly. What would be nice to transform the dataset which contains all the vectors like: val
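One possible shape for the brute-force version without groupByKey, sketched with toy vectors (the RDD and its contents are stand-ins): index the vectors, take the cartesian product, and keep each unordered pair once:

    import org.apache.spark.mllib.linalg.Vectors

    val vectors = sc.parallelize(Seq(
      Vectors.dense(0.0, 1.0), Vectors.dense(2.0, 3.0), Vectors.dense(4.0, 5.0)))

    val indexed = vectors.zipWithIndex().map(_.swap)                      // (id, vector)
    val pairs = indexed.cartesian(indexed).filter { case ((i, _), (j, _)) => i < j }
    val distances = pairs.map { case ((i, v1), (j, v2)) =>
      ((i, j), math.sqrt(Vectors.sqdist(v1, v2)))                         // Euclidean distance
    }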

Re: Spark - Timeout Issues - OutOfMemoryError

2015-04-30 Thread ๏̯͡๏
Did not work. Same problem. On Thu, Apr 30, 2015 at 1:28 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You could try increasing your heap space explicitly. like export _JAVA_OPTIONS=-Xmx10g, its not the correct approach but try. Thanks Best Regards On Tue, Apr 28, 2015 at 10:35 PM,

Re: real time Query engine Spark-SQL on Hbase

2015-04-30 Thread Ted Yu
bq. a single query on one filter criteria Can you tell us more about your filter ? How selective is it ? Which hbase release are you using ? Cheers On Thu, Apr 30, 2015 at 7:23 AM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi, I want to use Spark as Query engine on HBase with

RE: is there anyway to enforce Spark to cache data in all worker nodes(almost equally) ?

2015-04-30 Thread Alex
482 MB should be small enough to be distributed as a set of broadcast variables. Then you can use local features of spark to process. -Original Message- From: shahab shahab.mok...@gmail.com Sent: ‎4/‎30/‎2015 9:42 AM To: user@spark.apache.org user@spark.apache.org Subject: is there

Re: SparkStream saveAsTextFiles()

2015-04-30 Thread anavidad
JavaDStream.saveAsTextFiles does not exist in the Spark Java API. If you want to persist every RDD in your JavaDStream you should do something like this: words.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() { @Override public Void call(JavaRDD<SparkFlumeEvent> rddWords, Time arg1) throws

Re: spark with standalone HBase

2015-04-30 Thread Ted Yu
The error indicates incompatible protobuf versions. Please take a look at 4.1.1 under http://hbase.apache.org/book.html#basic.prerequisites Cheers On Thu, Apr 30, 2015 at 3:49 AM, Saurabh Gupta saurabh.gu...@semusi.com wrote: Now able to solve the issue by setting SparkConf sconf = *new*

Error when saving as parquet to S3 from Spark

2015-04-30 Thread Cosmin Cătălin Sanda
After repartitioning a DataFrame in Spark 1.3.0 I get a .parquet exception when saving to Amazon's S3. The data that I try to write is 10G. logsForDate.repartition(10).saveAsParquetFile(destination) // <-- Exception here The exception I receive is: java.io.IOException: The file

RE: HOw can I merge multiple DataFrame and remove duplicated key

2015-04-30 Thread Wang, Ningjun (LNG-NPV)
I think this will be slow because you have to do a group by and then a join (my table has 7 million records). I am looking for something like reduceByKey(), e.g. rdd.reduceByKey((a, b) => if (a.timeStamp > b.timeStamp) a else b). Does it have a similar thing in DataFrame? Of course I can convert a
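A sketch of the RDD-side idiom being asked about; df stands in for the poster's 7-million-row DataFrame, and the Record case class and column names are illustrative. The result can be turned back into a DataFrame afterwards:

    case class Record(id: String, timeStamp: Long, payload: String)

    val latest = df.rdd
      .map(r => (r.getAs[String]("id"),
                 Record(r.getAs[String]("id"), r.getAs[Long]("timeStamp"), r.getAs[String]("payload"))))
      .reduceByKey((a, b) => if (a.timeStamp > b.timeStamp) a else b)   // keep the newest record per id
      .values

    import sqlContext.implicits._
    val latestDf = latest.toDF()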

is there anyway to enforce Spark to cache data in all worker nodes (almost equally) ?

2015-04-30 Thread shahab
Hi, I load data from Cassandra into Spark. The entire data is almost around 482 MB, and it is cached as TempTable in 7 tables. How can I enforce Spark to cache data in both worker nodes, not only in ONE worker (as in my case)? I am using spark 2.1.1 with spark-connector 1.2.0-rc3. I have small

RE: HBase HTable constructor hangs

2015-04-30 Thread Tridib Samanta
You are right. After I moved from HBase 0.98.1 to 1.0.0 this problem got solved. Thanks all! Date: Wed, 29 Apr 2015 06:58:59 -0700 Subject: Re: HBase HTable constructor hangs From: yuzhih...@gmail.com To: tridib.sama...@live.com CC: d...@ocirs.com; user@spark.apache.org Can you verify whether

Re: Is SQLContext thread-safe?

2015-04-30 Thread Michael Armbrust
Unfortunately, I think the SQLParser is not thread-safe. I would recommend using HiveQL. On Thu, Apr 30, 2015 at 4:07 AM, Wangfei (X) wangf...@huawei.com wrote: actually this is a sql parse exception, are you sure your sql is right? Sent from my iPhone. On Apr 30, 2015, at 18:50, Haopu Wang

Re: JavaRDDListTuple2 flatMap Lexicographical Permutations - Java Heap Error

2015-04-30 Thread Dan DeCapria, CivicScience
Thought about it some more, and simplified the problem space for discussions: Given: JavaPairRDD<String, Integer> c1; // c1.count() == 8000. Goal: JavaPairRDD<Tuple2<String,Integer>, Tuple2<String,Integer>> c2; // all lexicographical pairs Where: all lexicographic permutations on c1 ::

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
The data ingestion is in outermost portion in foreachRDD block. Although now I close the statement of jdbc, the same exception happened again. It seems it is not related to the data ingestion part. On Wed, Apr 29, 2015 at 8:35 PM, Cody Koeninger c...@koeninger.org wrote: Use lsof to see what

Re: Timeout Error

2015-04-30 Thread ๏̯͡๏
I am facing same issue, do you have any solution ? On Mon, Apr 27, 2015 at 9:43 PM, Deepak Gopalakrishnan dgk...@gmail.com wrote: Hello All, I dug a little deeper and found this error : 15/04/27 16:05:39 WARN TransportChannelHandler: Exception in connection from /10.1.0.90:40590

Re: Spark - Timeout Issues - OutOfMemoryError

2015-04-30 Thread ๏̯͡๏
Full Exception:
15/04/30 09:59:49 INFO scheduler.DAGScheduler: Stage 1 (collectAsMap at VISummaryDataProvider.scala:37) failed in 884.087 s
15/04/30 09:59:49 INFO scheduler.DAGScheduler: Job 0 failed: collectAsMap at VISummaryDataProvider.scala:37, took 1093.418249 s
15/04/30 09:59:49 ERROR

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Cody Koeninger
Did you use lsof to see what files were opened during the job? On Thu, Apr 30, 2015 at 1:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote: The data ingestion is in outermost portion in foreachRDD block. Although now I close the statement of jdbc, the same exception happened again. It seems it

sap hana database laod using jdbcRDD

2015-04-30 Thread Hafiz Mujadid
Hi! Can we load a HANA database table using Spark's JdbcRDD? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sap-hana-database-laod-using-jdbcRDD-tp22726.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: JavaRDDListTuple2 flatMap Lexicographical Permutations - Java Heap Error

2015-04-30 Thread Sean Owen
You fundamentally want (half of) the Cartesian product so I don't think it gets a lot faster to form this. You could implement this on cogroup directly and maybe avoid forming the tuples you will filter out. I'd think more about whether you really need to do this thing, or whether there is
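A minimal sketch of that "half Cartesian" idea on toy data, keeping only pairs whose first key sorts strictly before the second so each unordered pair appears once:

    val c1 = sc.parallelize(Seq(("alpha", 1), ("beta", 2), ("gamma", 3)))
    val c2 = c1.cartesian(c1).filter { case ((k1, _), (k2, _)) => k1 < k2 }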

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
I terminated the old job and now start a new one. Currently, the Spark streaming job has been running for 2 hours and when I use lsof, I do not see many files related to the Spark job. BTW, I am running Spark streaming using local[2] mode. The batch is 5 seconds and it has around 50 lines to read

How to add a column to a spark RDD with many columns?

2015-04-30 Thread Carter
Hi all, I have an RDD with MANY columns (e.g., hundreds), how do I add one more column at the end of this RDD? For example, if my RDD is like below:

123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
..
29, 94,
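If each row is already held as a sequence of values, one sketch (column values illustrative) is simply a map that appends the new value, without ever naming the existing columns:

    val rows = sc.parallelize(Seq(Array(123, 523, 534), Array(536, 98, 1623)))
    val withExtra = rows.map(r => r :+ r.sum)   // append a derived value as the last column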

Re: real time Query engine Spark-SQL on Hbase

2015-04-30 Thread ayan guha
And if I may ask, how long does it take in the HBase CLI? I would not expect Spark to improve performance of HBase. At best Spark will push down the filter to HBase. So I would try to optimise any additional overhead like bringing data into Spark. On 1 May 2015 00:56, Ted Yu yuzhih...@gmail.com wrote:

Re: [SPAM] Customized Aggregation Query on Spark SQL

2015-04-30 Thread Zhan Zhang
One optimization is to reduce the shuffle by first aggregate locally (only keep the max for each name), and then reduceByKey. Thanks. Zhan Zhang On Apr 24, 2015, at 10:03 PM, ayan guha guha.a...@gmail.commailto:guha.a...@gmail.com wrote: Here you go t =
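Sketched on toy data (names and values illustrative): reduceByKey already combines on the map side, so keeping only the per-name maximum before the shuffle falls out naturally:

    val records = sc.parallelize(Seq(("alice", 10), ("bob", 7), ("alice", 25)))
    val maxPerName = records.reduceByKey(math.max(_, _))   // local max per partition, then shuffled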

Re: [SPAM] Customized Aggregation Query on Spark SQL

2015-04-30 Thread Wenlei Xie
Hi Zhan, How would this be achieved? Should the data be partitioned by name in this case? Thank you! Best, Wenlei On Thu, Apr 30, 2015 at 7:55 PM, Zhan Zhang zzh...@hortonworks.com wrote: One optimization is to reduce the shuffle by first aggregate locally (only keep the max for each

Re: Expert advise needed. (POC is at crossroads)

2015-04-30 Thread ๏̯͡๏
1) As I am limited to 12G and I was doing a broadcast join (collect data and then publish), it was throwing OOM. The data size was 25G and the limit was 12G, hence I reverted back to a regular join. 2) I started using repartitioning; I started with 100 and am now trying 200. At the beginning it looked
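For reference, a hedged sketch of the map-side (broadcast) join shape being described; it only works when the collected side genuinely fits in memory, and all names and values here are illustrative:

    val small = sc.parallelize(Seq(("k1", "dim1")))              // the side that fits in memory
    val large = sc.parallelize(Seq(("k1", 42), ("k2", 7)))

    val smallMap = sc.broadcast(small.collectAsMap())
    val joined = large.mapPartitions { iter =>
      iter.flatMap { case (k, v) => smallMap.value.get(k).map(dim => (k, (v, dim))) }
    }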

Re: casting timestamp into long fail in Spark 1.3.1

2015-04-30 Thread Michael Armbrust
This looks like a bug. Mind opening a JIRA? On Thu, Apr 30, 2015 at 3:49 PM, Justin Yip yipjus...@prediction.io wrote: After some trial and error, using DataType solves the problem: df.withColumn("millis", $"eventTime".cast(org.apache.spark.sql.types.LongType) * 1000) Justin On Thu, Apr

Re: DataFrame filter referencing error

2015-04-30 Thread Burak Yavuz
Is new a reserved word for MySQL? On Thu, Apr 30, 2015 at 2:41 PM, Francesco Bigarella francesco.bigare...@gmail.com wrote: Do you know how I can check that? I googled a bit but couldn't find a clear explanation about it. I also tried to use explain() but it doesn't really help. I still

RE: Expert advise needed. (POC is at crossroads)

2015-04-30 Thread java8964
Really not expert here, but try the following ideas: 1) I assume you are using yarn, then this blog is very good about the resource tuning: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ 2) If 12G is a hard limit in this case, then you have no option but lower

Re: Expert advise needed. (POC is at crossroads)

2015-04-30 Thread Sandy Ryza
Hi Deepak, I wrote a couple posts with a bunch of different information about how to tune Spark jobs. The second one might be helpful with how to think about tuning the number of partitions and resources? What kind of OOMEs are you hitting?

Re: casting timestamp into long fail in Spark 1.3.1

2015-04-30 Thread Justin Yip
After some trial and error, using DataType solves the problem: df.withColumn("millis", $"eventTime".cast(org.apache.spark.sql.types.LongType) * 1000) Justin On Thu, Apr 30, 2015 at 3:41 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I was able to cast a timestamp into long using

Re: How to install spark in spark on yarn mode

2015-04-30 Thread Shixiong Zhu
You don't need to install Spark. Just download or build a Spark package that matches your Yarn version. And ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. See instructions here:

casting timestamp into long fail in Spark 1.3.1

2015-04-30 Thread Justin Yip
Hello, I was able to cast a timestamp into long using df.withColumn("millis", $"eventTime".cast("long") * 1000) in Spark 1.3.0. However, this statement returns a failure with Spark 1.3.1. I got the following exception: Exception in thread "main" org.apache.spark.sql.types.DataTypeException: Unsupported

Spark Streaming Kafka Avro NPE on deserialization of payload

2015-04-30 Thread Todd Nist
I’m very perplexed by the following. I have a set of Avro-generated objects that are sent to a SparkStreaming job via Kafka. The SparkStreaming job follows the receiver-based approach. I am encountering the below error when I attempt to deserialize the payload: 15/04/30 17:49:25 INFO

Re: Enabling Event Log

2015-04-30 Thread Shixiong Zhu
spark.history.fs.logDirectory is for the history server. For Spark applications, they should use spark.eventLog.dir. Since you commented out spark.eventLog.dir, it will be /tmp/spark-events. And this folder does not exist. Best Regards, Shixiong Zhu 2015-04-29 23:22 GMT-07:00 James King
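A sketch of the spark-defaults.conf entries that would line up with the path quoted in the original message (the directory comes from that message; the layout is otherwise typical, not confirmed by the thread):

    spark.eventLog.enabled           true
    spark.eventLog.dir               file:/opt/cb/tmp/spark-events
    # read by the history server only, not by running applications:
    spark.history.fs.logDirectory    file:/opt/cb/tmp/spark-events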

Re: DataFrame filter referencing error

2015-04-30 Thread Francesco Bigarella
Do you know how I can check that? I googled a bit but couldn't find a clear explanation about it. I also tried to use explain() but it doesn't really help. I still find unusual that I have this issue only for the equality operator but not for the others. Thank you, F On Wed, Apr 29, 2015 at 3:03

Re: real time Query engine Spark-SQL on Hbase

2015-04-30 Thread Corey Nolet
A tad off topic, but could still be relevant. Accumulo's design is a tad different in the realm of being able to shard and perform set intersections/unions server-side (through seeks). I've got an adapter for Spark SQL on top of a document store implementation in Accumulo that accepts the

Re: DataFrame filter referencing error

2015-04-30 Thread ayan guha
I think you need to specify new in single quotes. My guess is the query showing up in the DB is like ...where status=new or ...where status=new. In either case MySQL assumes new is a column. What you need is the form below: ...where status='new'. You need to provide your quotes accordingly. Easiest way
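In DataFrame terms the two safe spellings look roughly like this (df and the column name are illustrative); the Column form sidesteps manual quoting entirely:

    val matched1 = df.filter("status = 'new'")           // SQL-expression form, literal quoted
    val matched2 = df.filter(df("status") === "new")     // Column form, quoting handled for you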

how to pass configuration properties from driver to executor?

2015-04-30 Thread Tian Zhang
Hi, We have a scenario as below and would like your suggestion. We have an app.conf file with propX=A as the default, built into the fat jar file that is provided to spark-submit. We have an env.conf file with propX=B that we would like spark-submit to take as input to overwrite the default and populate to
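One possible sketch, using the Typesafe Config library to layer env.conf over the packaged app.conf; the merge recipe is an assumption rather than something from the thread, and it presumes env.conf is shipped with spark-submit --files env.conf:

    import java.io.File
    import com.typesafe.config.ConfigFactory

    val defaults  = ConfigFactory.load("app")                       // app.conf from the fat jar (propX = A)
    val overrides = ConfigFactory.parseFile(new File("env.conf"))   // env.conf shipped at submit time (propX = B)
    val merged    = overrides.withFallback(defaults)

    val propX = merged.getString("propX")                           // "B" when env.conf is present
    val propXOnExecutors = sc.broadcast(propX)                      // make it visible inside tasks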

JavaRDDListTuple2 flatMap Lexicographical Permutations - Java Heap Error

2015-04-30 Thread Dan DeCapria, CivicScience
I am trying to generate all (N-1)(N)/2 lexicographical 2-tuples from a glom() JavaPairRDD<List<Tuple2>>. The construction of these initial Tuple2's JavaPairRDD<AQ,Integer> space is well formed from case classes I provide it (AQ, AQV, AQQ, CT) and is performant; minimized code: SparkConf conf = new

[Runing Spark Applications with Chronos or Marathon]

2015-04-30 Thread Aram Mkrtchyan
Hi, We want to have Marathon starting and monitoring Chronos, so that when Chronos based Spark job fails, marathon automatically restarts them in scope of Chronos. Will this approach work if we start Spark jobs as shell scripts from Chronos or Marathon?

real time Query engine Spark-SQL on Hbase

2015-04-30 Thread Siddharth Ubale
Hi, I want to use Spark as a query engine on HBase with sub-second latency. I am using Spark 1.3. I followed the steps below on an HBase table with around 3.5 lakh (350,000) rows: 1. Mapped the DataFrame to the HBase table. RDDCustomers maps to the hbase table which is used to create the
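For context, the query side of such a setup usually looks like the sketch below; customersDf stands in for the DataFrame built from the HBase scan (faked here from a local Seq), and the schema and filter are illustrative:

    case class Customer(id: Int, name: String, city: String)
    val customersDf = sqlContext.createDataFrame(Seq(
      Customer(1, "Ann", "London"), Customer(2, "Bob", "Pune")))

    customersDf.registerTempTable("customers")
    val filtered = sqlContext.sql("SELECT id, name FROM customers WHERE city = 'London'")
    filtered.show()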

RE: External Application Run Status

2015-04-30 Thread Nastooh Avessta (navesta)
Thanks. Sounds reasonable, perhaps blocking on a local fifo or a socket would do. Cheers, Nastooh Avessta ENGINEER.SOFTWARE ENGINEERING nave...@cisco.com Phone: +1 604 647 1527 Cisco Systems Limited 595 Burrard Street, Suite 2123

Error when saving as parquet to S3

2015-04-30 Thread cosmincatalin
After repartitioning a DataFrame in Spark 1.3.0 I get a .parquet exception when saving to Amazon's S3. The data that I try to write is 10G. logsForDate.repartition(10).saveAsParquetFile(destination) // <-- Exception here The exception I receive is: java.io.IOException: The file being

Re: is there anyway to enforce Spark to cache data in all worker nodes(almost equally) ?

2015-04-30 Thread shahab
Thanks Alex, but 482MB was just example size, and I am looking for generic approach doing this without broadcasting, any idea? best, /Shahab On Thu, Apr 30, 2015 at 4:21 PM, Alex lxv...@gmail.com wrote: 482 MB should be small enough to be distributed as a set of broadcast variables. Then
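A generic sketch of the usual workaround: spread the data over at least as many partitions as there are cores before caching, so blocks land on every executor rather than only on the node that held the source partitions (the RDD here is a stand-in for the Cassandra-backed one):

    val rdd = sc.parallelize(1 to 1000000)                        // stand-in for the Cassandra RDD
    val spread = rdd.repartition(sc.defaultParallelism).cache()
    spread.count()                                                // materialise the cache on all executors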

Re: spark with standalone HBase

2015-04-30 Thread Akshat Aranya
Looking at your classpath, it looks like you've compiled Spark yourself. Depending on which version of Hadoop you've compiled against (looks like it's Hadoop 2.2 in your case), Spark will have its own version of protobuf. You should try by making sure both your HBase and Spark are compiled