Re: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

2015-05-24 Thread Mark Hamstra
This discussion belongs on the dev list. Please post any replies there. On Sat, May 23, 2015 at 10:19 PM, Cheolsoo Park piaozhe...@gmail.com wrote: Hi, I've been testing SparkSQL in 1.4 rc and found two issues. I wanted to confirm whether these are bugs or not before opening a jira. *1)*

how to distributed run a bash shell in spark

2015-05-24 Thread luohui20001
Hello there, I am trying to run an app, part of which needs to run a shell script. How can I run a shell script distributed across a Spark cluster? Thanks. Here's my code: import java.io.IOException; import java.util.ArrayList; import java.util.List; import org.apache.spark.SparkConf; import

Spark dramatically slow when I add saveAsTextFile

2015-05-24 Thread allanjie
*Problem Description*: The program runs in a standalone Spark cluster (1 master, 6 workers with 8g RAM and 2 cores). Input: a 468MB file with 133433 records stored in HDFS. Output: just a 2MB file will be stored in HDFS. The program has two map operations and one reduceByKey operation. Finally I

Re: Strange ClassNotFound exeption

2015-05-24 Thread Ted Yu
Can you pastebin the class path? Thanks On May 24, 2015, at 5:02 AM, boci boci.b...@gmail.com wrote: Yeah, I have the same jar with the same result. I run in a docker container, and I am using the same docker container with my other project... the only difference is the postgresql jdbc driver and the

Re: Strange ClassNotFound exeption

2015-05-24 Thread boci
Yeah, I have the same jar with the same result. I run in a docker container, and I am using the same docker container with my other project... the only difference is the postgresql jdbc driver and the custom RDD... no additional dependencies (both single jar generated with same assembly configuration with same

Re: Spark Streaming - Design considerations/Knobs

2015-05-24 Thread Maiti, Samya
Really good list to brush up basics. Just one input, regarding * An RDD's processing is scheduled by driver's jobscheduler as a job. At a given point of time only one job is active. So, if one job is executing the other jobs are queued. We can have multiple jobs running in a given

Re: Spark dramatically slow when I add saveAsTextFile

2015-05-24 Thread Joe Wass
This may sound like an obvious question, but are you sure that the program is doing any work when you don't have a saveAsTextFile? If there are transformations but no actions to actually collect the data, there's no need for Spark to execute the transformations. As to the question of 'is this
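The laziness described in this reply can be illustrated without a cluster. The sketch below is only an analogy, not Spark code: it uses `java.util.stream`, whose intermediate operations (like Spark transformations) do nothing until a terminal operation (the counterpart of a Spark action such as saveAsTextFile) is invoked. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: Java streams stand in for RDDs; names here are hypothetical.
final class LazyEvaluationSketch {

    // Declares a map step but never runs a terminal operation.
    // Returns how many times the mapping function was actually called.
    static int callsWithoutAction(List<Integer> input) {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> mapped = input.stream().map(x -> {
            calls.incrementAndGet();
            return x * 2;
        });
        // 'mapped' is only a description of work; nothing has executed yet.
        return calls.get();
    }

    // Same map step, followed by a terminal operation that forces execution.
    static int callsWithAction(List<Integer> input) {
        AtomicInteger calls = new AtomicInteger();
        List<Integer> doubled = input.stream().map(x -> {
            calls.incrementAndGet();
            return x * 2;
        }).collect(Collectors.toList());
        return calls.get();
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3);
        System.out.println(callsWithoutAction(input)); // prints 0
        System.out.println(callsWithAction(input));    // prints 3
    }
}
```

By the same reasoning, a Spark job timed without any action may be measuring almost no work, so comparing it against a run that ends in saveAsTextFile can be misleading.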

Re: how to distributed run a bash shell in spark

2015-05-24 Thread Akhil Das
You mean you want to execute some shell commands from Spark? Here's something I tried a while back: https://github.com/akhld/spark-exploit Thanks, Best Regards On Sun, May 24, 2015 at 4:53 PM, luohui20...@sina.com wrote: hello there I am trying to run a app in which part of it needs to
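As a rough illustration of what running a shell command over records involves, here is a minimal plain-Java sketch. It is not taken from the linked repository: the class and method names are hypothetical, and it assumes a POSIX `sh` is available and that the input is small enough to fit in the OS pipe buffer. In a Spark job, logic like this would run inside a task on each partition; Spark's RDD API also offers a built-in `pipe` operation that streams a partition through an external command, which may be the simpler route.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: feeds records to a shell command, one per line,
// and returns the command's output lines.
final class ShellPipeSketch {

    static List<String> pipeThroughShell(List<String> records, String command) {
        try {
            // Assumption: 'sh' exists on every worker node.
            Process process = new ProcessBuilder("sh", "-c", command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();

            // Write all records, then close stdin so the command sees EOF.
            // Assumption: small input; large inputs need a separate writer thread.
            try (BufferedWriter stdin = new BufferedWriter(new OutputStreamWriter(
                    process.getOutputStream(), StandardCharsets.UTF_8))) {
                for (String record : records) {
                    stdin.write(record);
                    stdin.newLine();
                }
            }

            // Collect everything the command printed.
            List<String> output = new ArrayList<>();
            try (BufferedReader stdout = new BufferedReader(new InputStreamReader(
                    process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = stdout.readLine()) != null) {
                    output.add(line);
                }
            }

            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IllegalStateException("command exited with code " + exitCode);
            }
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for command", e);
        }
    }

    public static void main(String[] args) {
        // Upper-cases each record with the standard 'tr' utility.
        System.out.println(pipeThroughShell(Arrays.asList("spark", "hive"), "tr a-z A-Z"));
    }
}
```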

Re: Trying to connect to many topics with several DirectConnect

2015-05-24 Thread Akhil Das
I used to hit an NPE when I didn't add all the dependency jars to my context while running in standalone mode. Can you try adding all these dependencies to your context? sc.addJar("/home/akhld/.ivy2/cache/org.apache.spark/spark-streaming-kafka_2.10/jars/spark-streaming-kafka_2.10-1.3.1.jar")

Help optimizing some spark code

2015-05-24 Thread Tal
Hi, I'm running this piece of code in my program: smallRdd.join(largeRdd) .groupBy { case (id, (_, X(a, _, _))) => a } .map { case (a, iterable) => a -> iterable.size } .sortBy({ case (_, count) => count }, ascending = false) .take(k) where basically smallRdd is an rdd

Powered by Spark listing

2015-05-24 Thread Michael Roberts
Information Innovators, Inc. http://www.iiinfo.com/ Spark, Spark Streaming, Spark SQL, MLlib Developing data analytics systems for federal healthcare, national defense and other programs using Spark on YARN. -- This page tracks the users of Spark. To add yourself to the list, please email

Re: How to use zookeeper in Spark Streaming

2015-05-24 Thread Ted Yu
I think the Zookeeper watcher code should reside in task code. I haven't found a guide on this subject so far. Cheers On Sun, May 24, 2015 at 7:15 PM, bit1...@163.com bit1...@163.com wrote: Can someone please help me on this? -- bit1...@163.com *From:*

Reply: Re: how to distributed run a bash shell in spark

2015-05-24 Thread luohui20001
Thanks Akhil, your code is a big help to me, because a Perl script is exactly the thing I want to try to run in Spark. I will give it a try. Thanks & Best regards! San.Luo - Original Message - From: Akhil Das ak...@sigmoidanalytics.com To: 罗辉

Using Spark like a search engine

2015-05-24 Thread Сергей Мелехин
Hi! We are developing a scoring system for recruitment. A recruiter enters vacancy requirements, and we score tens of thousands of CVs against these requirements and return e.g. the top 10 matches. We do not use full-text search and sometimes don't even filter input CVs prior to scoring (some vacancies do not

Reply: How to use zookeeper in Spark Streaming

2015-05-24 Thread bit1...@163.com
Can someone please help me on this? bit1...@163.com From: bit1...@163.com Sent: 2015-05-24 13:53 To: user Subject: How to use zookeeper in Spark Streaming Hi, In my Spark Streaming application, when the application starts and gets running, the tasks running on the worker nodes need to be

RE: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

2015-05-24 Thread Cheng, Hao
Thanks for reporting this. We intend to support multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, but you're probably hitting a bug; please file a JIRA issue for this. I will also keep investigating. Hao From: Mark Hamstra

Re: Spark Streaming - Design considerations/Knobs

2015-05-24 Thread Tathagata Das
Blocks are replicated immediately, before the driver launches any jobs using them. On Thu, May 21, 2015 at 2:05 AM, Hemant Bhanawat hemant9...@gmail.com wrote: Honestly, given the length of my email, I didn't expect a reply. :-) Thanks for reading and replying. However, I have a follow-up