Cassandra time series + Spark

2015-03-23 Thread Rumph, Frens Jan
Hi, I'm working on a system which has to deal with time series data. I've been happy using Cassandra for time series and Spark looks promising as a computational platform. I consider chunking time series in Cassandra necessary, e.g. by 3 weeks as kairosdb does it. This allows an 8 byte chunk star

Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-23 Thread Xi Shen
I did not build my own Spark. I got the binary version online. If it can load the native libs from the IDE, it should also be able to load them when running with "--master local". On Mon, 23 Mar 2015 07:15 Burak Yavuz wrote: > Did you build Spark with: -Pnetlib-lgpl? > > Ref: https://spark.apache.

log files of failed task

2015-03-23 Thread sergunok
Hi, I executed a task on Spark in YARN and it failed. I see just an "executor lost" message from YARNClientScheduler, with no further details... (I read this error can be connected to the spark.yarn.executor.memoryOverhead setting and have already played with this param.) How can I go more deeply into details in the log file

Re: updateStateByKey performance & API

2015-03-23 Thread Andre Schumacher
Hi Nikos, We experienced something similar in our setting, where the Spark app was supposed to write the final state changes to a Redis instance. Over time the delay caused by re-writing the entire dataset in each iteration exceeded the Spark Streaming batch size. In our case the solution was to

Spark UI tunneling

2015-03-23 Thread sergunok
Is there a way to tunnel the Spark UI? I tried to tunnel client-node:4040 but my browser was redirected from localhost to some cluster-locally-visible domain name... Maybe there is some startup option to make the Spark UI fully accessible through a single endpoint (address:port)? Serg. -- Vi

Spark Sql with python udf fail

2015-03-23 Thread lonely Feb
Hi all, I tried to transfer some Hive jobs into Spark SQL. When I ran a SQL job with a Python UDF I got an exception: java.lang.ArrayIndexOutOfBoundsException: 9 at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142) at org.apache.spark.sql.catalyst.expressions.B

Re: Spark UI tunneling

2015-03-23 Thread Akhil Das
Did you try ssh -L 4040:127.0.0.1:4040 user@host Thanks Best Regards On Mon, Mar 23, 2015 at 1:12 PM, sergunok wrote: > Is it a way to tunnel Spark UI? > > I tried to tunnel client-node:4040 but my browser was redirected from > localhost to some cluster locally visible domain name.. > > Maybe

Re: log files of failed task

2015-03-23 Thread Emre Sevinc
Hello Sergun, Generally you can use yarn application -list to see the IDs of applications, and then you can see the logs of finished applications using: yarn logs -applicationId Hope this helps. -- Emre Sevinç http://www.bigindustries.be/ On Mon, Mar 23, 2015 at 8:23 AM, sergunok wrot

Re: Spark streaming alerting

2015-03-23 Thread Jeffrey Jedele
What exactly do you mean by "alerts"? Something specific to your data or general events of the spark cluster? For the first, sth like Akhil suggested should work. For the latter, I would suggest having a log consolidation system like logstash in place and use this to generate alerts. Regards, Jef

Re: Spark Sql with python udf fail

2015-03-23 Thread Cheng Lian
Could you elaborate on the UDF code? On 3/23/15 3:43 PM, lonely Feb wrote: Hi all, I tried to transfer some hive jobs into spark-sql. When i ran a sql job with python udf i got a exception: java.lang.ArrayIndexOutOfBoundsException: 9 at org.apache.spark.sql.catalyst.expressions.Generi

Re: Spark UI tunneling

2015-03-23 Thread Sergey Gerasimov
Akhil, that's what I did. The problem is that probably the web server tried to forward my request to another address accessible locally only. > On 23 March 2015, at 11:12, Akhil Das wrote: > > Did you try ssh -L 4040:127.0.0.1:4040 user@host > > Thanks > Best Regards > >> On Mon, Mar 23,

Re: Spark Sql with python udf fail

2015-03-23 Thread lonely Feb
I caught exceptions in the Python UDF code, flushed the exceptions into a single file, and made sure the column count of the output lines is the same as the SQL schema. Something interesting is that my output line from the UDF code has just 10 columns, and the exception above is java.lang.ArrayIndexOutOfBoundsExce

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-23 Thread Night Wolf
Was a solution ever found for this? Trying to run some test cases with sbt test which use Spark SQL, on the Spark 1.3.0 release with Scala 2.11.6, I get this error. Setting fork := true in sbt seems to work but it's a less than ideal workaround. On Tue, Mar 17, 2015 at 9:37 PM, Eric Charles wrote:

Re: Spark UI tunneling

2015-03-23 Thread Akhil Das
Oh in that case you could try adding the hostname in your /etc/hosts under your localhost. Also make sure there is a request going to another host by inspecting the network calls. Thanks Best Regards On Mon, Mar 23, 2015 at 1:55 PM, Sergey Gerasimov wrote: > Akhil, > >

Data/File structure Validation

2015-03-23 Thread Ahmed Nawar
Dears, Is there any way to validate CSV, JSON, etc. files while loading them into a DataFrame? I need to ignore corrupted rows (rows not matching the schema). Thanks, Ahmed Nawwar

Re: Spark Sql with python udf fail

2015-03-23 Thread Cheng Lian
I suspect there is a malformed row in your input dataset. Could you try something like this to confirm: sql("SELECT * FROM ").foreach(println) If there does exist a malformed line, you should see a similar exception, and you can catch it with the help of the output. Notice that the messages

Re: Spark Sql with python udf fail

2015-03-23 Thread lonely Feb
ok i'll try asap 2015-03-23 17:00 GMT+08:00 Cheng Lian : > I suspect there is a malformed row in your input dataset. Could you try > something like this to confirm: > > sql("SELECT * FROM ").foreach(println) > > If there does exist a malformed line, you should see similar exception. > And you ca

Spark SQL udf(ScalaUdf) is very slow

2015-03-23 Thread zzcclp
My test env: 1. Spark version is 1.3.0; 2. 3 nodes, 80G/20C per node; 3. read 250G parquet files from HDFS. Test case: 1. register "floor" func with the command sqlContext.udf.register("floor", (ts: Int) => ts - ts % 300), then run with sql "select chan, floor(ts) as tt, sum(size) from qlogbase3 group by chan, fl
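For reference, a minimal sketch of the registration and query described above (Spark 1.3-era Scala API; the end of the GROUP BY clause is completed from the truncated snippet, so treat it as an assumption):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    sqlContext.udf.register("floor", (ts: Int) => ts - ts % 300)
    val result = sqlContext.sql(
      "select chan, floor(ts) as tt, sum(size) from qlogbase3 group by chan, floor(ts)")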

Re: Data/File structure Validation

2015-03-23 Thread Taotao.Li
can it load successfully if the format is invalid? - Original Message - From: "Ahmed Nawar" To: user@spark.apache.org Sent: Monday, March 23, 2015, 4:48:54 PM Subject: Data/File structure Validation Dears, Is there any way to validate the CSV, Json ... Files while loading to DataFrame. I need to ign

Re: Spark streaming alerting

2015-03-23 Thread Khanderao Kand Gmail
Akhil, You are right in your answer to what Mohit wrote. However, what Mohit seems to be alluding to, but did not write properly, might be different. Mohit, You are wrong in saying "generally" streaming works with HDFS and Cassandra. Streaming typically works with a streaming or queuing source like Kafka

Re: Data/File structure Validation

2015-03-23 Thread Ahmed Nawar
Dear Taotao, Yes, I tried sparkCSV. Thanks, Nawwar On Mon, Mar 23, 2015 at 12:20 PM, Taotao.Li wrote: > can it load successfully if the format is invalid? > > -- > *From: *"Ahmed Nawar" > *To: *user@spark.apache.org > *Sent: *Monday, March 23, 2015, 4:48:54 PM > *

Use pig load function in spark

2015-03-23 Thread Dai, Kevin
Hi, all Can spark use pig's load function to load data? Best Regards, Kevin.

Re: Spark Sql with python udf fail

2015-03-23 Thread lonely Feb
sql("SELECT * FROM ").foreach(println) can be executed successfully. So the problem may still be in UDF code. How can i print the the line with ArrayIndexOutOfBoundsException in catalyst? 2015-03-23 17:04 GMT+08:00 lonely Feb : > ok i'll try asap > > 2015-03-23 17:00 GMT+08:00 Cheng Lian : > >>

Re: How Does aggregate work

2015-03-23 Thread Paweł Szulc
It is actually the number of cores. If your processor has hyperthreading then it will be more (the number of processors your OS sees). On Sun, 22 Mar 2015 at 4:51 PM, Ted Yu wrote: > I assume spark.default.parallelism is 4 in the VM Ashish was using. > > Cheers >

Re: Data/File structure Validation

2015-03-23 Thread Ahmed Nawar
Dear Raunak, The source system provided logs with some errors. I need to make sure each row is in the correct format (the number of columns/attributes and the data types are correct) and move incorrect rows to a separate list. Of course I can implement my own logic, but I need to make sure there is no direct way. Thanks
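A minimal sketch (not from the thread) of doing that split by hand for CSV input, assuming a comma delimiter and a known column count; the path and column count are made up:

    val lines = sc.textFile("hdfs:///path/to/input.csv")    // hypothetical input path
    val expectedColumns = 5                                  // assumed schema width
    val parsed  = lines.map(line => (line, line.split(",", -1)))
    val valid   = parsed.filter(_._2.length == expectedColumns).map(_._2)   // well-formed rows
    val invalid = parsed.filter(_._2.length != expectedColumns).map(_._1)   // rows set aside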

Re: How to handle under-performing nodes in the cluster

2015-03-23 Thread Sean Owen
Why is it under-performing? This just says it executed fewer tasks, which could be because of data locality, etc. Better is to look at whether stages are slowed down by straggler tasks, and if so, whether they come from one machine, and if so, what may be different about that one. On Fri, Mar 20, 2

Re: How to check that a dataset is sorted after it has been written out?

2015-03-23 Thread Sean Owen
Data is not (necessarily) sorted when read from disk, no. A file might have many blocks even, and while a block yields a partition in general, the order in which those partitions appear in the RDD is not defined. This is why you'd sort if you need the data sorted. I think you could conceivably mak

registerTempTable is not a member of RDD on spark 1.2?

2015-03-23 Thread IT CTO
Hi, I am running Spark; when I use sc.version I get 1.2, but when I call registerTempTable("MyTable") I get an error saying registerTempTable is not a member of RDD. Why? -- Eran | CTO

Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?

2015-03-23 Thread Emre Sevinc
Hello, According to Spark Documentation at https://spark.apache.org/docs/1.2.1/submitting-applications.html : --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown). And indeed, when I use that parameter, in my S

Re: registerTempTable is not a member of RDD on spark 1.2?

2015-03-23 Thread Dean Wampler
In 1.2 it's a member of SchemaRDD and it becomes available on RDD (through the "type class" mechanism) when you add a SQLContext, like so: val sqlContext = new SQLContext(sc); import sqlContext._ In 1.3, the method has moved to the new DataFrame type. Dean Wampler, Ph.D. Author: Programming Scala
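A minimal Spark 1.2-style sketch of the pattern Dean describes; the Person case class and sample data are made up for illustration:

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)
    val sqlContext = new SQLContext(sc)
    import sqlContext._   // brings the RDD-to-SchemaRDD implicit conversion into scope
    val people = sc.parallelize(Seq(Person("a", 1), Person("b", 2)))
    people.registerTempTable("MyTable")
    sqlContext.sql("SELECT name FROM MyTable WHERE age > 1").collect()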

[no subject]

2015-03-23 Thread Udbhav Agarwal
Hi, I am querying HBase via Spark SQL with the Java APIs. Step 1: create JavaPairRdd, then JavaRdd, then JavaSchemaRdd.applySchema objects. Step 2: sqlContext.sql(sql query). If I am updating my HBase database between these two steps (by hbase shell in some other console), the query in step two is not

Re: Saving Dstream into a single file

2015-03-23 Thread Dean Wampler
You can use the coalesce method to reduce the number of partitions. You can reduce to one if the data is not too big. Then write the output. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @dean
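A hedged sketch of that approach, assuming a DStream[String] named lines and a made-up output prefix:

    lines.foreachRDD { (rdd, time) =>
      // one partition => one part file per batch; only reasonable for small batches
      rdd.coalesce(1).saveAsTextFile(s"/output/batch-${time.milliseconds}")
    }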

Re: registerTempTable is not a member of RDD on spark 1.2?

2015-03-23 Thread IT CTO
Thanks. I am new to the environment and running Cloudera CDH 5.3 with Spark in it. Apparently when running this command in spark-shell, val sqlContext = new SQLContext(sc), I fail with "not found: type SQLContext". Any idea why? On Mon, Mar 23, 2015 at 3:05 PM, Dean Wampler wrote: > In 1.2

Re: registerTempTable is not a member of RDD on spark 1.2?

2015-03-23 Thread Ted Yu
Have you tried adding the following ? import org.apache.spark.sql.SQLContext Cheers On Mon, Mar 23, 2015 at 6:45 AM, IT CTO wrote: > Thanks. > I am new to the environment and running cloudera CDH5.3 with spark in it. > > apparently when running in spark-shell this command val sqlContext = new

Re: registerTempTable is not a member of RDD on spark 1.2?

2015-03-23 Thread IT CTO
Yes! Any reason this happens in my environment and not in any sample code I found? Should I fix something in the path or env? Eran On Mon, Mar 23, 2015 at 3:50 PM, Ted Yu wrote: > Have you tried adding the following ? > > import org.apache.spark.sql.SQLContext > > Cheers > > On Mon, Mar 23, 201

RE: Spark SQL udf(ScalaUdf) is very slow

2015-03-23 Thread Cheng, Hao
This is a very interesting issue, the root reason for the lower performance probably is, in Scala UDF, Spark SQL converts the data type from internal representation to Scala representation via Scala reflection recursively. Can you create a Jira issue for tracking this? I can start to work on the

Spark RDD mapped to Hbase to be updateable

2015-03-23 Thread Siddharth Ubale
Hi, We have a JavaRDD mapped to an HBase table, and when we query the HBase table using the Spark SQL API we can access the data. However, when we update the HBase table while the SparkSQL & SparkConf are initialised, we cannot see the updated data. Is there any way we can have the RDD mapped to HBase update

RDD storage in spark steaming

2015-03-23 Thread abhi
Hi, I have a simple question about creating RDDs. Whenever an RDD is created in Spark Streaming for a particular time window, where does the RDD get stored? 1. Does it get stored on the driver machine, or on all the machines in the cluster? 2. Does the data get stored in memory b

Re: RDD storage in spark steaming

2015-03-23 Thread Jeffrey Jedele
Hey Abhi, many of StreamingContext's methods to create input streams take a StorageLevel parameter to configure this behavior. RDD partitions are generally stored in the in-memory cache of worker nodes I think. You can also configure replication and spilling to disk if needed. Regards, Jeff 2015-
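For example (a sketch; the host, port, and batch interval are placeholders):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    // received blocks are serialized, kept in memory, spilled to disk, and replicated 2x
    val stream = ssc.socketTextStream("host", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)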

Re: join two DataFrames, same column name

2015-03-23 Thread Eric Friedman
Michael, thank you for the workaround and for letting me know of the upcoming enhancements, both of which sound appealing. On Sun, Mar 22, 2015 at 1:25 PM, Michael Armbrust wrote: > You can include * and a column alias in the same select clause > var df1 = sqlContext.sql("select *, column_id AS

Spark error NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit

2015-03-23 Thread , Roy
Hi, I am using CDH 5.3.2 packages installation through Cloudera Manager 5.3.2 I am trying to run one spark job with following command PYTHONPATH=~/code/utils/ spark-submit --master yarn --executor-memory 3G --num-executors 30 --driver-memory 2G --executor-cores 2 --name=analytics /home/abc/co

Re: JAVA_HOME problem with upgrade to 1.3.0

2015-03-23 Thread Williams, Ken
> From: Williams, Ken <ken.willi...@windlogics.com> > Date: Thursday, March 19, 2015 at 10:59 AM > To: Spark list <user@spark.apache.org> > Subject: JAVA_HOME problem with upgrade to 1.3.0 > > […] > Finally, I go and check the YARN app master's web interface (since the job is >

Re: join two DataFrames, same column name

2015-03-23 Thread Eric Friedman
> > You can include * and a column alias in the same select clause > var df1 = sqlContext.sql("select *, column_id AS table1_id from table1") FYI, this does not ultimately work as the * still includes column_id and you cannot have two columns of that name in the joined DataFrame. So I ended up a
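One way to avoid the ambiguity (a sketch, not necessarily what Eric ended up doing) is to rename the key column before the join with the Spark 1.3 DataFrame API:

    // df1 and df2 stand for the two DataFrames from the thread; the rename avoids
    // having two columns named column_id in the joined result
    val df1Renamed = df1.withColumnRenamed("column_id", "table1_id")
    val joined = df1Renamed.join(df2, df1Renamed("table1_id") === df2("column_id"))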

Re: Spark 1.2. loses often all executors

2015-03-23 Thread mrm
Hi, I have received three replies to my question on my personal e-mail, why don't they also show up on the mailing list? I would like to reply to the 3 users through a thread. Thanks, Maria -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-loses-o

Re: Spark 1.2. loses often all executors

2015-03-23 Thread Ted Yu
In this thread: http://search-hadoop.com/m/JW1q5DM69G I only saw two replies. Maybe some people forgot to use 'Reply to All' ? Cheers On Mon, Mar 23, 2015 at 8:19 AM, mrm wrote: > Hi, > > I have received three replies to my question on my personal e-mail, why > don't they also show up on the m

Is yarn-standalone mode deprecated?

2015-03-23 Thread nitinkak001
Is yarn-standalone mode deprecated in Spark now? The reason I am asking is that while I can find it in the 0.9.0 documentation (https://spark.apache.org/docs/0.9.0/running-on-yarn.html), I am not able to find it in 1.2.0. I am using this mode to run Spark jobs from Oozie as a java action. Remov

Re: Is yarn-standalone mode deprecated?

2015-03-23 Thread Sandy Ryza
The mode is not deprecated, but the name "yarn-standalone" is now deprecated. It's now referred to as "yarn-cluster". -Sandy On Mon, Mar 23, 2015 at 11:49 AM, nitinkak001 wrote: > Is yarn-standalone mode deprecated in Spark now. The reason I am asking is > because while I can find it in 0.9.0

Re: Spark error NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit

2015-03-23 Thread Ted Yu
InputSplit is in hadoop-mapreduce-client-core jar Please check that the jar is in your classpath. Cheers On Mon, Mar 23, 2015 at 8:10 AM, , Roy wrote: > Hi, > > > I am using CDH 5.3.2 packages installation through Cloudera Manager 5.3.2 > > I am trying to run one spark job with following com

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Yiannis Gkoufas
Hi Yin, Yes, I have set spark.executor.memory to 8g and the worker memory to 16g without any success. I cannot figure out how to increase the number of mapPartitions tasks. Thanks a lot On 20 March 2015 at 18:44, Yin Huai wrote: > spark.sql.shuffle.partitions only control the number of tasks i

Re: PySpark, ResultIterable and taking a list and saving it into different parquet files

2015-03-23 Thread chuwiey
In case anyone wants to learn about my solution for this: groupByKey is highly inefficient due to the swapping of elements between the different partitions as well as requiring enough mem in each worker to handle the elements for each group. So instead of using groupByKey, I ended up taking the fla

Re: Spark streaming alerting

2015-03-23 Thread Mohit Anchlia
I think I didn't explain myself properly :) What I meant to say was that generally the Spark worker runs either on HDFS data nodes or on Cassandra nodes, which typically are in a private network (protected). When a condition is matched it's difficult to send out the alerts directly from the worker

Re: Spark error NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit

2015-03-23 Thread Marcelo Vanzin
This feels very CDH-specific (or even CM-specific), I'd suggest following up on cdh-u...@cloudera.org instead. On Mon, Mar 23, 2015 at 8:10 AM, , Roy wrote: > 15/03/23 11:06:49 WARN TaskSetManager: Lost task 9.0 in stage 0.0 (TID 7, > hdp003.dev.xyz.com): java.lang.NoClassDefFoundError: > org/apa

Parquet file + increase read parallelism

2015-03-23 Thread SamyaMaiti
Hi All, Suppose I have a parquet file of 100 MB in HDFS and my HDFS block size is 64MB, so I have 2 blocks of data. When I do sqlContext.parquetFile("path") followed by an action, two tasks are started on two partitions. My intent is to read these 2 blocks into more partitions to fully utilize my cluste
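Two things that may help, sketched under assumptions (whether the Parquet input format honors the split-size setting depends on the Hadoop/Parquet versions, so treat option 1 as something to try):

    // Option 1: ask for smaller input splits before reading (may or may not be honored)
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
      (16 * 1024 * 1024).toString)
    val df = sqlContext.parquetFile("path")

    // Option 2: shuffle into more partitions explicitly after reading
    val more = sqlContext.parquetFile("path").repartition(8)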

Re: Is yarn-standalone mode deprecated?

2015-03-23 Thread Sandy Ryza
The former is deprecated. However, the latter is functionally equivalent to it. Both launch an app in what is now called "yarn-cluster" mode. Oozie now also has a native Spark action, though I'm not familiar on the specifics. -Sandy On Mon, Mar 23, 2015 at 1:01 PM, Nitin kak wrote: > To be m

Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?

2015-03-23 Thread Sandy Ryza
Hi Emre, The --conf property is meant to work with yarn-cluster mode. System.getProperty("key") isn't guaranteed, but new SparkConf().get("key") should. Does it not? -Sandy On Mon, Mar 23, 2015 at 8:39 AM, Emre Sevinc wrote: > Hello, > > According to Spark Documentation at > https://spark.apa
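In code, the distinction Sandy points out looks roughly like this ("spark.myapp.option" is a made-up property passed via --conf):

    val conf = new org.apache.spark.SparkConf()
    val value = conf.get("spark.myapp.option")    // populated from --conf in yarn-cluster mode
    // System.getProperty("spark.myapp.option")   // not guaranteed to be set on the AM/driver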

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Martin Goodson
Have you tried to repartition() your original data to make more partitions before you aggregate? -- Martin Goodson | VP Data Science (0)20 3397 1240 On Mon, Mar 23, 2015 at 4:12 PM, Yiannis Gkoufas wrote: > Hi Yin, > > Yes, I have set spark.executor.memory to 8g and

SchemaRDD/DataFrame result partitioned according to the underlying datasource partitions

2015-03-23 Thread Stephen Boesch
Is there a way to take advantage of the underlying datasource partitions when generating a DataFrame/SchemaRDD via catalyst? It seems from the sql module that the only options are RangePartitioner and HashPartitioner - and further that those are selected automatically by the code . It was not app

Re: spark disk-to-disk

2015-03-23 Thread Koert Kuipers
i just realized the major limitation is that i lose partitioning info... On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin wrote: > > On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers wrote: > >> so finally i can resort to: >> rdd.saveAsObjectFile(...) >> sc.objectFile(...) >> but that seems like a rat

Re: version conflict common-net

2015-03-23 Thread Jacob Abraham
Hi Sean, Thanks a ton for your reply. The particular situation I have is case (3) that you have mentioned. The class that I am using from commons-net is FTPClient(). This class is present in both the 2.2 version and the 3.3 version. However, in the 3.3 version there are two additional methods (amo

Re: spark disk-to-disk

2015-03-23 Thread Koert Kuipers
there is a way to reinstate the partitioner, but that requires sc.objectFile to read exactly what i wrote, which means sc.objectFile should never split files on reading (a feature of hadoop file inputformat that gets in the way here). On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers wrote: > i jus

Re: version conflict common-net

2015-03-23 Thread Sean Owen
I think it's spark.yarn.user.classpath.first in 1.2, and spark.{driver,executor}.extraClassPath in 1.3. Obviously that's for if you are using YARN, in the first instance. On Mon, Mar 23, 2015 at 5:41 PM, Jacob Abraham wrote: > Hi Sean, > > Thanks a ton for you reply. > > The particular situation

Re: Write to Parquet File in Python

2015-03-23 Thread chuwiey
Hey Akriti23, pyspark gives you a saveAsParquetFile() api, to save your rdd as parquet. You will however, need to infer the schema or describe it manually before you can do so. Here are some docs about that (v1.2.1, you can search for the others, they're relatively similar 1.1 and up): http://spa

Re: Spark streaming alerting

2015-03-23 Thread Tathagata Das
Something like that is not really supported out of the box. You will have to implement your own RPC mechanism (sending stuff back to the driver for forwarding) for that. TD On Mon, Mar 23, 2015 at 9:43 AM, Mohit Anchlia wrote: > I think I didn't explain myself properly :) What I meant to say wa

Strange behavior with PySpark when using Join() and zip()

2015-03-23 Thread Ofer Mendelevitch
Hi, I am running into a strange issue when doing a JOIN of two RDDs followed by ZIP from PySpark. It's part of a more complex application, but I was able to narrow it down to a simplified example that's easy to replicate and causes the same problem to appear: raw = sc.parallelize([('k'+str(x),'

Re: How to check that a dataset is sorted after it has been written out?

2015-03-23 Thread Michael Albert
Thanks for the information! (to all who responded) The code below *seems* to work. Any hidden gotchas that anyone sees? And still, in "terasort", how did they check that the data was actually sorted? :-) -Mike class MyInputFormat[T] extends parquet.hadoop.ParquetInputFormat[T] { override def

Re: Spark error NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit

2015-03-23 Thread , Roy
Now the job started but it is stuck at a different point: 15/03/23 14:09:00 INFO BlockManagerInfo: Added rdd_10_0 on disk on ip-10-0-3-171.ec2.internal:58704 (size: 138.8 MB) So I checked the YARN logs on ip-10-0-3-171.ec2.internal but I didn't see any errors. Anyone know what's going on here? Thanks.

Re: Strange behavior with PySpark when using Join() and zip()

2015-03-23 Thread Sean Owen
I think the explanation is that the join does not guarantee any order, since it causes a shuffle in general, and it is computed twice in the first example, resulting in a difference for d1 and d2. You can persist() the result of the join and in practice I believe you'd find it behaves as expected,

Converting SparkSQL query to Scala query

2015-03-23 Thread nishitd
I have a complex SparkSQL query of the nature select a.a, b.b, c.c from a,b,c where a.x = b.x and b.y = c.y. How do I convert this efficiently into a Scala query of the form a.join(b, ..., ...) and so on? Can anyone help me with this? If my question needs more clarification, please let me know. -- View this

newbie quesiton - spark with mesos

2015-03-23 Thread Anirudha Jadhav
i have a mesos cluster, which i deploy spark to by using instructions on http://spark.apache.org/docs/0.7.2/running-on-mesos.html after that the spark shell starts up fine. then i try the following on the shell: val data = 1 to 1 val distData = sc.parallelize(data) distData.filter(_< 10).co

Spark-thriftserver Issue

2015-03-23 Thread Neil Dev
Hi, I am having an issue starting spark-thriftserver. I'm running Spark 1.3 with Hadoop 2.4.0. I would like to be able to change its port too, so I can have hive-thriftserver as well as spark-thriftserver running at the same time. Starting spark thrift server:- sudo ./start-thriftserver.sh --master s

Re: newbie quesiton - spark with mesos

2015-03-23 Thread Dean Wampler
That's a very old page, try this instead: http://spark.apache.org/docs/latest/running-on-mesos.html When you run your Spark job on Mesos, tasks will be started on the slave nodes as needed, since "fine-grained" mode is the default. For a job like your example, very few tasks will be needed. Actu

Getting around Serializability issues for types not in my control

2015-03-23 Thread adelbertc
Hey all, I'd like to use the Scalaz library in some of my Spark jobs, but am running into issues where some stuff I use from Scalaz is not serializable. For instance, in Scalaz there is a trait /** In Scalaz */ trait Applicative[F[_]] { def apply2[A, B, C](fa: F[A], fb: F[B])(f: (A, B) => C): F

Re: Converting SparkSQL query to Scala query

2015-03-23 Thread Dean Wampler
There isn't any automated way. Note that as the DataFrame implementation improves, it will probably do a better job with query optimization than hand-rolled Scala code. I don't know if that's true yet, though. For now, there are a few examples at the beginning of the DataFrame scaladocs

Re: Strange behavior with PySpark when using Join() and zip()

2015-03-23 Thread Sean Owen
I think this is a bad example since testData is not deterministic at all. I thought we had fixed this or similar examples in the past? As in https://github.com/apache/spark/pull/1250/files Hm, anyone see a reason that shouldn't be changed too? On Mon, Mar 23, 2015 at 7:00 PM, Ofer Mendelevitch w

Re: Strange behavior with PySpark when using Join() and zip()

2015-03-23 Thread Ofer Mendelevitch
Thanks Sean, Sorting definitely solves it, but I was hoping it could be avoided :) In the documentation for Classification in ML-Lib for example, zip() is used to create labelsAndPredictions: - from pyspark.mllib.tree import RandomForest from pyspark.mllib.util import MLUtils # Load and pa

SparkEnv

2015-03-23 Thread Koert Kuipers
is it safe to access SparkEnv.get inside say mapPartitions? i need to get a Serializer (so SparkEnv.get.serializer) thanks
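A sketch of the pattern being asked about; whether it is a supported, stable usage is exactly the question, so this is illustrative only:

    val rdd = sc.parallelize(1 to 10)
    val serializedBytes = rdd.mapPartitions { iter =>
      // runs on the executor; picks up whatever serializer the job was configured with
      val ser = org.apache.spark.SparkEnv.get.serializer.newInstance()
      iter.map(i => ser.serialize(i).array())
    }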

Re: Getting around Serializability issues for types not in my control

2015-03-23 Thread Dean Wampler
Well, it's complaining about trait OptionInstances which is defined in Option.scala in the std package. Use scalap or javap on the scalaz library to find out which member of the trait is the problem, but since it says "$$anon$1", I suspect it's the first value member, "implicit val optionInstance",

JDBC DF using DB2

2015-03-23 Thread Jack Arenas
Hi Team, I’m trying to create a DF using jdbc as detailed here – I’m currently using DB2 v9.7.0.6 and I’ve tried to use the db2jcc.jar and db2jcc_license_cu.jar combo, and while it works in --master local using the command below, I get some strange behavior in --master yarn-client. Here is th

Re: Use pig load function in spark

2015-03-23 Thread Denny Lee
You may be able to utilize Spork (Pig on Apache Spark) as a mechanism to do this: https://github.com/sigmoidanalytics/spork On Mon, Mar 23, 2015 at 2:29 AM Dai, Kevin wrote: > Hi, all > > > > Can spark use pig’s load function to load data? > > > > Best Regards, > > Kevin. >

Re: JDBC DF using DB2

2015-03-23 Thread Ted Yu
bq. is to modify compute_classpath.sh on all worker nodes to include your driver JARs. Please follow the above advice. Cheers On Mon, Mar 23, 2015 at 12:34 PM, Jack Arenas wrote: > Hi Team, > > > > I’m trying to create a DF using jdbc as detailed here >

Re: Getting around Serializability issues for types not in my control

2015-03-23 Thread Adelbert Chang
Is there no way to pull out the bits of the instance I want before I send it through the closure for aggregate? I did try pulling things out, along the lines of def foo[G[_], B](blah: Blah)(implicit G: Applicative[G]) = { val lift: B => G[RDD[B]] = b => G.point(sparkContext.parallelize(List(b)))

Re: SchemaRDD/DataFrame result partitioned according to the underlying datasource partitions

2015-03-23 Thread Michael Armbrust
There is not an interface to this at this time, and in general I'm hesitant to open up interfaces where the user could make a mistake where they think something is going to improve performance but will actually impact correctness. Since, as you say, we are picking the partitioner automatically in

Re: Getting around Serializability issues for types not in my control

2015-03-23 Thread Cody Koeninger
Have you tried instantiating the instance inside the closure, rather than outside of it? If that works, you may need to switch to use mapPartition / foreachPartition for efficiency reasons. On Mon, Mar 23, 2015 at 3:03 PM, Adelbert Chang wrote: > Is there no way to pull out the bits of the ins
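A sketch of what Cody suggests, assuming scalaz 7 on the classpath and using the Option instance the thread is about (the body is illustrative):

    val rdd = sc.parallelize(Seq(1, 2, 3))
    val lifted = rdd.mapPartitions { iter =>
      // built on the executor, so the non-serializable instance never rides in the closure
      val F = scalaz.std.option.optionInstance
      iter.map(i => F.point(i))
    }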

Re: Spark per app logging

2015-03-23 Thread Udit Mehta
Yes each application can use its own log4j.properties but I am not sure how to configure log4j so that the driver and executor write to file. This is because if we set the "spark.executor.extraJavaOptions" it will read from a file and that is not what I need. How do I configure log4j from the app s

objectFile uses only java serializer?

2015-03-23 Thread Koert Kuipers
in the comments on SparkContext.objectFile it says: "It will also be pretty slow if you use the default serializer (Java serialization)" this suggests the spark.serializer is used, which means i can switch to the much faster kryo serializer. however when i look at the code it uses Utils.deserializ

Re: spark disk-to-disk

2015-03-23 Thread Reynold Xin
Maybe implement a very simple function that uses the Hadoop API to read in based on file names (i.e. parts)? On Mon, Mar 23, 2015 at 10:55 AM, Koert Kuipers wrote: > there is a way to reinstate the partitioner, but that requires > sc.objectFile to read exactly what i wrote, which means sc.object

Re: objectFile uses only java serializer?

2015-03-23 Thread Ted Yu
bq. it uses Utils.deserialize, which is always using Java serialization. I agree with your finding. On Mon, Mar 23, 2015 at 1:14 PM, Koert Kuipers wrote: > in the comments on SparkContext.objectFile it says: > "It will also be pretty slow if you use the default serializer (Java > serialization)

Re: Getting around Serializability issues for types not in my control

2015-03-23 Thread Adelbert Chang
Instantiating the instance? The actual instance it's complaining about is: https://github.com/scalaz/scalaz/blob/16838556c9309225013f917e577072476f46dc14/core/src/main/scala/scalaz/std/Option.scala#L10-11 The specific import where it's picking up the instance is: https://github.com/scalaz/scalaz

hadoop input/output format advanced control

2015-03-23 Thread Koert Kuipers
currently its pretty hard to control the Hadoop Input/Output formats used in Spark. The conventions seems to be to add extra parameters to all methods and then somewhere deep inside the code (for example in PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into settings on the
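For context, the kind of per-call knob being discussed currently has to travel as an extra Configuration argument, roughly like this (a sketch; the RDD, output path, and compression key are made up):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true")
    sc.parallelize(Seq("a", "b"))
      .map(s => (NullWritable.get(), new Text(s)))
      .saveAsNewAPIHadoopFile("/tmp/out", classOf[NullWritable], classOf[Text],
        classOf[TextOutputFormat[NullWritable, Text]], hadoopConf)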

Re: Using a different spark jars than the one on the cluster

2015-03-23 Thread Denny Lee
+1 - I currently am doing what Marcelo is suggesting as I have a CDH 5.2 cluster (with Spark 1.1) and I'm also running Spark 1.3.0+ side-by-side in my cluster. On Wed, Mar 18, 2015 at 1:23 PM Marcelo Vanzin wrote: > Since you're using YARN, you should be able to download a Spark 1.3.0 > tarball

Is it possible to use json4s 3.2.11 with Spark 1.3.0?

2015-03-23 Thread Alexey Zinoviev
Spark has a dependency on json4s 3.2.10, but this version has several bugs and I need to use 3.2.11. I added json4s-native 3.2.11 dependency to build.sbt and everything compiled fine. But when I spark-submit my JAR it provides me with 3.2.10. build.sbt import sbt.Keys._ name := "sparkapp" vers

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-23 Thread Manoj Samel
Found the issue with the above error - the setting for spark_shuffle was incomplete. Now it is able to ask for and get additional executors. The issue is that once they are released, it is not able to proceed with the next query. The environment is CDH 5.3.2 (Hadoop 2.5) with Kerberos & Spark 1.3. After idle time, the

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-23 Thread Marcelo Vanzin
On Mon, Mar 23, 2015 at 2:15 PM, Manoj Samel wrote: > Found the issue above error - the setting for spark_shuffle was incomplete. > > Now it is able to ask and get additional executors. The issue is once they > are released, it is not able to proceed with next query. That looks like SPARK-6325, w

Re: How to use DataFrame with MySQL

2015-03-23 Thread Rishi Yadav
for me, it's only working if I set --driver-class-path to mysql library. On Sun, Mar 22, 2015 at 11:29 PM, gavin zhang wrote: > OK,I found what the problem is: It couldn't work with > mysql-connector-5.0.8. > I updated the connector version to 5.1.34 and it worked. > > > > -- > View this message

Shuffle Spill Memory and Shuffle Spill Disk

2015-03-23 Thread Bijay Pathak
Hello, I am running TeraSort on 100GB of data. The final metrics I am getting on Shuffle Spill are: Shuffle Spill (Memory): 122.5 GB Shuffle Spill (Disk): 3.4 GB What's the difference and relation between these two metrics? Do these mean 122.5 GB was s

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-23 Thread Manoj Samel
Log shows stack traces that seem to match the assert in JIRA so it seems I am hitting the issue. Thanks for the heads up ... 15/03/23 20:29:50 ERROR actor.OneForOneStrategy: assertion failed: Allocator killed more executors than are allocated! java.lang.AssertionError: assertion failed: Allocator

Re: Shuffle Spill Memory and Shuffle Spill Disk

2015-03-23 Thread Sandy Ryza
Hi Bijay, The Shuffle Spill (Disk) is the total number of bytes written to disk by records spilled during the shuffle. The Shuffle Spill (Memory) is the amount of space the spilled records occupied in memory before they were spilled. These differ because the serialized format is more compact, an

Re: Spark-thriftserver Issue

2015-03-23 Thread Zhan Zhang
Probably the port is already used by others, e.g., Hive. You can change the port similar to below: ./sbin/start-thriftserver.sh --master yarn --executor-memory 512m --hiveconf hive.server2.thrift.port=10001 Thanks. Zhan Zhang On Mar 23, 2015, at 12:01 PM, Neil Dev <neilk...@gmail.com>

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Patrick Wendell
Hey Yiannis, If you just perform a count on each "name", "date" pair... can it succeed? If so, can you do a count and then order by to find the largest one? I'm wondering if there is a single pathologically large group here that is somehow causing OOM. Also, to be clear, you are getting GC limit
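That diagnostic could look roughly like this with the 1.3 DataFrame API (a sketch; df stands for the DataFrame read from the parquet files):

    import org.apache.spark.sql.functions.desc

    df.groupBy("name", "date")
      .count()
      .orderBy(desc("count"))
      .show(20)   // the largest groups come first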

Weird exception in Spark job

2015-03-23 Thread nitinkak001
I am trying to run a Hive query from Spark code through HiveContext. Anybody knows what these exceptions mean? I have no clue LogType: stderr LogLength: 3345 Log Contents: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3
