Spark + Mesos + HDFS resource split

2015-04-27 Thread Ankur Chauhan

Hi,

I am building a mesos cluster for the purposes of using it to run
spark workloads (in addition to other frameworks). I am under the
impression that it is preferable/recommended to run the HDFS datanode
process and the Spark slave on the same physical node (or EC2 instance or VM).

My question is: what is the recommended resource split? How much
memory and CPU should I preallocate for HDFS, and how much should I set
aside as allocatable by Mesos? Is there a rule-of-thumb recommendation
around this?
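
For illustration, a hedged sketch of what a static split can look like on the
Mesos side (the numbers and hostnames are placeholders, not a recommendation):
the agent only offers frameworks the resources passed via --resources, so
whatever you leave out stays available for co-located daemons such as the HDFS
datanode.

    # hypothetical 16-core / 64 GB node: hold back ~2 cores and 8 GB for the
    # datanode and OS, and offer the rest to Mesos frameworks
    mesos-slave --master=zk://zk1:2181/mesos \
                --resources="cpus:14;mem:57344"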

-- Ankur Chauhan




Re: Cassandra Connection Issue with Spark-jobserver

2015-04-27 Thread Noorul Islam K M

Are you using DSE Spark? If so, are you pointing spark-jobserver to use DSE
Spark?

Thanks and Regards
Noorul

Anand anand.vi...@monotype.com writes:

 I am new to the Spark world and Job Server.

 My code:

 package spark.jobserver

 import java.nio.ByteBuffer

 import scala.collection.JavaConversions._
 import scala.collection.mutable.ListBuffer
 import scala.collection.immutable.Map

 import org.apache.cassandra.hadoop.ConfigHelper
 import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat
 import org.apache.cassandra.hadoop.cql3.CqlConfigHelper
 import org.apache.cassandra.hadoop.cql3.CqlOutputFormat
 import org.apache.cassandra.utils.ByteBufferUtil
 import org.apache.hadoop.mapreduce.Job

 import com.typesafe.config.{Config, ConfigFactory}
 import org.apache.spark._
 import org.apache.spark.SparkContext._
 import scala.util.Try

 object CassandraCQLTest extends SparkJob {

   def main(args: Array[String]) {
     val sc = new SparkContext("local[4]", "CassandraCQLTest")
     sc.addJar("/extra_data/spark-cassandra-connector/spark-cassandra-connector-java/target/scala-2.10/spark-cassandra-connector-java-assembly-1.3.0-SNAPSHOT.jar")
     val config = ConfigFactory.parseString("")
     val results = runJob(sc, config)
     println("Result is " + results)
   }

   override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
     Try(config.getString("input.string"))
       .map(x => SparkJobValid)
       .getOrElse(SparkJobInvalid("No input.string config param"))
   }

   override def runJob(sc: SparkContext, config: Config): Any = {
     val cHost: String = "localhost"
     val cPort: String = "9160"
     val KeySpace = "retail"
     val InputColumnFamily = "ordercf"
     val OutputColumnFamily = "salecount"

     val job = new Job()
     job.setInputFormatClass(classOf[CqlPagingInputFormat])
     ConfigHelper.setInputInitialAddress(job.getConfiguration(), cHost)
     ConfigHelper.setInputRpcPort(job.getConfiguration(), cPort)
     ConfigHelper.setInputColumnFamily(job.getConfiguration(), KeySpace, InputColumnFamily)
     ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner")
     CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "3")

     /** CqlConfigHelper.setInputWhereClauses(job.getConfiguration(), "user_id='bob'") */

     /** An UPDATE writes one or more columns to a record in a Cassandra column family */
     val query = "UPDATE " + KeySpace + "." + OutputColumnFamily + " SET sale_count = ? "
     CqlConfigHelper.setOutputCql(job.getConfiguration(), query)

     job.setOutputFormatClass(classOf[CqlOutputFormat])
     ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KeySpace, OutputColumnFamily)
     ConfigHelper.setOutputInitialAddress(job.getConfiguration(), cHost)
     ConfigHelper.setOutputRpcPort(job.getConfiguration(), cPort)
     ConfigHelper.setOutputPartitioner(job.getConfiguration(), "Murmur3Partitioner")

     val casRdd = sc.newAPIHadoopRDD(job.getConfiguration(),
       classOf[CqlPagingInputFormat],
       classOf[java.util.Map[String, ByteBuffer]],
       classOf[java.util.Map[String, ByteBuffer]])

     val productSaleRDD = casRdd.map {
       case (key, value) =>
         (ByteBufferUtil.string(value.get("prod_id")), ByteBufferUtil.toInt(value.get("quantity")))
     }
     val aggregatedRDD = productSaleRDD.reduceByKey(_ + _)
     aggregatedRDD.collect().foreach {
       case (productId, saleCount) => println(productId + ":" + saleCount)
     }

     val casoutputCF = aggregatedRDD.map {
       case (productId, saleCount) =>
         val outColFamKey = Map("prod_id" -> ByteBufferUtil.bytes(productId))
         val outKey: java.util.Map[String, ByteBuffer] = outColFamKey
         var outColFamVal = new ListBuffer[ByteBuffer]
         outColFamVal += ByteBufferUtil.bytes(saleCount)
         val outVal: java.util.List[ByteBuffer] = outColFamVal
         (outKey, outVal)
     }

     casoutputCF.saveAsNewAPIHadoopFile(
       KeySpace,
       classOf[java.util.Map[String, ByteBuffer]],
       classOf[java.util.List[ByteBuffer]],
       classOf[CqlOutputFormat],
       job.getConfiguration()
     )
     casRdd.count
   }
 }

 When I push the jar using spark-jobserver and execute it, I get this on the
 spark-jobserver terminal:
 job-server[ERROR] Exception in thread pool-1-thread-1
 java.lang.NoClassDefFoundError:
 org/apache/cassandra/hadoop/cql3/CqlPagingInputFormat
 job-server[ERROR] at
 spark.jobserver.CassandraCQLTest$.runJob(CassandraCQLTest.scala:46)
 job-server[ERROR] at
 spark.jobserver.CassandraCQLTest$.runJob(CassandraCQLTest.scala:21)
 job-server[ERROR] at
 spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:235)
 job-server[ERROR] at
 scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
 job-server[ERROR] at
 scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
 

spark-defaults.conf

2015-04-27 Thread James King
I renamed spark-defaults.conf.template to spark-defaults.conf
and invoked

spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh

But I still get

failed to launch org.apache.spark.deploy.worker.Worker:
--properties-file FILE   Path to a custom Spark properties file.
 Default is conf/spark-defaults.conf.

But I'm thinking it should pick up the default spark-defaults.conf from
conf dir

Am I expecting or doing something wrong?

Regards
jk


Disable partition discovery

2015-04-27 Thread Cosmin Cătălin Sanda
How can one disable *Partition discovery* in *Spark 1.3.0 *when using
*sqlContext.parquetFile*?

Alternatively, is there a way to load *.parquet* files without *Partition
discovery*?

Cosmin


Re: Parquet error reading data that contains array of structs

2015-04-27 Thread Jianshi Huang
FYI,

Parquet schema output:

message pig_schema {
  optional binary cust_id (UTF8);
  optional int32 part_num;
  optional group ip_list (LIST) {
repeated group ip_t {
  optional binary ip (UTF8);
}
  }
  optional group vid_list (LIST) {
repeated group vid_t {
  optional binary vid (UTF8);
}
  }
  optional group fso_list (LIST) {
repeated group fso_t {
  optional binary fso (UTF8);
}
  }
}


And Parquet meta output:

creator: [parquet-mr (build ec6f200b4943cfcbc8be5a8e53fdebf07a8e16f7),
parquet-mr version 1.6.0rc7 (build
ec6f200b4943cfcbc8be5a8e53fdebf07a8e16f7), parquet-mr]
extra:   pig.schema = cust_id: chararray,part_num: int,ip_list: {ip_t:
(ip: chararray)},vid_list: {vid_t: (vid: chararray)},fso_list: {fso_t:
(fso: chararray)}

file schema: pig_schema

cust_id: OPTIONAL BINARY O:UTF8 R:0 D:1
part_num:OPTIONAL INT32 R:0 D:1
ip_list: OPTIONAL F:1
.ip_t:   REPEATED F:1
..ip:OPTIONAL BINARY O:UTF8 R:1 D:3
vid_list:OPTIONAL F:1
.vid_t:  REPEATED F:1
..vid:   OPTIONAL BINARY O:UTF8 R:1 D:3
fso_list:OPTIONAL F:1
.fso_t:  REPEATED F:1
..fso:   OPTIONAL BINARY O:UTF8 R:1 D:3

row group 1: RC:1201092 TS:537930256 OFFSET:4

cust_id:  BINARY GZIP DO:0 FPO:4 SZ:10629422/27627221/2.60 VC:1201092
ENC:PLAIN,RLE,BIT_PACKED
part_num: INT32 GZIP DO:0 FPO:10629426 SZ:358/252/0.70 VC:1201092
ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
ip_list:
.ip_t:
..ip: BINARY GZIP DO:0 FPO:10629784 SZ:41331065/180501686/4.37
VC:10540378 ENC:PLAIN,RLE
vid_list:
.vid_t:
..vid:BINARY GZIP DO:0 FPO:51960849 SZ:58820404/254819721/4.33
VC:11011894 ENC:PLAIN,RLE
fso_list:
.fso_t:
..fso:BINARY GZIP DO:0 FPO:110781253 SZ:21363255/74981376/3.51
VC:5612655 ENC:PLAIN,RLE

row group 2: RC:1830769 TS:1045506907 OFFSET:132144508

cust_id:  BINARY GZIP DO:0 FPO:132144508 SZ:17720131/42110882/2.38
VC:1830769 ENC:PLAIN,RLE,BIT_PACKED
part_num: INT32 GZIP DO:0 FPO:149864639 SZ:486/346/0.71 VC:1830769
ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
ip_list:
.ip_t:
..ip: BINARY GZIP DO:0 FPO:149865125 SZ:37687630/342050955/9.08
VC:20061916 ENC:PLAIN,PLAIN_DICTIONARY,RLE
vid_list:
.vid_t:
..vid:BINARY GZIP DO:0 FPO:187552755 SZ:56498124/516700215/9.15
VC:22410351 ENC:PLAIN,PLAIN_DICTIONARY,RLE
fso_list:
.fso_t:
..fso:BINARY GZIP DO:0 FPO:244050879 SZ:20110276/144644509/7.19
VC:10739272 ENC:PLAIN,PLAIN_DICTIONARY,RLE

row group 3: RC:22445 TS:4304290 OFFSET:264161155

cust_id:  BINARY GZIP DO:0 FPO:264161155 SZ:221527/516312/2.33 VC:22445
ENC:PLAIN,RLE,BIT_PACKED
part_num: INT32 GZIP DO:0 FPO:264382682 SZ:102/64/0.63 VC:22445
ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
ip_list:
.ip_t:
..ip: BINARY GZIP DO:0 FPO:264382784 SZ:483962/1204312/2.49
VC:123097 ENC:PLAIN_DICTIONARY,RLE
vid_list:
.vid_t:
..vid:BINARY GZIP DO:0 FPO:264866746 SZ:622977/2122080/3.41
VC:133136 ENC:PLAIN,PLAIN_DICTIONARY,RLE
fso_list:
.fso_t:
..fso:BINARY GZIP DO:0 FPO:265489723 SZ:240588/461522/1.92 VC:62173
ENC:PLAIN_DICTIONARY,RLE

Jianshi


On Mon, Apr 27, 2015 at 12:40 PM, Cheng Lian lian.cs@gmail.com wrote:

  Had an offline discussion with Jianshi, the dataset was generated by Pig.

 Jianshi - Could you please attach the output of parquet-schema
 path-to-parquet-file? I guess this is a Parquet format
 backwards-compatibility issue. Parquet hadn't standardized the representation
 of LIST and MAP until recently, thus many systems made their own choices and
 are not easily interoperable. In earlier days, Spark SQL used LIST and
 MAP formats similar to Avro, which were unfortunately not chosen as the
 current standard format. Details can be found here:
 https://github.com/apache/parquet-format/blob/master/LogicalTypes.md This
 document also defines backwards-compatibility rules to handle legacy
 Parquet data written by old Parquet implementations in various systems.

 So ideally, now Spark SQL should always write data following the standard,
 and implement all backwards-compatibility rules to read legacy data. JIRA
 issue for this is https://issues.apache.org/jira/browse/SPARK-6774

 I'm working on a PR https://github.com/apache/spark/pull/5422 for this.
 To fix SPARK-6774, we need to implement backwards-compatibility rules in
 both record converter and schema converter together. This PR has fixed the
 former, but I still need some time to finish the latter part and add tests.

 Cheng

 On 4/25/15 2:22 AM, Yin Huai wrote:

 oh, I missed that. It is fixed in 1.3.0.

  Also, Jianshi, the dataset was not generated by Spark SQL, right?

 On Fri, Apr 24, 2015 at 11:09 AM, Ted Yu yuzhih...@gmail.com wrote:

 Yin:
 Fix 

How to distribute Spark computation recipes

2015-04-27 Thread Olivier Girardot
Hi everyone,
I know that any RDD is related to its SparkContext and the associated
variables (broadcast, accumulators), but I'm looking for a way to
serialize/deserialize full RDD computations.

@rxin Spark SQL is, in a way, already doing this, but the parsers are
private[sql]; is there any way to reuse this work to get Logical/Physical
Plans in and out of Spark?

Regards,

Olivier.


Re: Spark streaming action running the same work in parallel

2015-04-27 Thread ColinMc
I was able to get it working. Instead of using customers.flatMap to return
alerts, I had to use the following:

customers.foreachRDD(new Function<JavaPairRDD<String, Iterable<QueueEvent>>, Void>() {
    @Override
    public Void call(final JavaPairRDD<String, Iterable<QueueEvent>> rdd) throws Exception {
        rdd.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Iterable<QueueEvent>>>>() {
            @Override
            public void call(final Iterator<Tuple2<String, Iterable<QueueEvent>>> i)
                    throws Exception {
            }
        });
        return null;
    }
});

This made sure that we only sent one alert per event for a customer. My unit
test showed that there was one RDD that had both customers with their events
as partitions.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-action-running-the-same-work-in-parallel-tp22613p22665.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: spark-defaults.conf

2015-04-27 Thread Zoltán Zvara
You should distribute your configuration file to workers and set the
appropriate environment variables, like HADOOP_HOME, SPARK_HOME,
HADOOP_CONF_DIR, SPARK_CONF_DIR.

On Mon, Apr 27, 2015 at 12:56 PM James King jakwebin...@gmail.com wrote:

 I renamed spark-defaults.conf.template to spark-defaults.conf
 and invoked

 spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh

 But I still get

 failed to launch org.apache.spark.deploy.worker.Worker:
 --properties-file FILE   Path to a custom Spark properties file.
  Default is conf/spark-defaults.conf.

 But I'm thinking it should pick up the default spark-defaults.conf from
 conf dir

 Am I expecting or doing something wrong?

 Regards
 jk





Spark 1.2.1: How to convert SchemaRDD to CassandraRDD?

2015-04-27 Thread Tash Chainar
Hi all, given the following:

import com.datastax.spark.connector.SelectableColumnRef;
import com.datastax.spark.connector.japi.CassandraJavaUtil;
import org.apache.spark.sql.SchemaRDD;
import static com.datastax.spark.connector.util.JavaApiHelper.toScalaSeq;
import scala.collection.Seq;

SchemaRDD schemaRDD = cassandraSQLContext.sql("select user.id as user_id from user");
schemaRDD.cache();
schemaRDD.collect();

I want to do SELECT on schemaRDD as
Seq<SelectableColumnRef> columnRefs =
    toScalaSeq(CassandraJavaUtil.toSelectableColumnRefs("user_id"));
CassandraRDD<UUID> rdd = schemaRDD.select(columnRefs);

but I get a type mismatch at schemaRDD.select():
incompatible types: Seq<SelectableColumnRef> cannot be converted to
Seq<Expression>

so obviously I need to convert SchemaRDD to CassandraRDD. How can it be
done?
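
Not an authoritative answer, but one observation: SchemaRDD.select expects
Catalyst expressions, so it will never hand back a CassandraRDD. If what you
actually need is a CassandraRDD with the connector's own select, a hedged
sketch of building it directly from the connector (Scala API shown; the
keyspace, table and column names are assumptions taken from the query above):

    // build a CassandraRDD straight from the connector instead of converting a SchemaRDD
    import com.datastax.spark.connector._

    val userIds = sc.cassandraTable("my_keyspace", "user")  // CassandraRDD[CassandraRow]
      .select("user_id")                                    // connector-side projection
      .map(row => row.getUUID("user_id"))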


Re: Cassandra Connection Issue with Spark-jobserver

2015-04-27 Thread Anand
I was able to fix the issues by providing the right versions of the
cassandra-all and thrift libraries.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-Connection-Issue-with-Spark-jobserver-tp22587p22664.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




ReduceByKey and sorting within partitions

2015-04-27 Thread Marco
Hi,

I'm trying, after reducing by key, to get data ordered among partitions
(like RangePartitioner) and within partitions (like sortByKey or
repartitionAndSortWithinPartitions), pushing the sorting down to the
shuffle machinery of the reduce phase.

I think, but maybe I'm wrong, that the correct way to do that is to have
combineByKey call the setKeyOrdering function on the ShuffledRDD that it
returns.

Am I wrong? Can it be done by a combination of other transformations with
the same efficiency?

Thanks,
Marco




Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-27 Thread TimMalt
Hi, and what can I do when I am on Windows? 
It does not allow me to set the hostname to some IP

Thanks,
Tim



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-to-a-cluster-behind-a-VPN-configuring-different-IP-address-tp9360p22674.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: hive-thriftserver maven artifact

2015-04-27 Thread Ted Yu
This is available for 1.3.1:
http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10

FYI

On Mon, Feb 16, 2015 at 7:24 AM, Marco marco@gmail.com wrote:

 Ok, so will it only be available in the next version (1.3.0)?

 2015-02-16 15:24 GMT+01:00 Ted Yu yuzhih...@gmail.com:

 I searched for 'spark-hive-thriftserver_2.10' on this page:
 http://mvnrepository.com/artifact/org.apache.spark

 Looks like it is not published.

 On Mon, Feb 16, 2015 at 5:44 AM, Marco marco@gmail.com wrote:

 Hi,

 I am referring to https://issues.apache.org/jira/browse/SPARK-4925
 (Hive Thriftserver Maven Artifact). Can somebody point me (URL) to the
 artifact in a public repository ? I have not found it @Maven Central.

 Thanks,
 Marco





 --
 Viele Grüße,
 Marco



Re: Group by order by

2015-04-27 Thread Richard Marscher
It's not related to Spark, but the concept of what you are trying to do
with the data. Grouping by ID means consolidating data for each ID down to
1 row per ID. You can sort by time after this point yes, but you would need
to either take each ID and time value pair OR do some aggregate operation
on the time. That's what the error message is explaining. Maybe you can
describe what you want your results to look like?

Here is some detail about the underlying operations here:

Example Data:
ID | Time     | SomeVal

1  | 02-02-15 | 4
1  | 02-03-15 | 5
2  | 02-02-15 | 4
2  | 02-02-15 | 5
2  | 02-05-15 | 2

A.

So if you do Group By ID this means 1 row per ID like below:

ID

1
2

To include Time in this projection you need to aggregate it with a function
to a single value. Then and only then can you use it in the projection and
sort on it.

SELECT id, max(time) FROM sample GROUP BY id SORT BY max(time) desc;

ID | max(time)
2  | 02-05-15
1  | 02-03-15

B.

Or if you do Group by ID, time then you get 1 row per ID and time pair:

ID | Time
1  | 02-02-15
1  | 02-03-15
2  | 02-02-15
2  | 02-05-15

Notice both rows with ID `2` and time `02-02-15` group down to 1 row in the
results here. In this case you can sort the results by time without using
an aggregate function.

SELECT id, time FROM sample GROUP BY id, time SORT BY time desc;

ID | Time
2  | 02-05-15
1  | 02-03-15
1  | 02-02-15
2  | 02-02-15


On Mon, Apr 27, 2015 at 3:28 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

  Hi Richard,



 There are several values of time per id. Is there a way to perform group
 by id and sort by time in Spark?



 Best regards, Alexander



 *From:* Richard Marscher [mailto:rmarsc...@localytics.com]
 *Sent:* Monday, April 27, 2015 12:20 PM
 *To:* Ulanov, Alexander
 *Cc:* user@spark.apache.org
 *Subject:* Re: Group by order by



 Hi,



 that error seems to indicate the basic query is not properly expressed. If
 you group by just ID, then that means it would need to aggregate all the
 time values into one value per ID, so you can't sort by it. Thus it tries
 to suggest an aggregate function for time so you can have 1 value per ID
 and properly sort it.



 On Mon, Apr 27, 2015 at 3:07 PM, Ulanov, Alexander 
 alexander.ula...@hp.com wrote:

  Hi,



 Could you suggest what is the best way to do “group by x order by y” in
 Spark?



 When I try to perform it with Spark SQL I get the following error (Spark
 1.3):



 val results = sqlContext.sql("select * from sample group by id order by time")

 org.apache.spark.sql.AnalysisException: expression 'time' is neither
 present in the group by, nor is it an aggregate function. Add to group by
 or wrap in first() if you don't care which value you get.;

 at
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:37)



 Is there a way to do it with just RDD?



 Best regards, Alexander





bug: numClasses is not a valid argument of LogisticRegressionWithSGD

2015-04-27 Thread Pagliari, Roberto
With the Python APIs, the available arguments I got (using inspect module) are 
the following:

['cls', 'data', 'iterations', 'step', 'miniBatchFraction', 'initialWeights', 
'regParam', 'regType', 'intercept']

numClasses is not available. Can someone comment on this?

Thanks,







Automatic Cache in SparkSQL

2015-04-27 Thread Wenlei Xie
Hi,

I am trying to answer a simple query with SparkSQL over the Parquet file.
When execute the query several times, the first run will take about 2s
while the later run will take 0.1s.

Looking at the log file, it seems the later runs don't load the data
from disk. However, I didn't enable any cache explicitly. Is there any
automatic cache used by SparkSQL? Is there any way to check this?

Thank you!

Best,
Wenlei


Fwd: Change ivy cache for spark on Windows

2015-04-27 Thread Burak Yavuz
+user

-- Forwarded message --
From: Burak Yavuz brk...@gmail.com
Date: Mon, Apr 27, 2015 at 1:59 PM
Subject: Re: Change ivy cache for spark on Windows
To: mj jone...@gmail.com


Hi,

In your conf file (SPARK_HOME\conf\spark-defaults.conf) you can set:

`spark.jars.ivy \your\path`


Best,
Burak

On Mon, Apr 27, 2015 at 1:49 PM, mj jone...@gmail.com wrote:

 Hi,

 I'm having trouble using the --packages option for spark-shell.cmd - I have
 to use Windows at work and have been issued a username with a space in it
 that means when I use the --packages option it fails with this message:

 Exception in thread main java.net.URISyntaxException: Illegal character
 in path at index 13: C:/Users/My Name/.ivy2/jars/spark-csv_2.10.jar

 The command I'm trying to run is:
 .\spark-shell.cmd --packages com.databricks:spark-csv_2.10:1.0.3

 I've tried creating an ivysettings.xml file with the content below in my
 .ivy2 directory, but spark doesn't seem to pick it up. Does anyone have any
 ideas of how to get around this issue?

 <ivysettings>
   <caches defaultCacheDir="c:\ivy_cache"/>
 </ivysettings>




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Change-ivy-cache-for-spark-on-Windows-tp22675.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Super slow caching in 1.3?

2015-04-27 Thread Christian Perez
Michael,

There is only one schema: both versions have 200 string columns in one file.

On Mon, Apr 20, 2015 at 9:08 AM, Evo Eftimov evo.efti...@isecc.com wrote:
 Now this is very important:



 “Normal RDDs” refers to “batch RDDs”. However, the default in-memory
 representation of RDDs which are part of a DStream is “serialized” rather than
 actual (hydrated) objects. The Spark documentation states that
 “serialization” is required for space and garbage-collection efficiency (but
 creates higher CPU load) – which makes sense considering the large number of
 RDDs which get discarded in a streaming app.



 So what does Databricks actually recommend as the object-oriented model for RDD
 elements used in Spark Streaming apps – flat or not – and can you provide a
 detailed description / spec of both?



 From: Michael Armbrust [mailto:mich...@databricks.com]
 Sent: Thursday, April 16, 2015 7:23 PM
 To: Evo Eftimov
 Cc: Christian Perez; user


 Subject: Re: Super slow caching in 1.3?



 Here are the types that we specialize; other types will be much slower.
 This is only for Spark SQL; normal RDDs do not serialize data that is
 cached.  I'll also note that until yesterday we were missing FloatType:

 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnBuilder.scala#L154



 Christian, can you provide the schema of the fast and slow datasets?



 On Thu, Apr 16, 2015 at 10:14 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 Michael what exactly do you mean by flattened version/structure here e.g.:

 1. An Object with only primitive data types as attributes
 2. An Object with  no more than one level of other Objects as attributes
 3. An Array/List of primitive types
 4. An Array/List of Objects

 This question is in general about RDDs not necessarily RDDs in the context
 of SparkSQL

 When answering can you also score how bad the performance of each of the
 above options is


 -Original Message-
 From: Christian Perez [mailto:christ...@svds.com]
 Sent: Thursday, April 16, 2015 6:09 PM
 To: Michael Armbrust
 Cc: user
 Subject: Re: Super slow caching in 1.3?

 Hi Michael,

 Good question! We checked 1.2 and found that it is also slow caching the
 same flat parquet file. Caching other file formats of the same data was
 faster by up to a factor of ~2. Note that the parquet file was created in
 Impala but the other formats were written by Spark SQL.

 Cheers,

 Christian

 On Mon, Apr 6, 2015 at 6:17 PM, Michael Armbrust mich...@databricks.com
 wrote:
 Do you think you are seeing a regression from 1.2?  Also, are you
 caching nested data or flat rows?  The in-memory caching is not really
 designed for nested data and so performs pretty slowly here (it's just
 falling back to kryo and even then there are some locking issues).

 If so, would it be possible to try caching a flattened version?

 CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable

 On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez christ...@svds.com
 wrote:

 Hi all,

 Has anyone else noticed very slow time to cache a Parquet file? It
 takes 14 s per 235 MB (1 block) uncompressed node local Parquet file
 on M2 EC2 instances. Or are my expectations way off...

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd






 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd






-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd




Change ivy cache for spark on Windows

2015-04-27 Thread mj
Hi,

I'm having trouble using the --packages option for spark-shell.cmd - I have
to use Windows at work and have been issued a username with a space in it
that means when I use the --packages option it fails with this message:

Exception in thread main java.net.URISyntaxException: Illegal character
in path at index 13: C:/Users/My Name/.ivy2/jars/spark-csv_2.10.jar

The command I'm trying to run is:
.\spark-shell.cmd --packages com.databricks:spark-csv_2.10:1.0.3

I've tried creating an ivysettings.xml file with the content below in my
.ivy2 directory, but spark doesn't seem to pick it up. Does anyone have any
ideas of how to get around this issue?

<ivysettings>
  <caches defaultCacheDir="c:\ivy_cache"/>
</ivysettings>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Change-ivy-cache-for-spark-on-Windows-tp22675.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




[SQL][1.3.1][JAVA]Use UDF in DataFrame join

2015-04-27 Thread Shuai Zheng
Hi All,

 

I want to ask how to use a UDF in the join function on a DataFrame. It
always gives me a "cannot resolve column name" error.

Anyone can give me an example on how to run this in java?

 

My code is like:

 

edmData.join(yb_lookup,
    edmData.col("year(YEARBUILT)").equalTo(yb_lookup.col("yb_class_vdm")));

 

But it won't work in Java. I understand the col function should only take a
column name, but is there a way to specify a simple expression like this in the
join statement?
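
Not a definitive answer, but a hedged sketch of one workaround, assuming the
year UDF is registered with the SQLContext: materialize the UDF result as a
real column first (selectExpr goes through the SQL parser, so registered UDFs
should resolve there), then join on plain column references. Scala shown for
brevity; the Java DataFrame API exposes the same methods.

    // compute the UDF into a column, then join on ordinary columns
    val edmWithYear = edmData.selectExpr("*", "year(YEARBUILT) AS yb_year")
    val joined = edmWithYear.join(
      yb_lookup,
      edmWithYear("yb_year") === yb_lookup("yb_class_vdm"))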

 

Regards,

 

Shuai



Re: How to distribute Spark computation recipes

2015-04-27 Thread Reynold Xin
The code themselves are the recipies, no?


On Mon, Apr 27, 2015 at 2:49 AM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Hi everyone,
 I know that any RDD is related to its SparkContext and the associated
 variables (broadcast, accumulators), but I'm looking for a way to
 serialize/deserialize full RDD computations ?

 @rxin Spark SQL is, in a way, already doing this but the parsers are
 private[sql], is there any way to reuse this work to get Logical/Physical
 Plans in  out of Spark ?

 Regards,

 Olivier.



[SQL][1.3.1][JAVA] UDF in java cause Task not serializable

2015-04-27 Thread Shuai Zheng
Hi All,

 

Basically I try to define a simple UDF and use it in a query, but it gives
me "Task not serializable".

 

   public void test() {
      RiskGroupModelDefinition model = registeredRiskGroupMap.get(this.modelId);
      RiskGroupModelDefinition edm = this.createEdm();
      JavaSparkContext ctx = this.createSparkContext();
      SQLContext sql = new SQLContext(ctx);

      sql.udf().register("year", new UDF1<Date, Integer>() {
         @Override
         public Integer call(Date d) throws Exception {
            return d.getYear();
         }
      }, new org.apache.spark.sql.types.IntegerType());

      /** Retrieve all tables for EDM */
      DataFrame property = sql.parquetFile(edm.toS3nPath("property")).filter("ISVALID = 1");
      property.registerTempTable("p");

      DataFrame yb_lookup = sql.parquetFile(model.toS3nPath("yb_lookup")).as("yb");
      yb_lookup.registerTempTable("yb");

      sql.sql("select * from p left join yb on year(p.YEARBUILT)=yb.yb_class_vdm").count();

      ctx.stop();
   }

 

If I remove the UDF and just use p.YEARBUILT=yb.yb_class_vdm, the SQL runs
without any problem. But after I add the UDF to the query (as in the code
above), I get the exception below:

 

 

Exception in thread main
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute,
tree:

Aggregate false, [], [Coalesce(SUM(PartialCount#43L),0) AS count#41L]

Exchange SinglePartition

  Aggregate true, [], [COUNT(1) AS PartialCount#43L]

   Project []

HashOuterJoin [scalaUDF(YEARBUILT#14)], [yb_class_vdm#40L], LeftOuter,
None

 Exchange (HashPartitioning [scalaUDF(YEARBUILT#14)], 200)

  Project [YEARBUILT#14]

   Filter ISVALID#18

PhysicalRDD [YEARBUILT#14,ISVALID#18], MapPartitionsRDD[1] at map at
newParquet.scala:573

 Exchange (HashPartitioning [yb_class_vdm#40L], 200)

  PhysicalRDD [yb_class_vdm#40L], MapPartitionsRDD[3] at map at
newParquet.scala:573

 

   at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)

   at
org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:124)

   at
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)

   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:887)

   at org.apache.spark.sql.DataFrame.count(DataFrame.scala:899)

   at
com.validusresearch.middleware.executor.VulnabilityEncodeExecutor.test(Vulna
bilityEncodeExecutor.java:137)

   at
com.validusresearch.middleware.executor.VulnabilityEncodeExecutor.main(Vulna
bilityEncodeExecutor.java:488)

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
execute, tree:

Exchange SinglePartition

Aggregate true, [], [COUNT(1) AS PartialCount#43L]

  Project []

   HashOuterJoin [scalaUDF(YEARBUILT#14)], [yb_class_vdm#40L], LeftOuter,
None

Exchange (HashPartitioning [scalaUDF(YEARBUILT#14)], 200)

 Project [YEARBUILT#14]

  Filter ISVALID#18

   PhysicalRDD [YEARBUILT#14,ISVALID#18], MapPartitionsRDD[1] at map at
newParquet.scala:573

Exchange (HashPartitioning [yb_class_vdm#40L], 200)

 PhysicalRDD [yb_class_vdm#40L], MapPartitionsRDD[3] at map at
newParquet.scala:573

 

   at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)

   at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:48)

   at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.
scala:126)

   at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.
scala:125)

   at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)

   ... 6 more

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
execute, tree:

Aggregate true, [], [COUNT(1) AS PartialCount#43L]

Project []

  HashOuterJoin [scalaUDF(YEARBUILT#14)], [yb_class_vdm#40L], LeftOuter,
None

   Exchange (HashPartitioning [scalaUDF(YEARBUILT#14)], 200)

Project [YEARBUILT#14]

 Filter ISVALID#18

  PhysicalRDD [YEARBUILT#14,ISVALID#18], MapPartitionsRDD[1] at map at
newParquet.scala:573

   Exchange (HashPartitioning [yb_class_vdm#40L], 200)

PhysicalRDD [yb_class_vdm#40L], MapPartitionsRDD[3] at map at
newParquet.scala:573

 

   at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)

   at
org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:124)

   at
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.sc
ala:101)

   at
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.sc
ala:49)

   at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)

   ... 10 more

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
execute, tree:

Exchange 
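
A hedged observation, not a confirmed diagnosis: a Java anonymous inner class
keeps a reference to its enclosing instance, so a UDF defined this way can drag
a non-serializable outer object into the serialized task. One way to rule that
out is to register a UDF that captures nothing from the enclosing class, e.g. a
static nested class implementing UDF1, or in Scala terms something like this
minimal sketch:

    // the function literal closes over nothing, so serializing it cannot pull
    // in a non-serializable outer instance
    import java.sql.Date
    import org.apache.spark.sql.SQLContext

    def registerYearUdf(sqlContext: SQLContext): Unit =
      sqlContext.udf.register("year", (d: Date) => d.getYear)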

RE: Group by order by

2015-04-27 Thread Ulanov, Alexander
Thanks, it should be
“select id, time, min(x1), min(x2), … from data group by id, time order by time”

(“min” or other aggregate function to pick other fields)

Forgot to mention that (id, time) is my primary key and I took for granted that 
it worked in my MySQL example.
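
For the "just RDD" part of the original question, a hedged sketch without Spark
SQL, assuming an RDD named rows of (id, time, x1, x2) tuples and min as the
aggregate:

    // group by the (id, time) composite key, aggregate the remaining fields,
    // then produce a time-ordered result
    val grouped = rows
      .map { case (id, time, x1, x2) => ((id, time), (x1, x2)) }
      .reduceByKey { case ((a1, a2), (b1, b2)) => (math.min(a1, b1), math.min(a2, b2)) }
      .sortBy { case ((_, time), _) => time }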

Best regards, Alexander


From: Richard Marscher [mailto:rmarsc...@localytics.com]
Sent: Monday, April 27, 2015 12:47 PM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject: Re: Group by order by

It's not related to Spark, but the concept of what you are trying to do with 
the data. Grouping by ID means consolidating data for each ID down to 1 row per 
ID. You can sort by time after this point yes, but you would need to either 
take each ID and time value pair OR do some aggregate operation on the time. 
That's what the error message is explaining. Maybe you can describe what you 
want your results to look like?

Here is some detail about the underlying operations here:

Example Data:
ID | Time     | SomeVal

1  | 02-02-15 | 4
1  | 02-03-15 | 5
2  | 02-02-15 | 4
2  | 02-02-15 | 5
2  | 02-05-15 | 2

A.

So if you do Group By ID this means 1 row per ID like below:

ID

1
2

To include Time in this projection you need to aggregate it with a function to 
a single value. Then and only then can you use it in the projection and sort on 
it.

SELECT id, max(time) FROM sample GROUP BY id SORT BY max(time) desc;

ID | max(time)
2  | 02-05-15
1  | 02-03-15

B.

Or if you do Group by ID, time then you get 1 row per ID and time pair:

ID | Time
1  | 02-02-15
1  | 02-03-15
2  | 02-02-15
2  | 02-05-15

Notice both rows with ID `2` and time `02-02-15` group down to 1 row in the 
results here. In this case you can sort the results by time without using an 
aggregate function.

SELECT id, time FROM sample GROUP BY id, time SORT BY time desc;

ID | Time
2  | 02-05-15
1  | 02-03-15
1  | 02-02-15
2  | 02-02-15

On Mon, Apr 27, 2015 at 3:28 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:
Hi Richard,

There are several values of time per id. Is there a way to perform group by id 
and sort by time in Spark?

Best regards, Alexander

From: Richard Marscher [mailto:rmarsc...@localytics.com]
Sent: Monday, April 27, 2015 12:20 PM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject: Re: Group by order by

Hi,

that error seems to indicate the basic query is not properly expressed. If you 
group by just ID, then that means it would need to aggregate all the time 
values into one value per ID, so you can't sort by it. Thus it tries to suggest 
an aggregate function for time so you can have 1 value per ID and properly sort 
it.

On Mon, Apr 27, 2015 at 3:07 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:
Hi,

Could you suggest what is the best way to do “group by x order by y” in Spark?

When I try to perform it with Spark SQL I get the following error (Spark 1.3):

val results = sqlContext.sql("select * from sample group by id order by time")
org.apache.spark.sql.AnalysisException: expression 'time' is neither present in 
the group by, nor is it an aggregate function. Add to group by or wrap in 
first() if you don't care which value you get.;
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:37)

Is there a way to do it with just RDD?

Best regards, Alexander




Re: Spark timeout issue

2015-04-27 Thread Akhil Das
You need to look deeper into your worker logs; if you look closely you may
find GC errors, IO exceptions, etc. that are triggering the timeout.

Thanks
Best Regards

On Mon, Apr 27, 2015 at 3:18 AM, Deepak Gopalakrishnan dgk...@gmail.com
wrote:

 Hello Patrick,

 Sure. I've posted this on user as well. Will be cool to get a response.

 Thanks
 Deepak

 On Mon, Apr 27, 2015 at 2:58 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hi Deepak - please direct this to the user@ list. This list is for
 development of Spark itself.

 On Sun, Apr 26, 2015 at 12:42 PM, Deepak Gopalakrishnan
 dgk...@gmail.com wrote:
  Hello All,
 
  I'm trying to process a 3.5GB file in standalone mode using Spark. I could
  run my Spark job successfully on a 100MB file and it works as expected. But
  when I try to run it on the 3.5GB file, I run into the below error:
 
 
  15/04/26 12:45:50 INFO BlockManagerMaster: Updated info of block
 taskresult_83
  15/04/26 12:46:46 WARN AkkaUtils: Error sending message [message =
  Heartbeat(2,[Lscala.Tuple2;@790223d3,BlockManagerId(2,
  master.spark.com, 39143))] in 1 attempts
  java.util.concurrent.TimeoutException: Futures timed out after [30
 seconds]
  at
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
  at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
  at
 scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
  at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:107)
  at
 org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
  at
 org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
  15/04/26 12:47:15 INFO MemoryStore: ensureFreeSpace(26227673) called
  with curMem=265897, maxMem=5556991426
  15/04/26 12:47:15 INFO MemoryStore: Block taskresult_92 stored as
  bytes in memory (estimated size 25.0 MB, free 5.2 GB)
  15/04/26 12:47:16 INFO MemoryStore: ensureFreeSpace(26272879) called
  with curMem=26493570, maxMem=5556991426
  15/04/26 12:47:16 INFO MemoryStore: Block taskresult_94 stored as
  bytes in memory (estimated size 25.1 MB, free 5.1 GB)
  15/04/26 12:47:18 INFO MemoryStore: ensureFreeSpace(26285327) called
  with curMem=52766449, maxMem=5556991426
 
 
  and the job fails.
 
 
  I'm on AWS and have opened all ports. Also, since the 100MB file works,
 it
  should not be a connection issue.  I've a r3 xlarge and 2 m3 large.
 
  Can anyone suggest a way to fix this?




 --
 Regards,
 *Deepak Gopalakrishnan*
 *Mobile*:+918891509774
 *Skype* : deepakgk87
 http://myexps.blogspot.com




Re: How to debug Spark on Yarn?

2015-04-27 Thread ๏̯͡๏
Spark 1.3

1. View stderr/stdout from the executor in the Web UI: when the job is running I
figured out the executor I am supposed to look at, and those two links show 4
special characters in the browser.

2. Tail on Yarn logs:

/apache/hadoop/bin/yarn logs -applicationId
application_1429087638744_151059 | less
Threw me: Application has not completed. Logs are only available after an
application completes


Any other ideas that i can try ?



On Sat, Apr 25, 2015 at 12:07 AM, Sven Krasser kras...@gmail.com wrote:

 On Fri, Apr 24, 2015 at 11:31 AM, Marcelo Vanzin van...@cloudera.com
 wrote:


 Spark 1.3 should have links to the executor logs in the UI while the
 application is running. Not yet in the history server, though.


 You're absolutely correct -- didn't notice it until now. This is a great
 addition!

 --
 www.skrasser.com http://www.skrasser.com/?utm_source=sig




-- 
Deepak


Re: Understand the running time of SparkSQL queries

2015-04-27 Thread Akhil Das
Isn't it already available on the driver UI (that runs on 4040)?

Thanks
Best Regards

On Mon, Apr 27, 2015 at 9:55 AM, Wenlei Xie wenlei@gmail.com wrote:

 Hi,

 I am wondering how should we understand the running time of SparkSQL
 queries? For example the physical query plan and the running time on each
 stage? Is there any guide talking about this?

 Thank you!

 Best,
 Wenlei




Re: Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread ayan guha
The answer is it depends :)

The fact that query runtime increases indicates more shuffle. You may want
to construct rdds based on keys you use.

You may want to specify what kind of node you are using and how many
executors you are using. You may also want to play around with executor
memory allocations.

Best
Ayan
On 27 Apr 2015 17:59, Mani man...@vt.edu wrote:

 Hi,

 I am a graduate student from Virginia Tech (USA) pursuing my Masters in
 Computer Science. I’ve been researching on parallel and distributed
 databases and their performance for running some Range queries involving
 simple joins and group by on large datasets. As part of my research, I
 tried evaluating query performance of Spark SQL on the data set that I
 have. It would be really great if you could please confirm on the numbers
 that I get from Spark SQL? Following is the type of query that am running,

 Table 1 - 22,000,483 records
 Table 2 - 10,173,311 records

 Query : SELECT b.x, count(b.y) FROM Table1 a, Table2 b WHERE a.y=b.y AND
 a.z=‘' GROUP BY b.x ORDER BY b.x

 Total Running Time
 4 Worker Nodes:177.68s
 8 Worker Nodes: 186.72s

 I am using Apache Spark 1.3.0 with the default configuration. Is the query
 running time reasonable? Is it because of non-availability of indexes
 increasing the query run time? Can you please clarify?

 Thanks
 Mani
 Graduate Student, Department of Computer Science
 Virginia Tech









A problem of using spark streaming to capture network packets

2015-04-27 Thread Hai Shan Wu

Hi Everyone

We use pcap4j to capture network packets and then use spark streaming to
analyze captured packets. However, we met a strange problem.

If we run our application on spark locally (for example, spark-submit
--master local[2]), then the program runs successfully.

If we run our application on a Spark standalone cluster, then the program
tells us that no NIFs were found.

I also attach two test files for clarification.

So anyone can help on this? Thanks in advance!


(See attached file: PcapReceiver.java)(See attached file:
TestPcapSpark.java)

Best regards,

- Haishan

Haishan Wu (吴海珊)

IBM Research - China
Tel: 86-10-58748508
Fax: 86-10-58748330
Email: wuh...@cn.ibm.com
Lotus Notes: Hai Shan Wu/China/IBM

PcapReceiver.java
Description: Binary data


TestPcapSpark.java
Description: Binary data


Re: How to debug Spark on Yarn?

2015-04-27 Thread ๏̯͡๏
1) Application container logs from the RM Web UI never load in the browser; I
eventually have to kill the browser.
2) /apache/hadoop/bin/yarn logs -applicationId
application_1429087638744_151059 | less
emits logs only after the application has completed.

Are there no better ways to see the logs as they are emitted, something
similar to the Hadoop world?


On Mon, Apr 27, 2015 at 1:58 PM, Zoltán Zvara zoltan.zv...@gmail.com
wrote:

 You can check container logs from RM web UI or when log-aggregation is
 enabled with the yarn command. There are other, but less convenient
 options.

 On Mon, Apr 27, 2015 at 8:53 AM ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Spark 1.3

 1. View stderr/stdout from executor from Web UI: when the job is running
 i figured out the executor that am suppose to see, and those two links show
 4 special characters on browser.

 2. Tail on Yarn logs:

 /apache/hadoop/bin/yarn logs -applicationId
 application_1429087638744_151059 | less
 Threw me: Application has not completed. Logs are only available after an
 application completes


 Any other ideas that i can try ?



 On Sat, Apr 25, 2015 at 12:07 AM, Sven Krasser kras...@gmail.com wrote:

 On Fri, Apr 24, 2015 at 11:31 AM, Marcelo Vanzin van...@cloudera.com
 wrote:


 Spark 1.3 should have links to the executor logs in the UI while the
 application is running. Not yet in the history server, though.


 You're absolutely correct -- didn't notice it until now. This is a great
 addition!

 --
 www.skrasser.com http://www.skrasser.com/?utm_source=sig




 --
 Deepak




-- 
Deepak


Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread Mani
Hi,

I am a graduate student from Virginia Tech (USA) pursuing my Masters in 
Computer Science. I’ve been researching on parallel and distributed databases 
and their performance for running some Range queries involving simple joins and 
group by on large datasets. As part of my research, I tried evaluating query 
performance of Spark SQL on the data set that I have. It would be really great 
if you could please confirm on the numbers that I get from Spark SQL? Following 
is the type of query that am running,

Table 1 - 22,000,483 records
Table 2 - 10,173,311 records

Query : SELECT b.x, count(b.y) FROM Table1 a, Table2 b WHERE a.y=b.y AND 
a.z=‘' GROUP BY b.x ORDER BY b.x

Total Running Time
4 Worker Nodes:177.68s
8 Worker Nodes: 186.72s

I am using Apache Spark 1.3.0 with the default configuration. Is the query 
running time reasonable? Is it because of non-availability of indexes 
increasing the query run time? Can you please clarify?

Thanks
Mani
Graduate Student, Department of Computer Science
Virginia Tech








Re: How to debug Spark on Yarn?

2015-04-27 Thread Zoltán Zvara
You can check container logs from RM web UI or when log-aggregation is
enabled with the yarn command. There are other, but less convenient options.

On Mon, Apr 27, 2015 at 8:53 AM ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Spark 1.3

 1. View stderr/stdout from executor from Web UI: when the job is running i
 figured out the executor that am suppose to see, and those two links show 4
 special characters on browser.

 2. Tail on Yarn logs:

 /apache/hadoop/bin/yarn logs -applicationId
 application_1429087638744_151059 | less
 Threw me: Application has not completed. Logs are only available after an
 application completes


 Any other ideas that i can try ?



 On Sat, Apr 25, 2015 at 12:07 AM, Sven Krasser kras...@gmail.com wrote:

 On Fri, Apr 24, 2015 at 11:31 AM, Marcelo Vanzin van...@cloudera.com
 wrote:


 Spark 1.3 should have links to the executor logs in the UI while the
 application is running. Not yet in the history server, though.


 You're absolutely correct -- didn't notice it until now. This is a great
 addition!

 --
 www.skrasser.com http://www.skrasser.com/?utm_source=sig




 --
 Deepak




Re: ReduceByKey and sorting within partitions

2015-04-27 Thread Saisai Shao
Hi Marco,

As far as I know, the current combineByKey() does not expose an argument for
setting keyOrdering on the ShuffledRDD, and ShuffledRDD is package private. If
you can get at the ShuffledRDD through reflection or some other way, the
keyOrdering you set will be pushed down to the shuffle. If you use a
combination of other transformations instead, the result will be the same but
the efficiency may differ: some transformations will be separated into
different stages, which introduces additional shuffles.
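
For completeness, a hedged sketch of the combination-of-transformations route
(not the ShuffledRDD/keyOrdering approach above), assuming an RDD[(K, V)] named
pairs with an Ordering on K: reduce first, then do one explicit
range-partitioned, sorted shuffle, which orders data across and within
partitions at the cost of that extra shuffle.

    import org.apache.spark.RangePartitioner

    val reduced = pairs.reduceByKey(_ + _)
    val partitioner = new RangePartitioner(8, reduced)  // range-partition by key
    val globallySorted = reduced.repartitionAndSortWithinPartitions(partitioner)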

Thanks
Jerry


2015-04-27 19:00 GMT+08:00 Marco marcope...@gmail.com:

 Hi,

 I'm trying, after reducing by key, to get data ordered among partitions
 (like RangePartitioner) and within partitions (like sortByKey or
 repartitionAndSortWithinPartition) pushing the sorting down to the
 shuffles machinery of the reducing phase.

 I think, but maybe I'm wrong, that the correct way to do that is that
 combineByKey call setKeyOrdering function on the ShuflleRDD that it
 returns.

 Am I wrong? Can be done by a combination of other transformations with
 the same efficiency?

 Thanks,
 Marco





Re: spark-defaults.conf

2015-04-27 Thread James King
Thanks.

I've set SPARK_HOME and SPARK_CONF_DIR appropriately in .bash_profile

But when I start worker like this

spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh

I still get

failed to launch org.apache.spark.deploy.worker.Worker:
 Default is conf/spark-defaults.conf.
  15/04/27 11:51:33 DEBUG Utils: Shutdown hook called
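
A hedged guess based on the Spark 1.3 standalone scripts: that "failed to
launch" output is the Worker's usage text, which is printed when the master URL
argument is missing, so the problem may be the arguments rather than
spark-defaults.conf. In 1.3, start-slave.sh expects a worker number and the
master URL, e.g. (the master host is a placeholder):

    spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh 1 spark://<master-host>:7077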





On Mon, Apr 27, 2015 at 1:15 PM, Zoltán Zvara zoltan.zv...@gmail.com
wrote:

 You should distribute your configuration file to workers and set the
 appropriate environment variables, like HADOOP_HOME, SPARK_HOME,
 HADOOP_CONF_DIR, SPARK_CONF_DIR.

 On Mon, Apr 27, 2015 at 12:56 PM James King jakwebin...@gmail.com wrote:

 I renamed spark-defaults.conf.template to spark-defaults.conf
 and invoked

 spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh

 But I still get

 failed to launch org.apache.spark.deploy.worker.Worker:
 --properties-file FILE   Path to a custom Spark properties file.
  Default is conf/spark-defaults.conf.

 But I'm thinking it should pick up the default spark-defaults.conf from
 conf dir

 Am I expecting or doing something wrong?

 Regards
 jk





Bigints in pyspark

2015-04-27 Thread jamborta
hi all,

I have just come across a problem where I have a table that has a few bigint
columns. It seems that if I read that table into a dataframe and then collect it
in pyspark, the bigints are stored as integers in Python.

(The problem is that when I write it back to another table, I detect the Hive
type programmatically from the Python type, so it turns those columns into
integers.)

Is that intended this way or a bug?

thanks,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Bigints-in-pyspark-tp22668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: SQL UDF returning object of case class; regression from 1.2.0

2015-04-27 Thread Ophir Cohen
A short update: eventually we manually upgraded to 1.3.1 and the problem was
fixed.
On Apr 26, 2015 2:26 PM, Ophir Cohen oph...@gmail.com wrote:

 I happened to hit the following issue that prevents me from using UDFs
 with case classes: https://issues.apache.org/jira/browse/SPARK-6054.

 The issue already fixed for 1.3.1 but we are working on Amazon and it
 looks that Amazon provide deployment of Spark 1.3.1 using their scripts.

 Did someone encounter the issue? Any suggestion will be happily taken,
 either for a workaround or for a way to deploy Spark 1.3.1 on EMR
 Thanks,
 Ophir



Exception in using updateStateByKey

2015-04-27 Thread Sea
Hi, all:
I use the updateStateByKey function in Spark Streaming. I need to store the
states for one minute, so I set spark.cleaner.ttl to 120; the batch duration is
2 seconds, but it throws an exception:




Caused by: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1499)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1448)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1428)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1402)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)


at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)



Why?


my code is 


ssc = StreamingContext(sc,2)
kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1})
kvs.window(60,2).map(lambda x: analyzeMessage(x[1]))\
.filter(lambda x: x[1] != None).updateStateByKey(updateStateFunc) \
.filter(lambda x: x[1]['isExisted'] != 1) \
.foreachRDD(lambda rdd: rdd.foreachPartition(insertIntoDb))

Re: Exception in using updateStateByKey

2015-04-27 Thread Ted Yu
Which hadoop release are you using ?

Can you check hdfs audit log to see who / when deleted spark/ck/hdfsaudit/
receivedData/0/log-1430139541443-1430139601443 ?

Cheers

On Mon, Apr 27, 2015 at 6:21 AM, Sea 261810...@qq.com wrote:

 Hi, all:
 I use function updateStateByKey in Spark Streaming, I need to store the
 states for one minite,  I set spark.cleaner.ttl to 120, the duration is 2
 seconds, but it throws Exception


 Caused by:
 org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File
 does not exist:
 spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443
 at
 org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
 at
 org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1499)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1448)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1428)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1402)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
 at
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
 at
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
 at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)

 at org.apache.hadoop.ipc.Client.call(Client.java:1347)
 at org.apache.hadoop.ipc.Client.call(Client.java:1300)
 at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
 at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
 at
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
 at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)

 Why?

 my code is

 ssc = StreamingContext(sc,2)
 kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1})
 kvs.window(60,2).map(lambda x: analyzeMessage(x[1]))\
 .filter(lambda x: x[1] != None).updateStateByKey(updateStateFunc) \
 .filter(lambda x: x[1]['isExisted'] != 1) \
 .foreachRDD(lambda rdd: rdd.foreachPartition(insertIntoDb))





Re: directory loader in windows

2015-04-27 Thread Steve Loughran

This a hadoop-side stack trace

it looks like the code is trying to get the filesystem permissions by running

%HADOOP_HOME%\bin\WINUTILS.EXE  ls -F


and something is triggering a null pointer exception.

There isn't any HADOOP- JIRA with this specific stack trace in it, so it's not 
a known/fixed problem.

At a guess, your environment HADOOP_HOME environment variable isn't point to 
the right place. If that's the case there should have been a warning in the logs




Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NullPointerException

at java.lang.ProcessBuilder.start(Unknown Source)

at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)

at org.apache.hadoop.util.Shell.run(Shell.java:455)

at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)

at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)

at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)

at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557)

at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)

at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699)

at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)

at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)

at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

at org.apache.spark.rdd.RDD.collect(RDD.scala:813)

at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374)

at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

at java.lang.reflect.Method.invoke(Unknown Source)

at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

at py4j.Gateway.invoke(Gateway.java:259)

at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

at py4j.commands.CallCommand.execute(CallCommand.java:79)

at py4j.GatewayConnection.run(GatewayConnection.java:207)

at java.lang.Thread.run(Unknown Source)

--
Best Regards,
Ayan Guha







Re: Exception in using updateStateByKey

2015-04-27 Thread Sea
I set it to 240, and it happens again once 240 seconds have passed.




------------------ Original message ------------------
From: "261810726" <261810...@qq.com>
Date: 2015-04-27 (Mon) 10:24
To: "Ted Yu" <yuzhih...@gmail.com>
Subject: Re: Exception in using updateStateByKey



Yes, I can make it larger, but I also want to know whether there is a formula 
to estimate it.




------------------ Original message ------------------
From: "Ted Yu" <yuzhih...@gmail.com>
Date: 2015-04-27 (Mon) 10:20
To: "Sea" <261810...@qq.com>
Subject: Re: Exception in using updateStateByKey



Can you make the value for spark.cleaner.ttl larger?

Cheers


On Mon, Apr 27, 2015 at 7:13 AM, Sea 261810...@qq.com wrote:
My hadoop version is 2.2.0, and the hdfs-audit.log is too large. The problem is 
that when the checkpoint info is deleted (which depends on 'spark.cleaner.ttl'), 
it throws this exception.
 



------------------ Original message ------------------
From: "Ted Yu" <yuzhih...@gmail.com>
Date: 2015-04-27 (Mon) 9:55
To: "Sea" <261810...@qq.com>
Cc: "user" <user@spark.apache.org>
Subject: Re: Exception in using updateStateByKey



Which hadoop release are you using ?

Can you check hdfs audit log to see who / when deleted 
spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443 ?


Cheers


On Mon, Apr 27, 2015 at 6:21 AM, Sea 261810...@qq.com wrote:
Hi, all:
I use function updateStateByKey in Spark Streaming, I need to store the states 
for one minite,  I set spark.cleaner.ttl to 120, the duration is 2 seconds, 
but it throws Exception 




Caused by: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1499)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1448)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1428)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1402)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)


at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)



Why?


my code is 


ssc = StreamingContext(sc,2)
kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1})
kvs.window(60,2).map(lambda x: analyzeMessage(x[1]))\
.filter(lambda x: x[1] != None).updateStateByKey(updateStateFunc) \
.filter(lambda x: x[1]['isExisted'] != 1) \
.foreachRDD(lambda rdd: rdd.foreachPartition(insertIntoDb))

RE: Slower performance when bigger memory?

2015-04-27 Thread Shuai Zheng
Thanks. May I know your configuration for more/smaller executors on r3.8xlarge, 
i.e. how much memory you eventually decided to give a single executor without 
impacting performance (for example, 64g)?

 

From: Sven Krasser [mailto:kras...@gmail.com] 
Sent: Friday, April 24, 2015 1:59 PM
To: Dean Wampler
Cc: Shuai Zheng; user@spark.apache.org
Subject: Re: Slower performance when bigger memory?

 

FWIW, I ran into a similar issue on r3.8xlarge nodes and opted for more/smaller 
executors. Another observation was that one large executor results in less 
overall read throughput from S3 (using Amazon's EMRFS implementation) in case 
that matters to your application.

-Sven

 

On Thu, Apr 23, 2015 at 10:18 AM, Dean Wampler deanwamp...@gmail.com wrote:

JVM's often have significant GC overhead with heaps bigger than 64GB. You might 
try your experiments with configurations below this threshold.

 

dean




Dean Wampler, Ph.D.

Author: Programming Scala, 2nd Edition 
http://shop.oreilly.com/product/0636920033073.do  (O'Reilly)

Typesafe http://typesafe.com 
@deanwampler http://twitter.com/deanwampler 

http://polyglotprogramming.com

 

On Thu, Apr 23, 2015 at 12:14 PM, Shuai Zheng szheng.c...@gmail.com wrote:

Hi All,

 

I am running some benchmarks on an r3.8xlarge instance. I have a cluster with one 
master (no executor on it) and one slave (r3.8xlarge).

 

My job has 1000 tasks in stage 0.

 

The r3.8xlarge has 244G of memory and 32 cores.

 

If I create 4 executors, each has 8 core+50G memory, each task will take around 
320s-380s. And if I only use one big executor with 32 cores and 200G memory, 
each task will take 760s-900s.

 

And I check the log, looks like the minor GC takes much longer when using 200G 
memory:

 

285.242: [GC [PSYoungGen: 29027310K->8646087K(31119872K)] 
38810417K->19703013K(135977472K), 11.2509770 secs] [Times: user=38.95 
sys=120.65, real=11.25 secs] 

 

And when it uses 50G memory, the minor GC takes only less than 1s.

 

I am trying to work out the best way to configure Spark. For a specific reason I 
am tempted to use a bigger memory on a single executor, provided there is no 
significant performance penalty. But now it looks like there is?

 

Anyone has any idea?

 

Regards,

 

Shuai

 




-- 

www.skrasser.com http://www.skrasser.com/?utm_source=sig 



Re: Super slow caching in 1.3?

2015-04-27 Thread Wenlei Xie
I face a similar issue in Spark 1.2. Caching the schema RDD takes about 50s
for 400MB of data. The schema is similar to the TPC-H LineItem.

Here is the code I tried the cache. I am wondering if there is any setting
missing?

Thank you so much!

lineitemSchemaRDD.registerTempTable("lineitem");
sqlContext.sqlContext().cacheTable("lineitem");
System.out.println(lineitemSchemaRDD.count());


On Mon, Apr 6, 2015 at 8:00 PM, Christian Perez christ...@svds.com wrote:

 Hi all,

 Has anyone else noticed very slow time to cache a Parquet file? It
 takes 14 s per 235 MB (1 block) uncompressed node local Parquet file
 on M2 EC2 instances. Or are my expectations way off...

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Wenlei Xie (谢文磊)

Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei@gmail.com


Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread bit1...@163.com
Hi,

I am frequently asked why Spark is still much faster than Hadoop MapReduce on 
disk (without the use of the memory cache). I have no convincing answer for this 
question; could you guys elaborate on this? Thanks!






Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread Ilya Ganelin
I believe the typical answer is that Spark is actually a bit slower.
On Mon, Apr 27, 2015 at 7:34 PM bit1...@163.com bit1...@163.com wrote:

 Hi,

 I am frequently asked why Spark is still much faster than Hadoop MapReduce
 on disk (without the use of the memory cache). I have no convincing answer for
 this question; could you guys elaborate on this? Thanks!

 --




New JIRA - [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns

2015-04-27 Thread Don Drake
https://issues.apache.org/jira/browse/SPARK-7182

Can anyone suggest a workaround for the above issue?

Thanks.

-Don

-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/


RE: Scalability of group by

2015-04-27 Thread Ulanov, Alexander
It works on a smaller dataset of 100 rows. Probably I could find the size when 
it fails using binary search. However, it would not help me because I need to 
work with 2B rows.

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Monday, April 27, 2015 6:58 PM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject: Re: Scalability of group by


Hi

Can you test on a smaller dataset to identify if it is cluster issue or scaling 
issue in spark
On 28 Apr 2015 11:30, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
Hi,

I am running a group by on a dataset of 2B of RDD[Row [id, time, value]] in 
Spark 1.3 as follows:
“select id, time, first(value) from data group by id, time”

My cluster is 8 nodes with 16GB RAM and one worker per node. Each executor is 
allocated with 5GB of memory. However, all executors are being lost during the 
query execution and I get “ExecutorLostFailure”.

Could you suggest what might be the reason for it? Could it be that “group by” 
is implemented as RDD.groupBy so it holds the group by result in memory? What 
is the workaround?

Best regards, Alexander
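
One mitigation often suggested for executor loss during a large SQL aggregation is to spread the shuffle over more, smaller partitions. A minimal sketch, assuming the 2B-row dataset is registered as the temp table "data" as in the query above; the partition count of 2000 is only illustrative:

// Spark 1.3: shuffles triggered by SQL aggregations use
// spark.sql.shuffle.partitions (default 200); raising it makes each
// reduce-side partition smaller and less likely to exhaust executor memory
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
val grouped = sqlContext.sql(
  "select id, time, first(value) from data group by id, time")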


Re: Automatic Cache in SparkSQL

2015-04-27 Thread ayan guha
Spark keeps job data in memory by default, which accounts for the kind of
performance gains you are seeing. Additionally, depending on your query, Spark
runs stages, and at any point in time Spark's code behind the scenes may issue an
explicit cache. If you hit such a scenario you will find those cached objects in
the UI under Storage. Note that if caching is done by Spark itself it may be transient.
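
To see whether a table was explicitly cached, the Spark 1.3 SQLContext API can be queried directly; a minimal sketch, where "mytable" is a hypothetical registered temp table name:

sqlContext.isCached("mytable")      // false unless cacheTable was called on it
sqlContext.cacheTable("mytable")    // explicit in-memory columnar caching
sqlContext.isCached("mytable")      // true
sqlContext.uncacheTable("mytable")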
On 28 Apr 2015 08:00, Wenlei Xie wenlei@gmail.com wrote:

 Hi,

 I am trying to answer a simple query with SparkSQL over the Parquet file.
 When execute the query several times, the first run will take about 2s
 while the later run will take 0.1s.

 By looking at the log file it seems the later runs doesn't load the data
 from disk. However, I didn't enable any cache explicitly. Is there any
 automatic cache used by SparkSQL? Is there anyway to check this?

 Thank you?

 Best,
 Wenlei





Re: Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread bit1...@163.com
Is it? I learned elsewhere that Spark is 5-10 times faster than 
Hadoop MapReduce.



bit1...@163.com
 
From: Ilya Ganelin
Date: 2015-04-28 10:55
To: bit1...@163.com; user
Subject: Re: Why Spark is much faster than Hadoop MapReduce even on disk
I believe the typical answer is that Spark is actually a bit slower. 
On Mon, Apr 27, 2015 at 7:34 PM bit1...@163.com bit1...@163.com wrote:
Hi,

I am frequently asked why Spark is still much faster than Hadoop MapReduce on 
disk (without the use of the memory cache). I have no convincing answer for this 
question; could you guys elaborate on this? Thanks!






Spark 1.3.1 JavaStreamingContext - fileStream compile error

2015-04-27 Thread lokeshkumar
Hi Forum

I am facing below compile error when using the fileStream method of the
JavaStreamingContext class.
I have copied the code from JavaAPISuite.java test class of spark test code.

Please help me to find a solution for this.

http://apache-spark-user-list.1001560.n3.nabble.com/file/n22679/47.png 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-JavaStreamingContext-fileStream-compile-error-tp22679.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scalability of group by

2015-04-27 Thread ayan guha
Hi

Can you test on a smaller dataset to identify if it is cluster issue or
scaling issue in spark
On 28 Apr 2015 11:30, Ulanov, Alexander alexander.ula...@hp.com wrote:

  Hi,



 I am running a group by on a dataset of 2B of RDD[Row [id, time, value]]
 in Spark 1.3 as follows:

 “select id, time, first(value) from data group by id, time”



 My cluster is 8 nodes with 16GB RAM and one worker per node. Each executor
 is allocated with 5GB of memory. However, all executors are being lost
 during the query execution and I get “ExecutorLostFailure”.



 Could you suggest what might be the reason for it? Could it be that “group
 by” is implemented as RDD.groupBy so it holds the group by result in
 memory? What is the workaround?



 Best regards, Alexander



java.lang.UnsupportedOperationException: empty collection

2015-04-27 Thread xweb
I am running following code on Spark 1.3.0. It is from
https://spark.apache.org/docs/1.3.0/ml-guide.html
On running val model1 = lr.fit(training.toDF) I get
java.lang.UnsupportedOperationException: empty collection

what could be the reason?



import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.{Row, SQLContext}

val conf = new SparkConf().setAppName("SimpleParamsExample")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Prepare training data.
// We use LabeledPoint, which is a case class.  Spark SQL can convert RDDs
of case classes
// into DataFrames, where it uses the case class metadata to infer the
schema.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))

// Create a LogisticRegression instance.  This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model.  This uses the parameters stored in lr.
val model1 = lr.fit(training.toDF)


Some more information:

scala> training.toDF
res26: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> training.toDF.collect()
res27: Array[org.apache.spark.sql.Row] = Array([1.0,[0.0,1.1,0.1]],
[0.0,[2.0,1.0,-1.0]], [0.0,[2.0,1.3,1.0]], [1.0,[0.0,1.2,-0.5]])




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-UnsupportedOperationException-empty-collection-tp22677.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



1.3.1: Persisting RDD in parquet - Conflicting partition column names

2015-04-27 Thread sranga
Hi

I am getting the following error when persisting an RDD in parquet format to
an S3 location. This is code that was working in the 1.2 version. The
version that it is failing to work is 1.3.1.
Any help is appreciated. 

Caused by: java.lang.AssertionError: assertion failed: Conflicting partition
column names detected:
ArrayBuffer(batch_id)
ArrayBuffer()
at scala.Predef$.assert(Predef.scala:179)
at
org.apache.spark.sql.parquet.ParquetRelation2$.resolvePartitions(newParquet.scala:933)
at
org.apache.spark.sql.parquet.ParquetRelation2$.parsePartitions(newParquet.scala:851)
at
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$7.apply(newParquet.scala:311)
at
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$7.apply(newParquet.scala:303)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:303)
at
org.apache.spark.sql.parquet.ParquetRelation2.insert(newParquet.scala:692)
at
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:129)
at
org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
at
org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:995)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/1-3-1-Persisting-RDD-in-parquet-Conflicting-partition-column-names-tp22678.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread Michael Malak
http://www.datascienceassn.org/content/making-sense-making-sense-performance-data-analytics-frameworks
  
  From: bit1...@163.com bit1...@163.com
 To: user user@spark.apache.org 
 Sent: Monday, April 27, 2015 8:33 PM
 Subject: Why Spark is much faster than Hadoop MapReduce even on disk
   
Hi,
I am frequently asked why Spark is still much faster than Hadoop MapReduce on 
disk (without the use of the memory cache). I have no convincing answer for this 
question; could you guys elaborate on this? Thanks!



  

Re: Spark Cluster Setup

2015-04-27 Thread Denny Lee
Similar to what Dean called out, we built Puppet manifests so we could do
the automation. It's a bit of work to set up, but well worth the effort.

On Fri, Apr 24, 2015 at 11:27 AM Dean Wampler deanwamp...@gmail.com wrote:

 It's mostly manual. You could try automating with something like Chef, of
 course, but there's nothing already available in terms of automation.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Fri, Apr 24, 2015 at 10:33 AM, James King jakwebin...@gmail.com
 wrote:

 Thanks Dean,

 Sure I have that setup locally and testing it with ZK.

 But to start my multiple masters, do I need to go to each host and start one
 there, or is there a better way to do this?

 Regards
 jk

 On Fri, Apr 24, 2015 at 5:23 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 The convention for standalone cluster is to use Zookeeper to manage
 master failover.

 http://spark.apache.org/docs/latest/spark-standalone.html

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Fri, Apr 24, 2015 at 5:01 AM, James King jakwebin...@gmail.com
 wrote:

 I'm trying to find out how to setup a resilient Spark cluster.

 Things I'm thinking about include:

 - How to start multiple masters on different hosts?
 - there isn't a conf/masters file from what I can see


 Thank you.







Driver ID from spark-submit

2015-04-27 Thread Rares Vernica
Hello,

I am trying to use the default Spark cluster manager in a production
environment. I will be submitting jobs with spark-submit. I wonder if the
following is possible:

1. Get the Driver ID from spark-submit. We will use this ID to keep track
of the job and kill it if necessary.

2. Whether it is possible to run spark-submit in a mode where it exits and
returns control to the user immediately after the job is submitted.

Thanks!
Rares
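
For the standalone cluster manager, one way to get a handle on the driver is to submit in cluster deploy mode, which launches the driver inside the cluster, prints a submission/driver ID, and returns. A hedged sketch, with the master URL, jar path, class name and driver ID all placeholders:

# submit; the driver runs in the cluster and an ID is printed on stdout
spark-submit --master spark://master:7077 --deploy-mode cluster \
  --class com.example.MyJob /path/to/my-job.jar

# later, check on or kill the driver using the ID printed at submit time
spark-submit --master spark://master:7077 --status driver-20150427120000-0001
spark-submit --master spark://master:7077 --kill driver-20150427120000-0001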


Re: Exception in using updateStateByKey

2015-04-27 Thread Sea
Maybe I found the solution: do not set 'spark.cleaner.ttl'; just use the function 
'remember' in StreamingContext to set the rememberDuration.
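
A minimal PySpark sketch of that approach, assuming the same 2-second batch interval and 60-second window as in the code quoted below; the 120-second remember duration is illustrative:

ssc = StreamingContext(sc, 2)
# keep generated RDDs (and their checkpoint/log data) for at least the
# longest window, instead of relying on spark.cleaner.ttl
ssc.remember(120)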




------------------ Original message ------------------
From: "Ted Yu" <yuzhih...@gmail.com>
Date: 2015-04-27 (Mon) 10:20
To: "Sea" <261810...@qq.com>
Subject: Re: Exception in using updateStateByKey



Can you make the value for spark.cleaner.ttl larger?

Cheers


On Mon, Apr 27, 2015 at 7:13 AM, Sea 261810...@qq.com wrote:
My hadoop version is 2.2.0, and the hdfs-audit.log is too large. The problem is 
that when the checkpoint info is deleted (which depends on 'spark.cleaner.ttl'), 
it throws this exception.
 



------------------ Original message ------------------
From: "Ted Yu" <yuzhih...@gmail.com>
Date: 2015-04-27 (Mon) 9:55
To: "Sea" <261810...@qq.com>
Cc: "user" <user@spark.apache.org>
Subject: Re: Exception in using updateStateByKey



Which hadoop release are you using ?

Can you check hdfs audit log to see who / when deleted 
spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443 ?


Cheers


On Mon, Apr 27, 2015 at 6:21 AM, Sea 261810...@qq.com wrote:
Hi, all:
I use function updateStateByKey in Spark Streaming, I need to store the states 
for one minite,  I set spark.cleaner.ttl to 120, the duration is 2 seconds, 
but it throws Exception 




Caused by: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1499)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1448)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1428)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1402)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)


at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)



Why?


my code is 


ssc = StreamingContext(sc,2)
kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1})
kvs.window(60,2).map(lambda x: analyzeMessage(x[1]))\
.filter(lambda x: x[1] != None).updateStateByKey(updateStateFunc) \
.filter(lambda x: x[1]['isExisted'] != 1) \
.foreachRDD(lambda rdd: rdd.foreachPartition(insertIntoDb))

Spark JDBC data source API issue with mysql

2015-04-27 Thread madhu phatak
Hi,
 I have been trying out spark data source api with JDBC. The following is
the code to get DataFrame,

 Try(hc.load("org.apache.spark.sql.jdbc", Map("url" -> dbUrl, "dbtable" -> s"($query)")))


By looking at the test cases, I found that the query has to be inside parentheses,
otherwise it's treated as a table name. But when used with MySQL, the query
inside the ( ) is treated as a derived table, which throws an exception. Is
this the right way to pass queries to the JDBC source, or am I missing
something?
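
For what it's worth, MySQL requires every derived table to carry an alias, so adding one is a common workaround. A hedged sketch reusing hc, dbUrl and query from the snippet above (the alias name tmp is arbitrary):

// "dbtable" can be any expression valid in a FROM clause, so an aliased
// subquery satisfies both the Spark JDBC source and MySQL's parser
Try(hc.load("org.apache.spark.sql.jdbc",
  Map("url" -> dbUrl, "dbtable" -> s"($query) AS tmp")))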


Regards,
Madhukara Phatak
http://datamantra.io/


Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread ๏̯͡๏
I downloaded 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple mirrors
and direct link. Each time i untar i get below error

spark-1.3.1-bin-hadoop2.4/lib/spark-assembly-1.3.1-hadoop2.4.0.jar: (Empty
error message)

tar: Error exit delayed from previous errors


Is it broken ?

-- 
Deepak


RE: ReduceByKey and sorting within partitions

2015-04-27 Thread Ganelin, Ilya
Marco - why do you want data sorted both within and across partitions? If you 
need to take an ordered sequence across all your data you need to either 
aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an 
ordered index to your data that matches the order it was stored on HDFS. You 
can then get the data in order by filtering based on that index. Let me know if 
that's not what you need - thanks!
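
For reference, the closest built-in combination to what Marco describes is the aggregation followed by repartitionAndSortWithinPartitions with a RangePartitioner, which gives ordering across partitions and sorting within each partition, at the cost of a second shuffle. A minimal sketch, where pairs is a placeholder RDD[(K, V)] with numeric values and the 8 output partitions are illustrative:

import org.apache.spark.RangePartitioner

val reduced = pairs.reduceByKey(_ + _)            // first shuffle: aggregation
val byRange = new RangePartitioner(8, reduced)    // key ranges => global order across partitions
val ordered = reduced.repartitionAndSortWithinPartitions(byRange)  // second shuffle: sorted within each partition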



Sent with Good (www.good.com)


-Original Message-
From: Marco [marcope...@gmail.com]
Sent: Monday, April 27, 2015 07:01 AM Eastern Standard Time
To: user@spark.apache.org
Subject: ReduceByKey and sorting within partitions


Hi,

I'm trying, after reducing by key, to get the data ordered among partitions
(like RangePartitioner) and within partitions (like sortByKey or
repartitionAndSortWithinPartitions), pushing the sorting down into the
shuffle machinery of the reduce phase.

I think, but maybe I'm wrong, that the correct way to do that would be for
combineByKey to call the setKeyOrdering function on the ShuffledRDD that it returns.

Am I wrong? Can be done by a combination of other transformations with
the same efficiency?

Thanks,
Marco

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed.  If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


data locality in spark

2015-04-27 Thread Grandl Robert
Hi guys,
I am running some SQL queries, but all my tasks are reported as either 
NODE_LOCAL or PROCESS_LOCAL. 
In case of Hadoop world, the reduce tasks are RACK or NON_RACK LOCAL because 
they have to aggregate data from multiple hosts. However, in Spark even the 
aggregation stages are reported as NODE/PROCESS LOCAL.
Do I miss something, or why the reduce-like tasks are still NODE/PROCESS LOCAL ?
Thanks,Robert



RE: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread Ganelin, Ilya
What command are you using to untar? Are you running out of disk space?



Sent with Good (www.good.com)


-Original Message-
From: ÐΞ€ρ@Ҝ (๏̯͡๏) [deepuj...@gmail.com]
Sent: Monday, April 27, 2015 11:44 AM Eastern Standard Time
To: user
Subject: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

I downloaded 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple mirrors and 
direct link. Each time i untar i get below error


spark-1.3.1-bin-hadoop2.4/lib/spark-assembly-1.3.1-hadoop2.4.0.jar: (Empty 
error message)

tar: Error exit delayed from previous errors


Is it broken ?

--
Deepak



The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed.  If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: Timeout Error

2015-04-27 Thread Deepak Gopalakrishnan
Hello All,

I dug a little deeper and found this error :

15/04/27 16:05:39 WARN TransportChannelHandler: Exception in
connection from /10.1.0.90:40590
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
15/04/27 16:05:39 ERROR TransportRequestHandler: Error sending result
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=45314884029,
chunkIndex=0}, buffer=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0
lim=26227673 cap=26227673]}} to /10.1.0.90:40590; closing connection
java.nio.channels.ClosedChannelException
15/04/27 16:05:39 ERROR TransportRequestHandler: Error sending result
RpcResponse{requestId=8439869725098873668, response=[B@1bdcdf63} to
/10.1.0.90:40590; closing connection
java.nio.channels.ClosedChannelException
15/04/27 16:05:39 ERROR CoarseGrainedExecutorBackend: Driver
Disassociated [akka.tcp://sparkexecu...@master.spark.com:60802] -
[akka.tcp://sparkdri...@master.spark.com:37195] disassociated!
Shutting down.
15/04/27 16:05:39 WARN ReliableDeliverySupervisor: Association with
remote system [akka.tcp://sparkdri...@master.spark.com:37195] has
failed, address is now gated for [5000] ms. Reason is:
[Disassociated].


On Mon, Apr 27, 2015 at 8:35 AM, Shixiong Zhu zsxw...@gmail.com wrote:

 The configuration key should be spark.akka.askTimeout for this timeout.
 The time unit is seconds.

 Best Regards,
 Shixiong(Ryan) Zhu

 2015-04-26 15:15 GMT-07:00 Deepak Gopalakrishnan dgk...@gmail.com:

 Hello,


 Just to add a bit more context :

 I have done that in the code, but I cannot see it change from 30 seconds
 in the log.

 .set(spark.executor.memory, 10g)

 .set(spark.driver.memory, 20g)

 .set(spark.akka.timeout,6000)

 PS : I understand that 6000 is quite large, but I'm just trying to see if
 it actually changes


 Here is the command that I'm running

  sudo MASTER=spark://master.spark.com:7077
 /opt/spark/spark-1.3.0-bin-hadoop2.4/bin/spark-submit --class
 class-name   --executor-memory 20G --driver-memory 10G  --deploy-mode
 client --conf spark.akka.timeout=6000 --conf spark.akka.askTimeout=6000
 jar file path


 and here is how I load the file:
 JavaPairRDD<String, String> learningRdd = sc.wholeTextFiles(filePath, 10);
 Thanks

 On Mon, Apr 27, 2015 at 3:36 AM, Bryan Cutler cutl...@gmail.com wrote:

 I'm not sure what the expected performance should be for this amount of
 data, but you could try to increase the timeout with the property
 spark.akka.timeout to see if that helps.

 Bryan

 On Sun, Apr 26, 2015 at 6:57 AM, Deepak Gopalakrishnan dgk...@gmail.com
  wrote:

 Hello All,

 I'm trying to process a 3.5GB file on standalone mode using spark. I
 could run my spark job succesfully on a 100MB file and it works as
 expected. But, when I try to run it on the 3.5GB file, I run into the below
 error :


 15/04/26 12:45:50 INFO BlockManagerMaster: Updated info of block 
 taskresult_83
 15/04/26 12:46:46 WARN AkkaUtils: Error sending message [message = 
 Heartbeat(2,[Lscala.Tuple2;@790223d3,BlockManagerId(2, master.spark.com, 
 39143))] in 1 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
 15/04/26 12:47:15 INFO MemoryStore: ensureFreeSpace(26227673) called with 
 curMem=265897, maxMem=5556991426
 15/04/26 

Re: What is difference btw reduce fold?

2015-04-27 Thread keegan
Hi Q,

fold and reduce both aggregate over a collection by applying an operation you
specify; the major difference is the starting point of the aggregation. For
fold(), you have to specify the starting value, and for reduce() the starting
value is the first (or possibly an arbitrary) element in the collection.

Simple examples - we can sum the numbers in a collection using both
functions:
(1 until 10).reduce( (a,b) => a+b )
(1 until 10).fold(0)( (a,b) => a+b )

With fold, we want to start at 0 and cumulatively add each element. In this
case, the operation passed to fold() and reduce() were very similar, but it
is helpful to think about fold in the following way. For the operation we
pass to fold(), imagine its two arguments are (i) the current accumulated
value and (ii) the next value in the collection,

(1 until 10).fold(0)( (accumulated_so_far, next_value) => accumulated_so_far
+ next_value ).

So the result of the operation, accumulated_so_far + next_value, will be
passed to the operation again as the first argument, and so on. 

In this way, we could count the number of elements in a collection using
fold,

(1 until 10).fold(0)( (accumulated_so_far, next_value) => accumulated_so_far
+ 1 ).


When it comes to Spark, here’s another thing to keep in mind. For both
reduce and fold, you need to make sure your operation is both commutative
and associative. For RDDs, reduce and fold are implemented on each partition
separately, and then the results are combined using the operation.  With
fold, this could get you into trouble because an empty partition will emit
fold’s starting value, so the number of partitions might erroneously affect
the result of the calculation, if you’re not careful about the operation.
This would occur with the ( (a,b) => a+1 ) operation from above (see
http://stackoverflow.com/questions/29150202/pyspark-fold-method-output). 
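
A minimal Spark sketch of that pitfall, assuming an RDD built with 3 partitions on a local run:

val rdd = sc.parallelize(1 until 10, 3)
rdd.fold(0)( (a,b) => a+b )   // 45: sum is associative and commutative, so it is safe
rdd.fold(0)( (a,b) => a+1 )   // 3, not 9: fold runs once per partition and then once
                              // more to combine the partial results, so this
                              // non-associative op ends up counting the partitions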

Hope this helps. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-difference-btw-reduce-fold-tp22653p22671.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Getting error running MLlib example with new cluster

2015-04-27 Thread Xiangrui Meng
How did you run the example app? Did you use spark-submit? -Xiangrui

On Thu, Apr 23, 2015 at 2:27 PM, Su She suhsheka...@gmail.com wrote:
 Sorry, accidentally sent the last email before finishing.

 I had asked this question before, but wanted to ask again as I think
 it is now related to my pom file or project setup. Really appreciate the help!

 I have been trying on/off for the past month to try to run this MLlib
 example: 
 https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala

 I am able to build the project successfully. When I run it, it returns:

 features in spam: 8
 features in ham: 7

 and then freezes. According to the UI, the description of the job is
 count at DataValidators.scala.38. This corresponds to this line in
 the code:

 val model = lrLearner.run(trainingData)

 I've tried just about everything I can think of...changed numFeatures
 from 1 - 10,000, set executor memory to 1g, set up a new cluster, at
 this point I think I might have missed dependencies as that has
 usually been the problem in other spark apps I have tried to run. This
 is my pom file, that I have used for other successful spark apps.
 Please let me know if you think I need any additional dependencies or
 there are incompatibility issues, or a pom.xml that is better to use.
 Thank you!

 Cluster information:

 Spark version: 1.2.0-SNAPSHOT (in my older cluster it is 1.2.0)
 java version 1.7.0_25
 Scala version: 2.10.4
 hadoop version: hadoop 2.5.0-cdh5.3.3 (older cluster was 5.3.0)



 <project xmlns="http://maven.apache.org/POM/4.0.0"
     xmlns:xsi="http://w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
   <groupId>edu.berkely</groupId>
   <artifactId>simple-project</artifactId>
   <modelVersion>4.0.0</modelVersion>
   <name>Simple Project</name>
   <packaging>jar</packaging>
   <version>1.0</version>
   <repositories>
     <repository>
       <id>cloudera</id>
       <url>http://repository.cloudera.com/artifactory/cloudera-repos/</url>
     </repository>

     <repository>
       <id>scala-tools.org</id>
       <name>Scala-tools Maven2 Repository</name>
       <url>http://scala-tools.org/repo-releases</url>
     </repository>

   </repositories>

   <pluginRepositories>
     <pluginRepository>
       <id>scala-tools.org</id>
       <name>Scala-tools Maven2 Repository</name>
       <url>http://scala-tools.org/repo-releases</url>
     </pluginRepository>
   </pluginRepositories>

   <build>
     <plugins>
       <plugin>
         <groupId>org.scala-tools</groupId>
         <artifactId>maven-scala-plugin</artifactId>
         <executions>

           <execution>
             <id>compile</id>
             <goals>
               <goal>compile</goal>
             </goals>
             <phase>compile</phase>
           </execution>
           <execution>
             <id>test-compile</id>
             <goals>
               <goal>testCompile</goal>
             </goals>
             <phase>test-compile</phase>
           </execution>
           <execution>
             <phase>process-resources</phase>
             <goals>
               <goal>compile</goal>
             </goals>
           </execution>
         </executions>
       </plugin>
       <plugin>
         <artifactId>maven-compiler-plugin</artifactId>
         <configuration>
           <source>1.7</source>
           <target>1.7</target>
         </configuration>
       </plugin>
     </plugins>
   </build>


   <dependencies>
     <dependency> <!-- Spark dependency -->
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-core_2.10</artifactId>
       <version>1.2.0-cdh5.3.0</version>
     </dependency>

     <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
       <version>2.5.0-mr1-cdh5.3.0</version>
     </dependency>

     <dependency>
       <groupId>org.scala-lang</groupId>
       <artifactId>scala-library</artifactId>
       <version>2.10.4</version>
     </dependency>

     <dependency>
       <groupId>org.scala-lang</groupId>
       <artifactId>scala-compiler</artifactId>
       <version>2.10.4</version>
     </dependency>

     <dependency>
       <groupId>com.101tec</groupId>
       <artifactId>zkclient</artifactId>
       <version>0.3</version>
     </dependency>

  

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-27 Thread Xiangrui Meng
Could you try different ranks and see whether the task size changes?
We do use YtY in the closure, which should work the same as broadcast.
If that is the case, it should be safe to ignore this warning.
-Xiangrui

On Thu, Apr 23, 2015 at 4:52 AM, Christian S. Perone
christian.per...@gmail.com wrote:
 All these warnings come from ALS iterations, from flatMap and also from
 aggregate, for instance the origin of the state where the flatMap is showing
 these warnings (w/ Spark 1.3.0, they are also shown in Spark 1.3.1):

 org.apache.spark.rdd.RDD.flatMap(RDD.scala:296)
 org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1065)
 org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:530)
 org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
 scala.collection.immutable.Range.foreach(Range.scala:141)
 org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
 org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)

 And from the aggregate:

 org.apache.spark.rdd.RDD.aggregate(RDD.scala:968)
 org.apache.spark.ml.recommendation.ALS$.computeYtY(ALS.scala:1112)
 org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1064)
 org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:538)
 org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
 scala.collection.immutable.Range.foreach(Range.scala:141)
 org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
 org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)



 On Thu, Apr 23, 2015 at 2:49 AM, Xiangrui Meng men...@gmail.com wrote:

 This is the size of the serialized task closure. Is stage 246 part of
 ALS iterations, or something before or after it? -Xiangrui

 On Tue, Apr 21, 2015 at 10:36 AM, Christian S. Perone
 christian.per...@gmail.com wrote:
  Hi Sean, thanks for the answer. I tried to call repartition() on the
  input
  with many different sizes and it still continues to show that warning
  message.
 
  On Tue, Apr 21, 2015 at 7:05 AM, Sean Owen so...@cloudera.com wrote:
 
  I think maybe you need more partitions in your input, which might make
  for smaller tasks?
 
  On Tue, Apr 21, 2015 at 2:56 AM, Christian S. Perone
  christian.per...@gmail.com wrote:
   I keep seeing these warnings when using trainImplicit:
  
   WARN TaskSetManager: Stage 246 contains a task of very large size
   (208
   KB).
   The maximum recommended task size is 100 KB.
  
   And then the task size starts to increase. Is this a known issue ?
  
   Thanks !
  
   --
   Blog | Github | Twitter
   Forgive, O Lord, my little jokes on Thee, and I'll forgive Thy great
   big
   joke on me.
 
 
 
 
  --
  Blog | Github | Twitter
  Forgive, O Lord, my little jokes on Thee, and I'll forgive Thy great
  big
  joke on me.




 --
 Blog | Github | Twitter
 Forgive, O Lord, my little jokes on Thee, and I'll forgive Thy great big
 joke on me.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread Sean Owen
Works fine for me. Make sure you're not downloading the HTML
redirector page and thinking it's the archive.

On Mon, Apr 27, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
 I downloaded 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple mirrors
 and direct link. Each time i untar i get below error

 spark-1.3.1-bin-hadoop2.4/lib/spark-assembly-1.3.1-hadoop2.4.0.jar: (Empty
 error message)

 tar: Error exit delayed from previous errors


 Is it broken ?


 --
 Deepak


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: StandardScaler failing with OOM errors in PySpark

2015-04-27 Thread Xiangrui Meng
You might need to specify driver memory in spark-submit instead of
passing JVM options. spark-submit is designed to handle different
deployments correctly. -Xiangrui
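
A hedged sketch of what that looks like for a PySpark job in yarn-client mode; the script name and the 8g figure are placeholders:

# in client mode the driver JVM is already running by the time code-level
# settings are read, so driver memory has to be supplied at launch
spark-submit --master yarn-client \
  --driver-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=3072 \
  scale_features.py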

On Thu, Apr 23, 2015 at 4:58 AM, Rok Roskar rokros...@gmail.com wrote:
 ok yes, I think I have narrowed it down to being a problem with driver
 memory settings. It looks like the application master/driver is not being
 launched with the settings specified:

 For the driver process on the main node I see -XX:MaxPermSize=128m -Xms512m
 -Xmx512m as options used to start the JVM, even though I specified

 'spark.yarn.am.memory', '5g'
 'spark.yarn.am.memoryOverhead', '2000'

 The info shows that these options were read:

 15/04/23 13:47:47 INFO yarn.Client: Will allocate AM container, with 7120 MB
 memory including 2000 MB overhead

 Is there some reason why these options are being ignored and instead
 starting the driver with just 512Mb of heap?

 On Thu, Apr 23, 2015 at 8:06 AM, Rok Roskar rokros...@gmail.com wrote:

 the feature dimension is 800k.

 yes, I believe the driver memory is likely the problem since it doesn't
 crash until the very last part of the tree aggregation.

 I'm running it via pyspark through YARN -- I have to run in client mode so
 I can't set spark.driver.memory -- I've tried setting the
 spark.yarn.am.memory and overhead parameters but it doesn't seem to have an
 effect.

 Thanks,

 Rok

 On Apr 23, 2015, at 7:47 AM, Xiangrui Meng men...@gmail.com wrote:

  What is the feature dimension? Did you set the driver memory? -Xiangrui
 
  On Tue, Apr 21, 2015 at 6:59 AM, rok rokros...@gmail.com wrote:
  I'm trying to use the StandardScaler in pyspark on a relatively small
  (a few
  hundred Mb) dataset of sparse vectors with 800k features. The fit
  method of
  StandardScaler crashes with Java heap space or Direct buffer memory
  errors.
  There should be plenty of memory around -- 10 executors with 2 cores
  each
  and 8 Gb per core. I'm giving the executors 9g of memory and have also
  tried
  lots of overhead (3g), thinking it might be the array creation in the
  aggregators that's causing issues.
 
  The bizarre thing is that this isn't always reproducible -- sometimes
  it
  actually works without problems. Should I be setting up executors
  differently?
 
  Thanks,
 
  Rok
 
 
 
 
  --
  View this message in context:
  http://apache-spark-user-list.1001560.n3.nabble.com/StandardScaler-failing-with-OOM-errors-in-PySpark-tp22593.html
  Sent from the Apache Spark User List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Powered By Spark

2015-04-27 Thread Justin
Hi,

Would you mind adding our company to the Powered By Spark page?

organization name: atp
URL: https://atp.io
a list of which Spark components you are using: SparkSQL, MLLib, Databricks
Cloud
and a short description of your use case: Predictive models and
learning algorithms to improve the relevance of programmatic marketing


Thanks!
Justin




Justin Barton
CTO
+1 (718) 404 9272
+44 203 290 9272
atp.io | jus...@atp.io | find us https://atp.io/find-us


spark sql LEFT OUTER JOIN java.lang.ClassCastException

2015-04-27 Thread kiran mavatoor
Hi there,
I am using a Spark SQL left outer join query.
The SQL query is:
scala> val test = sqlContext.sql("SELECT e.departmentID FROM employee e LEFT OUTER JOIN department d ON d.departmentId = e.departmentId").toDF()
In Spark 1.3.1 it works fine, but the latest pull gives the below error:
15/04/27 23:02:49 ERROR Executor: Exception in task 4.0 in stage 67.0 (TID 118)
java.lang.ClassCastException
15/04/27 23:02:49 INFO TaskSetManager: Lost task 4.0 in stage 67.0 (TID 118) on executor localhost: java.lang.ClassCastException (null) [duplicate 1]
15/04/27 23:02:49 ERROR Executor: Exception in task 2.0 in stage 67.0 (TID 116)
java.lang.ClassCastException
15/04/27 23:02:49 INFO TaskSetManager: Lost task 2.0 in stage 67.0 (TID 116) on executor localhost: java.lang.ClassCastException (null) [duplicate 2]
15/04/27 23:02:49 ERROR Executor: Exception in task 3.0 in stage 67.0 (TID 117)
java.lang.ClassCastException
15/04/27 23:02:49 INFO TaskSetManager: Lost task 3.0 in stage 67.0 (TID 117) on executor localhost: java.lang.ClassCastException (null) [duplicate 3]
15/04/27 23:02:49 ERROR Executor: Exception in task 0.0 in stage 66.0 (TID 112)
java.lang.ClassCastException
15/04/27 23:02:49 INFO TaskSetManager: Lost task 0.0 in stage 66.0 (TID 112) on executor localhost: java.lang.ClassCastException (null) [duplicate 1]
15/04/27 23:02:49 INFO TaskSchedulerImpl: Removed TaskSet 66.0, whose tasks have all completed, from pool
15/04/27 23:02:49 ERROR Executor: Exception in task 5.0 in stage 67.0 (TID 119)
java.lang.ClassCastException
15/04/27 23:02:49 INFO TaskSetManager: Lost task 5.0 in stage 67.0 (TID 119) on executor localhost: java.lang.ClassCastException (null) [duplicate 4]
15/04/27 23:02:49 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID 114)
java.lang.ClassCastException
15/04/27 23:02:49 INFO TaskSetManager: Lost task 0.0 in stage 67.0 (TID 114) on executor localhost: java.lang.ClassCastException (null) [duplicate 5]
15/04/27 23:02:49 INFO TaskSchedulerImpl: Removed TaskSet 67.0, whose tasks have all completed, from pool

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 66.0 failed 1 times, most recent failure: Lost task 1.0 in stage 66.0 (TID 113, localhost): java.lang.ClassCastException
Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1241)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1232)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1231)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1231) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:705)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:705)
 at scala.Option.foreach(Option.scala:236) at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:705)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1424)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1385)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Thanks,
Kiran

Streaming app with windowing and persistence

2015-04-27 Thread Alexander Krasheninnikov

Hello, everyone.
I am developing a streaming application that works with window functions: each
window creates a table and performs some SQL operations on the extracted data.
I ran into the following problem: when using window operations together with
checkpointing, the application does not start the next time it is launched.

Here is the code:



final Duration batchDuration = Durations.seconds(10);
final Duration slideDuration = Durations.seconds(10);
final Duration windowDuration = Durations.seconds(600);

final SparkConf conf = new SparkConf();
conf.setAppName("Streaming");
conf.setMaster("local[4]");


JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
    @Override
    public JavaStreamingContext create() {
        JavaStreamingContext streamingContext = new JavaStreamingContext(conf, batchDuration);
        streamingContext.checkpoint(CHECKPOINT_DIR);
        return streamingContext;
    }
};

JavaStreamingContext streamingContext =
    JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, new Configuration(), contextFactory, true);
JavaDStream<String> lines = streamingContext.textFileStream(SOURCE_DIR);

lines.countByWindow(windowDuration, slideDuration).print();

streamingContext.start();
streamingContext.awaitTermination();



I expect that after an application restart, Spark will merge the old event
counter with the new values (if that is not the case, I am ready to merge the
old data manually).

But, after application restart, I have this error:
Exception in thread main org.apache.spark.SparkException: 
org.apache.spark.streaming.dstream.MappedDStream@49db6f23 has not been 
initialized
at 
org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)

at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)

at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218)
at 
org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89)
at 
org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
at 
org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
at 
org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:584)

at my.package.FileAggregations.main(FileAggregations.java:76)

The line at FileAggregations.java:76 is:

streamingContext.start();

Spark version is 1.3.0.
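
For reference, a minimal Scala sketch of the pattern the Spark Streaming
checkpointing documentation describes, where all of the DStream setup happens
inside the factory passed to getOrCreate so that it can be rebuilt from the
checkpoint on restart (the directory paths here are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedWindowCount {
  val checkpointDir = "/tmp/checkpoint"   // illustrative paths
  val sourceDir = "/tmp/source"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("Streaming").setMaster("local[4]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // All DStream operations are declared here, inside the factory,
    // so they can be reconstructed from the checkpoint after a restart.
    val lines = ssc.textFileStream(sourceDir)
    lines.countByWindow(Seconds(600), Seconds(10)).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}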

---
wbr, Alexandr Krasheninnikov


does randomSplit return a copy or a reference to the original rdd? [Python]

2015-04-27 Thread Pagliari, Roberto
Suppose I have something like the code below


for idx in xrange(0, 10):
    train_test_split = training.randomSplit(weights=[0.75, 0.25])
    train_cv = train_test_split[0]
    test_cv = train_test_split[1]
    # scale train_cv and test_cv


by scaling train_cv and test_cv, will the original data be affected?

Thanks,



Re: Getting error running MLlib example with new cluster

2015-04-27 Thread Su She
Hello Xiangrui,

I am using this spark-submit command (as I do for all other jobs):

/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit
--class MLlib --master local[2] --jars $(echo
/home/ec2-user/sparkApps/learning-spark/lib/*.jar | tr ' ' ',')
/home/ec2-user/sparkApps/learning-spark/target/simple-project-1.1.jar

Thank you for the help!

Best,

Su


On Mon, Apr 27, 2015 at 9:58 AM, Xiangrui Meng men...@gmail.com wrote:
 How did you run the example app? Did you use spark-submit? -Xiangrui

 On Thu, Apr 23, 2015 at 2:27 PM, Su She suhsheka...@gmail.com wrote:
 Sorry, accidentally sent the last email before finishing.

 I had asked this question before, but wanted to ask again as I think
 it is now related to my pom file or project setup. Really appreciate the 
 help!

 I have been trying on/off for the past month to try to run this MLlib
 example: 
 https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala

 I am able to build the project successfully. When I run it, it returns:

 features in spam: 8
 features in ham: 7

 and then freezes. According to the UI, the description of the job is
 count at DataValidators.scala:38. This corresponds to this line in
 the code:

 val model = lrLearner.run(trainingData)

 I've tried just about everything I can think of: changed numFeatures
 from 1 to 10,000, set executor memory to 1g, and set up a new cluster. At
 this point I think I might have missed dependencies, as that has
 usually been the problem in other Spark apps I have tried to run. This
 is my pom file, which I have used for other successful Spark apps.
 Please let me know if you think I need any additional dependencies,
 whether there are incompatibility issues, or whether there is a better
 pom.xml to use. Thank you!

 Cluster information:

 Spark version: 1.2.0-SNAPSHOT (in my older cluster it is 1.2.0)
 java version 1.7.0_25
 Scala version: 2.10.4
 hadoop version: hadoop 2.5.0-cdh5.3.3 (older cluster was 5.3.0)



 <project xmlns="http://maven.apache.org/POM/4.0.0"
          xmlns:xsi="http://w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
          http://maven.apache.org/maven-v4_0_0.xsd">
   <groupId>edu.berkely</groupId>
   <artifactId>simple-project</artifactId>
   <modelVersion>4.0.0</modelVersion>
   <name>Simple Project</name>
   <packaging>jar</packaging>
   <version>1.0</version>

   <repositories>
     <repository>
       <id>cloudera</id>
       <url>http://repository.cloudera.com/artifactory/cloudera-repos/</url>
     </repository>
     <repository>
       <id>scala-tools.org</id>
       <name>Scala-tools Maven2 Repository</name>
       <url>http://scala-tools.org/repo-releases</url>
     </repository>
   </repositories>

   <pluginRepositories>
     <pluginRepository>
       <id>scala-tools.org</id>
       <name>Scala-tools Maven2 Repository</name>
       <url>http://scala-tools.org/repo-releases</url>
     </pluginRepository>
   </pluginRepositories>

   <build>
     <plugins>
       <plugin>
         <groupId>org.scala-tools</groupId>
         <artifactId>maven-scala-plugin</artifactId>
         <executions>
           <execution>
             <id>compile</id>
             <goals>
               <goal>compile</goal>
             </goals>
             <phase>compile</phase>
           </execution>
           <execution>
             <id>test-compile</id>
             <goals>
               <goal>testCompile</goal>
             </goals>
             <phase>test-compile</phase>
           </execution>
           <execution>
             <phase>process-resources</phase>
             <goals>
               <goal>compile</goal>
             </goals>
           </execution>
         </executions>
       </plugin>
       <plugin>
         <artifactId>maven-compiler-plugin</artifactId>
         <configuration>
           <source>1.7</source>
           <target>1.7</target>
         </configuration>
       </plugin>
     </plugins>
   </build>

   <dependencies>
     <dependency> <!-- Spark dependency -->
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-core_2.10</artifactId>
       <version>1.2.0-cdh5.3.0</version>
     </dependency>

     <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
       <version>2.5.0-mr1-cdh5.3.0</version>
     </dependency>

 

Spark Job fails with 6 executors and succeeds with 8 ?

2015-04-27 Thread ๏̯͡๏
I have this Spark app and it fails when I run it with 6 executors but succeeds
with 8.

Any suggestions?

Command:

 ./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar
--jars
/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar,/home/dvasthimal/spark1.3/spark_reporting_dep_only-1.0-SNAPSHOT.jar
--num-executors 6 --driver-memory 12g --driver-java-options
-XX:MaxPermSize=8G --executor-memory 12g --executor-cores 6 --queue
hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp
/home/dvasthimal/spark1.3/spark_reporting-1.0-SNAPSHOT.jar
startDate=2015-04-6 endDate=2015-04-7
input=/user/dvasthimal/epdatasets_small/exptsession subcommand=viewItem
output=/user/dvasthimal/epdatasets/viewItem buffersize=128
maxbuffersize=1068 maxResultSize=2G

Input Data Sets Size:

-sh-4.1$ hadoop fs -ls XX/YY/part-r-0

 2663019338 bytes

-sh-4.1$ hadoop fs -ls /AA/BB/part-r-0

 2688348022  bytes

-sh-4.1$ hadoop fs -ls /FOO/BAAR/ch196out83-r-0.avro

 1274065689 bytes

-sh-4.1$

Any thoughts?
Exception:
15/04/27 22:12:46 INFO Configuration.deprecation:
mapred.output.compression.codec is deprecated. Instead, use
mapreduce.output.fileoutputformat.compress.codec
15/04/27 22:12:47 ERROR executor.Executor: Exception in task 8.0 in stage
4.0 (TID 36)
scala.reflect.internal.Symbols$CyclicReference: illegal cyclic reference
involving object SchemaUtil
at
scala.reflect.internal.Symbols$Symbol$$anonfun$info$3.apply(Symbols.scala:1220)
at
scala.reflect.internal.Symbols$Symbol$$anonfun$info$3.apply(Symbols.scala:1218)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.reflect.internal.Symbols$Symbol.lock(Symbols.scala:482)
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1218)
at scala.reflect.internal.Symbols$Symbol.initialize(Symbols.scala:1374)
at scala.reflect.internal.Symbols$Symbol.hasFlag(Symbols.scala:607)
at scala.reflect.internal.Symbols$TermSymbol.isTermMacro(Symbols.scala:2453)
at scala.reflect.internal.Symbols$Symbol.symbolKind(Symbols.scala:2263)
at
scala.reflect.internal.Symbols$Symbol.sanitizedKindString(Symbols.scala:2297)
at scala.reflect.internal.Symbols$Symbol.kindString(Symbols.scala:2305)
at scala.reflect.internal.Symbols$Symbol.toString(Symbols.scala:2350)
at java.lang.String.valueOf(String.java:2847)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
at
scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:327)
at scala.collection.immutable.List.foreach(List.scala:318)
at
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
at scala.collection.AbstractTraversable.addString(Traversable.scala:105)
at
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
at
scala.collection.TraversableLike$class.toString(TraversableLike.scala:639)
at scala.collection.SeqLike$class.toString(SeqLike.scala:646)
at scala.collection.AbstractSeq.toString(Seq.scala:40)
at java.lang.String.valueOf(String.java:2847)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1678)
at
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:44)
at
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
at
scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at
com.ebay.ep.poc.spark.reporting.process.util.SchemaUtil$$typecreator3$1.apply(SchemaUtil.scala:108)
at
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
at scala.reflect.api.Universe.typeOf(Universe.scala:59)
at
com.ebay.ep.poc.spark.reporting.process.util.SchemaUtil$.com$ebay$ep$poc$spark$reporting$process$util$SchemaUtil$$getField(SchemaUtil.scala:108)
at
com.ebay.ep.poc.spark.reporting.process.util.SchemaUtil$$anonfun$6$$anonfun$apply$4.apply(SchemaUtil.scala:94)
at
com.ebay.ep.poc.spark.reporting.process.util.SchemaUtil$$anonfun$6$$anonfun$apply$4.apply(SchemaUtil.scala:90)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.reflect.internal.Scopes$Scope.foreach(Scopes.scala:315)
at 

Serialization error

2015-04-27 Thread madhvi

Hi,

While connecting to Accumulo through Spark by creating a Spark RDD, I am
getting the following error:

 object not serializable (class: org.apache.accumulo.core.data.Key)

This is due to Accumulo's Key class, which does not implement the
Serializable interface. How can this be solved so that Accumulo can be used
with Spark?
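
One common workaround is to map the non-serializable Key/Value pairs into
plain serializable types right after reading them, before anything needs to
ship them across the cluster. A minimal Scala sketch, assuming the new-API
AccumuloInputFormat and with the connection setup elided (all names here are
illustrative):

import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object AccumuloToSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AccumuloRead").setMaster("local[2]"))
    val job = Job.getInstance()
    // ... Accumulo connector, table and authorizations configured on `job` here ...

    val raw = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[AccumuloInputFormat], classOf[Key], classOf[Value])

    // Convert Key/Value into serializable Strings right away; after this map,
    // no Accumulo class ever needs to be serialized by Spark.
    val rows = raw.map { case (k, v) => (k.getRow.toString, v.toString) }

    println(rows.count())
    sc.stop()
  }
}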


Thanks
Madhvi

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unable to work with foreachrdd

2015-04-27 Thread drarse
Can you share the code? Maybe the issue is in the foreachRDD body. :)
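
In case it helps while waiting for the code, here is a minimal self-contained
foreachRDD sketch in Scala (the socket source, port, and names are
illustrative, not taken from the original app). The body passed to foreachRDD
runs on the driver once per batch, and only the RDD actions inside it trigger
distributed work:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRddExample {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("ForeachRddExample").setMaster("local[2]"),
      Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD executes this block once per batch; count() is the action
    // that actually forces the batch to be processed.
    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) println(s"batch size = ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}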

On Tuesday, April 28, 2015, CH.KMVPRASAD [via Apache Spark User List] 
ml-node+s1001560n22681...@n3.nabble.com wrote:

  When I run my Spark Streaming application, the print method is printing
  results, but when I use foreachRDD on that DStream object it is not working.
  Why? What is the reason?

 please help me!




-- 
Regards, Sergio Jiménez




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-work-with-foreachrdd-tp22681p22682.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: gridsearch - python

2015-04-27 Thread Xiangrui Meng
We will try to make them available in 1.4, which is coming soon. -Xiangrui

On Thu, Apr 23, 2015 at 10:18 PM, Pagliari, Roberto
rpagli...@appcomsci.com wrote:
 I know grid search with cross validation is not supported. However, I was
 wondering if there is something available for the time being.



 Thanks,





 From: Punyashloka Biswal [mailto:punya.bis...@gmail.com]
 Sent: Thursday, April 23, 2015 9:06 PM
 To: Pagliari, Roberto; user@spark.apache.org
 Subject: Re: gridsearch - python



 https://issues.apache.org/jira/browse/SPARK-7022.

 Punya



 On Thu, Apr 23, 2015 at 5:47 PM Pagliari, Roberto rpagli...@appcomsci.com
 wrote:

 Can anybody point me to an example, if available, about gridsearch with
 python?



 Thank you,



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark on Mesos

2015-04-27 Thread Stephen Carman
So I installed Spark 1.3.1 built with Hadoop 2.6 on each of the slaves; I 
basically just got the pre-built package from the Spark website…

I placed those compiled Spark installs on each slave at /opt/spark.

My Spark properties seem to be getting picked up on my side fine…

[inline screenshot of the Spark properties omitted]

The framework is registered in Mesos and shows up just fine; it doesn't matter 
whether I turn off the executor URI or not, but I always get the same error…

org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in 
stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 
23, 10.253.1.117): ExecutorLostFailure (executor 
20150424-104711-1375862026-5050-20113-S1 lost)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

These boxes are totally open to one another, so they shouldn't have any firewall 
issues. Everything seems to show up in Mesos and Spark just fine, but actually 
running anything totally blows up.

There is nothing in stderr or stdout; the executor downloads the package and 
untars it but doesn't seem to do much after that. Any insights?

Steve


On Apr 24, 2015, at 5:50 PM, Yang Lei genia...@gmail.com wrote:

SPARK_PUBLIC_DNS, SPARK_LOCAL_IP, SPARK_LOCAL_HOST



Group by order by

2015-04-27 Thread Ulanov, Alexander
Hi,

Could you suggest what is the best way to do "group by x order by y" in Spark?

When I try to perform it with Spark SQL I get the following error (Spark 1.3):

val results = sqlContext.sql("select * from sample group by id order by time")
org.apache.spark.sql.AnalysisException: expression 'time' is neither present in 
the group by, nor is it an aggregate function. Add to group by or wrap in 
first() if you don't care which value you get.;
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:37)

Is there a way to do it with just RDD?

Best regards, Alexander


Re: Group by order by

2015-04-27 Thread Richard Marscher
Hi,

that error seems to indicate the basic query is not properly expressed. If
you group by just ID, then that means it would need to aggregate all the
time values into one value per ID, so you can't sort by it. Thus it tries
to suggest an aggregate function for time so you can have 1 value per ID
and properly sort it.

On Mon, Apr 27, 2015 at 3:07 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

  Hi,



 Could you suggest what is the best way to do “group by x order by y” in
 Spark?



 When I try to perform it with Spark SQL I get the following error (Spark
 1.3):



 val results = sqlContext.sql(select * from sample group by id order by
 time)

 org.apache.spark.sql.AnalysisException: expression 'time' is neither
 present in the group by, nor is it an aggregate function. Add to group by
 or wrap in first() if you don't care which value you get.;

 at
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:37)



 Is there a way to do it with just RDD?



 Best regards, Alexander



RE: Group by order by

2015-04-27 Thread Ulanov, Alexander
Hi Richard,

There are several values of time per id. Is there a way to perform group by id 
and sort by time in Spark?
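
A minimal RDD-level sketch of one common approach in Scala (the record type
and field names are illustrative, and note that each group must fit in memory
on a single executor):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type standing in for the rows of the "sample" table.
case class Sample(id: String, time: Long, value: Double)

object GroupByIdSortByTime {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupSort").setMaster("local[2]"))
    val samples = sc.parallelize(Seq(
      Sample("a", 3L, 1.0), Sample("a", 1L, 2.0), Sample("b", 2L, 3.0)))

    // Group all rows by id, then sort each group's rows by time in memory.
    val grouped = samples
      .groupBy(_.id)
      .mapValues(rows => rows.toSeq.sortBy(_.time))

    grouped.collect().foreach(println)
    sc.stop()
  }
}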

Best regards, Alexander

From: Richard Marscher [mailto:rmarsc...@localytics.com]
Sent: Monday, April 27, 2015 12:20 PM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject: Re: Group by order by

Hi,

that error seems to indicate the basic query is not properly expressed. If you 
group by just ID, then that means it would need to aggregate all the time 
values into one value per ID, so you can't sort by it. Thus it tries to suggest 
an aggregate function for time so you can have 1 value per ID and properly sort 
it.

On Mon, Apr 27, 2015 at 3:07 PM, Ulanov, Alexander 
alexander.ula...@hp.com wrote:
Hi,

Could you suggest what is the best way to do “group by x order by y” in Spark?

When I try to perform it with Spark SQL I get the following error (Spark 1.3):

val results = sqlContext.sql(select * from sample group by id order by time)
org.apache.spark.sql.AnalysisException: expression 'time' is neither present in 
the group by, nor is it an aggregate function. Add to group by or wrap in 
first() if you don't care which value you get.;
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:37)

Is there a way to do it with just RDD?

Best regards, Alexander



Understanding Spark's caching

2015-04-27 Thread Eran Medan
Hi Everyone!

I'm trying to understand how Spark's cache works.

Here is my naive understanding, please let me know if I'm missing something:

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.saveAsTextFile(...)
rdd3.saveAsTextFile(...)

In the above, rdd1 will be loaded from disk (e.g. HDFS) only once (when
rdd2 is saved, I assume) and then served from the cache (assuming there is
enough RAM) when rdd3 is saved.

Now here is my question. Let's say I want to cache rdd2 and rdd3 as they
will both be used later on, but I don't need rdd1 after creating them.

Basically there is duplication, isn't there? Since once rdd2 and rdd3 are
calculated I don't need rdd1 anymore, I should probably unpersist it,
right? The question is when.

*Will this work? (Option A)*

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd1.unpersist()

Does Spark add the unpersist call to the DAG, or is it applied immediately? If
it is applied immediately, then rdd1 will effectively not be cached when I
compute rdd2 and rdd3, right?

*Should I do it this way instead (Option B)?*

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)

rdd2.cache()
rdd3.cache()

rdd2.saveAsTextFile(...)
rdd3.saveAsTextFile(...)

rdd1.unpersist()

*So the question is this:* Is Option A good enough? That is, will rdd1 still
read the file only once, or do I need to go with Option B?

(see also
http://stackoverflow.com/questions/29903675/understanding-sparks-caching)
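
For what it's worth, a minimal runnable sketch of Option B, with comments
reflecting the RDD.unpersist API documentation (it marks the RDD as
non-persistent and removes its blocks when called, rather than being recorded
in the lineage); the paths and transformations are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CachingSketch").setMaster("local[2]"))
    val rdd1 = sc.textFile("/tmp/some-data.txt")   // illustrative path
    rdd1.cache()

    val rdd2 = rdd1.filter(_.nonEmpty)
    val rdd3 = rdd1.map(_.toUpperCase)
    rdd2.cache()
    rdd3.cache()

    // Run the actions first: the first save materializes and caches rdd1,
    // the second reuses the cached rdd1, and both leave rdd2/rdd3 cached.
    rdd2.saveAsTextFile("/tmp/out-filtered")
    rdd3.saveAsTextFile("/tmp/out-mapped")

    // unpersist takes effect when called (it is not deferred into the DAG),
    // so calling it only after the actions keeps the single-read behaviour.
    rdd1.unpersist()

    sc.stop()
  }
}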

Thanks in advance