Re: collect on hadoopFile RDD returns wrong results

2014-09-16 Thread vasiliy
It also appears with the streaming HDFS fileStream.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/collect-on-hadoopFile-RDD-returns-wrong-results-tp14368p14425.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-16 Thread Davies Liu
PipelinedRDD is an RDD generated by a Python mapper/reducer; for example,
rdd.map(func) will return a PipelinedRDD.

PipelinedRDD is a subclass of RDD, so it should have all the APIs which
RDD has.

>>> sc.parallelize(range(10)).map(lambda x: (x, str(x))).sortByKey().count()
10
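
For the "top 10 popular words" part of the original question, a minimal
PySpark sketch would look something like this (the input path and the
whitespace tokenization are only placeholders):

from operator import add

# Word count, then "top 10 by count" via sortByKey on swapped pairs.
counts = (sc.textFile("hdfs:///path/to/input.txt")   # hypothetical path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))

top10 = (counts.map(lambda wc: (wc[1], wc[0]))        # (count, word)
               .sortByKey(ascending=False)
               .take(10))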

I'm wondering how you triggered this error.

Davies

On Tue, Sep 16, 2014 at 10:03 PM, edmond_huo  wrote:
> Hi,
>
> I am new to Spark. I tried to run a job like the wordcount example in
> Python. But when I tried to get the top 10 most popular words in the file, I
> got the message: AttributeError: 'PipelinedRDD' object has no attribute
> 'sortByKey'.
>
> So my question is: what is the difference between PipelinedRDD and RDD? And
> if I want to sort the data in a PipelinedRDD, how can I do it?
>
> Thanks
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/The-difference-between-pyspark-rdd-PipelinedRDD-and-pyspark-rdd-RDD-tp14421.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>




Re: CPU RAM

2014-09-16 Thread Akhil Das
Ganglia does give you cluster-wide and per-machine utilization of
resources, but I don't think it gives it to you per Spark job. If you want to
build something from scratch, then you can proceed like this:

1. Log in to the machine
2. Get the PIDs
3. For network IO per process, you can have a look at
http://nethogs.sourceforge.net/
4. You can make use of the information in /proc/[pid]/stat and /proc/stat
to estimate CPU usage and so on


Similarly, you can get any metric of a process once you have the PID.
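
As a rough illustration of point 4 above -- sketched in Python rather than
Java, and assuming a Linux /proc filesystem and an already-known PID -- the
per-process CPU share can be estimated from two samples:

import time

def process_ticks(pid):
    # utime + stime are fields 14 and 15 (1-based) of /proc/[pid]/stat
    with open("/proc/%d/stat" % pid) as f:
        fields = f.read().split()
    return int(fields[13]) + int(fields[14])

def total_ticks():
    # First line of /proc/stat is the aggregate "cpu" line
    with open("/proc/stat") as f:
        return sum(int(x) for x in f.readline().split()[1:])

def cpu_percent(pid, interval=1.0):
    p1, t1 = process_ticks(pid), total_ticks()
    time.sleep(interval)
    p2, t2 = process_ticks(pid), total_ticks()
    return 100.0 * (p2 - p1) / max(t2 - t1, 1)

print(cpu_percent(12345))   # 12345 is a placeholder executor PID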


Thanks
Best Regards

On Wed, Sep 17, 2014 at 8:59 AM, VJ Shalish  wrote:

> Sorry for the confusion Team.
> My requirement is to measure the CPU utilisation, RAM usage, network IO,
> and other metrics of a Spark job using a Java program.
> Please help on the same.
>
> On Tue, Sep 16, 2014 at 11:23 PM, Amit  wrote:
>
>> Not particularly related to Spark, but you can check out the SIGAR API. It
>> lets you get CPU, memory, network, filesystem and process-based metrics.
>>
>> Amit
>> On Sep 16, 2014, at 20:14, VJ Shalish  wrote:
>>
>> > Hi
>> >
>> > I need to get the CPU utilisation, RAM usage, network IO and other
>> metrics using a Java program. Can anyone help me with this?
>> >
>> > Thanks
>> > Shalish.
>>
>
>


permission denied on local dir

2014-09-16 Thread style95
I am running Spark on a shared YARN cluster.
My user ID is "online", but I found that when I run my Spark application,
the local directories are created under the "yarn" user ID.
As a result I am unable to delete the local directories, and the application finally fails.
Please refer to my log below:

14/09/16 21:59:02 ERROR DiskBlockManager: Exception while deleting local spark dir: /hadoop02/hadoop/yarn/local/usercache/online/appcache/application_1410795082830_3994/spark-local-20140916215842-6fe7
java.io.IOException: Failed to list files for dir: /hadoop02/hadoop/yarn/local/usercache/online/appcache/application_1410795082830_3994/spark-local-20140916215842-6fe7/3a
at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:580)
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:592)
at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:593)
at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:592)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:592)
at org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:163)
at org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:160)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:160)
at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:153)
at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:151)
at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:151)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:151)


I am unable to access
"/hadoop02/hadoop/yarn/local/usercache/online/appcache/application_1410795082830_3994/spark-local-20140916215842-6fe7".
For example, "ls
/hadoop02/hadoop/yarn/local/usercache/online/appcache/application_1410795082830_3994/spark-local-20140916215842-6fe7"
does not work; it fails with "permission denied".

I am using Spark 1.0.0 and YARN 2.4.0.

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/permission-denied-on-local-dir-tp14422.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-16 Thread edmond_huo
Hi,

I am new to Spark. I tried to run a job like the wordcount example in
Python. But when I tried to get the top 10 most popular words in the file, I
got the message: AttributeError: 'PipelinedRDD' object has no attribute
'sortByKey'.

So my question is: what is the difference between PipelinedRDD and RDD? And
if I want to sort the data in a PipelinedRDD, how can I do it?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/The-difference-between-pyspark-rdd-PipelinedRDD-and-pyspark-rdd-RDD-tp14421.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




YARN mode not available error

2014-09-16 Thread Barrington
Hi,

I am running Spark in cluster mode with Hadoop YARN as the underlying
cluster manager. I get this error when trying to initialize the
SparkContext. 


Exception in thread "main" org.apache.spark.SparkException: YARN mode not available ?
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1586)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:86)
at LascoScript$.main(LascoScript.scala:24)
at LascoScript.main(LascoScript.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.cluster.YarnClientClusterScheduler
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1580)




My build.sbt file  looks like this:



name := "LascoScript"

version := "1.0"

scalaVersion := "2.10.4"

val excludeJBossNetty = ExclusionRule(organization = "org.jboss.netty")
val excludeMortbayJetty = ExclusionRule(organization = "org.eclipse.jetty", artifact = "jetty-server")
val excludeAsm = ExclusionRule(organization = "org.ow2.asm")
val excludeCommonsLogging = ExclusionRule(organization = "commons-logging")
val excludeSLF4J = ExclusionRule(organization = "org.slf4j")
val excludeOldAsm = ExclusionRule(organization = "asm")
val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api")


libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" excludeAll(
  excludeServletApi, excludeMortbayJetty
)

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.5.1" excludeAll(
  excludeJBossNetty, excludeMortbayJetty, excludeAsm, excludeCommonsLogging,
  excludeSLF4J, excludeOldAsm, excludeServletApi
)

libraryDependencies += "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"

libraryDependencies += "org.eclipse.jetty" % "jetty-server" % "8.1.16.v20140903"

unmanagedJars in Compile ++= {
  val base = baseDirectory.value
  val baseDirectories = (base / "lib") +++ (base)
  val customJars = (baseDirectories ** "*.jar")
  customJars.classpath
}

resolvers += "Akka Repository" at "http://repo.akka.io/releases/“



How can I fix this issue?

- Barrington



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/YARN-mode-not-available-error-tp14420.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: CPU RAM

2014-09-16 Thread VJ Shalish
Sorry for the confusion, team.
My requirement is to measure the CPU utilisation, RAM usage, network IO and
other metrics of a Spark job using a Java program.
Please help with the same.

On Tue, Sep 16, 2014 at 11:23 PM, Amit  wrote:

> Not particularly related to Spark, but you can check out the SIGAR API. It
> lets you get CPU, memory, network, filesystem and process-based metrics.
>
> Amit
> On Sep 16, 2014, at 20:14, VJ Shalish  wrote:
>
> > Hi
> >
> > I need to get the CPU utilisation, RAM usage, network IO and other
> metrics using a Java program. Can anyone help me with this?
> >
> > Thanks
> > Shalish.
>


Re: CPU RAM

2014-09-16 Thread VJ Shalish
Thank you for the response, Amit.
So does that mean we cannot measure the CPU consumption and RAM usage of a
Spark job through a Java program?

On Tue, Sep 16, 2014 at 11:23 PM, Amit  wrote:

> Not particularly related to Spark, but you can check out the SIGAR API. It
> lets you get CPU, memory, network, filesystem and process-based metrics.
>
> Amit
> On Sep 16, 2014, at 20:14, VJ Shalish  wrote:
>
> > Hi
> >
> > I need to get the CPU utilisation, RAM usage, network IO and other
> metrics using a Java program. Can anyone help me with this?
> >
> > Thanks
> > Shalish.
>


Re: CPU RAM

2014-09-16 Thread Amit
Not particularly related to Spark, but you can check out the SIGAR API. It lets
you get CPU, memory, network, filesystem and process-based metrics.

Amit
On Sep 16, 2014, at 20:14, VJ Shalish  wrote:

> Hi
>  
> I need to get the CPU utilisation, RAM usage, network IO and other metrics
> using a Java program. Can anyone help me with this?
>  
> Thanks
> Shalish.




Re: Unable to ship external Python libraries in PYSPARK

2014-09-16 Thread Davies Liu
Yes, sc.addFile() is what you want:

 |  addFile(self, path)
 |  Add a file to be downloaded with this Spark job on every node.
 |  The C{path} passed can be either a local file, a file in HDFS
 |  (or other Hadoop-supported filesystems), or an HTTP, HTTPS or
 |  FTP URI.
 |
 |  To access the file in Spark jobs, use
 |  L{SparkFiles.get(fileName)} with the
 |  filename to find its download location.
 |
 |  >>> from pyspark import SparkFiles
 |  >>> path = os.path.join(tempdir, "test.txt")
 |  >>> with open(path, "w") as testFile:
 |  ...    testFile.write("100")
 |  >>> sc.addFile(path)
 |  >>> def func(iterator):
 |  ...    with open(SparkFiles.get("test.txt")) as testFile:
 |  ...        fileVal = int(testFile.readline())
 |  ...        return [x * fileVal for x in iterator]
 |  >>> sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
 |  [100, 200, 300, 400]

On Tue, Sep 16, 2014 at 7:02 PM, daijia  wrote:
> Is there some way to ship a text file, just like shipping Python libraries?
>
> Thanks in advance
> Daijia
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074p14412.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>




CPU RAM

2014-09-16 Thread VJ Shalish
Hi

I need to get the CPU utilisation, RAM usage, network IO and other metrics
using a Java program. Can anyone help me with this?

Thanks
Shalish.


Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.

Hi Sandy:

Thank you. I have not tried that mechanism (I wasn't aware of it). I will
try that instead.


Is it possible to also represent '--driver-memory' and '--executor-memory'
(and basically all properties) using the '--conf' directive?

The reason: I actually discovered the issue below while writing a custom
PYTHONSTARTUP script that I use to launch *bpython* or *python* or my *WING
python IDE* with. That script reads a python *dict* (from a file) containing
key/value pairs, from which it constructs the "--driver-java-options ..."
string; I will now switch to generating '--conf key1=val1 --conf key2=val2
--conf key3=val3' (and so on) instead.


If all of the properties could be represented in this way, then it makes the
code cleaner (all in the dict file, and no one-offs).
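
For what it's worth, the flag-building part of that startup script boils down
to something like the sketch below (the file name and keys are only examples):

import json, os

with open("spark_props.json") as f:      # hypothetical dict file
    props = json.load(f)                 # e.g. {"spark.ui.port": "8468", ...}

conf_flags = " ".join("--conf %s=%s" % (k, v) for k, v in sorted(props.items()))
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master yarn-client " + conf_flags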

Either way, thank you. =:)

Noel,
team didata


On 09/16/2014 08:03 PM, Sandy Ryza wrote:

Hi team didata,

This doesn't directly answer your question, but with Spark 1.1,
instead of using the driver options, it's better to pass your Spark
properties using the "conf" option.


E.g.
pyspark --master yarn-client --conf spark.shuffle.spill=true --conf 
spark.yarn.executor.memoryOverhead=512M


Additionally, driver and executor memory have dedicated options:

pyspark --master yarn-client --conf spark.shuffle.spill=true --conf 
spark.yarn.executor.memoryOverhead=512M --driver-memory 3G 
--executor-memory 5G


-Sandy


On Tue, Sep 16, 2014 at 6:22 PM, Dimension Data, LLC.
<subscripti...@didata.us> wrote:




Hello friends:

Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN
distribution. Everything went fine, and everything seems
to work, but for the following.

Following are two invocations of the 'pyspark' script, one with
enclosing quotes around the options passed to
'--driver-java-options', and one without them. I added the
following one-line in the 'pyspark' script to
show my problem...

ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx" # Added after the line
that exports this variable.

=

FIRST:
[ without enclosing quotes ]:

user@linux$ pyspark --master yarn-client --driver-java-options
-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3

-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
xxx --master yarn-client --driver-java-options
-Dspark.executor.memory=1Gxxx  <--- echo statement show option
truncation.

While this succeeds in getting to a pyspark shell prompt (sc), the
context isn't setup properly because, as seen
in red above and below, all but the first option took effect.
(Note spark.executor.memory is correct but that's only because
my spark defaults coincide with it.)

14/09/16 17:35:32 INFO yarn.Client:   command: $JAVA_HOME/bin/java
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
'-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89'
'-Dspark.serializer.objectStreamReset=100'
'-Dspark.executor.memory=1G' '-Dspark.rdd.compress=True'
'-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles='
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
'-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress='
'-Dspark.app.name =PySparkShell'
'-Dspark.driver.appUIAddress=dstorm:4040'
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G'
'-Dspark.fileserver.uri=http://192.168.0.16:60305'
'-Dspark.driver.port=44616' '-Dspark.master=yarn-client'
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused'
--jar  null  --arg 'dstorm:44616' --executor-memory 1024
--executor-cores 1 --num-executors 2 1> /stdout 2>
/stderr

(Note: I happen to notice that 'spark.driver.memory' is missing as
well).

===

NEXT:

[ So let's try with enclosing quotes ]
user@linux$ pyspark --master yarn-client --driver-java-options
'-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3

-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
xxx --master yarn-client --driver-java-options
"-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3

-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"xxx

While this does have all the options (shown in the red echo output
above and the command executed below), pyspark invocation fails,
indicating
that the application ended before I got to a shell prompt.
S

Re: how to report documentation bug?

2014-09-16 Thread Nicholas Chammas
You can send an email like you just did or open an issue in the Spark issue
tracker. This looks like a problem with how the version is generated in this
file.

On Tue, Sep 16, 2014 at 8:55 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

>
>
> http://spark.apache.org/docs/latest/quick-start.html#standalone-applications
>
> Click on the Java tab. There is a bug in the Maven section:
>
>   <version>1.1.0-SNAPSHOT</version>
>
> Should be:
>
>   <version>1.1.0</version>
>
> Hope this helps
>
> Andy
>


Re: Unable to ship external Python libraries in PYSPARK

2014-09-16 Thread daijia
Is there some way to ship a text file, just like shipping Python libraries?

Thanks in advance
Daijia



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074p14412.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread Cheng, Hao
Thank you, Yin Huai. This is probably true.

I saw in hive-site.xml that Liu has changed the entry below; its default should
be false.

  <property>
    <name>hive.support.concurrency</name>
    <description>Enable Hive's Table Lock Manager Service</description>
    <value>true</value>
  </property>

Someone is working on upgrading Hive to 0.13 for Spark SQL
(https://github.com/apache/spark/pull/2241); not sure if you can wait for this. ☺

From: Yin Huai [mailto:huaiyin@gmail.com]
Sent: Wednesday, September 17, 2014 1:50 AM
To: Cheng, Hao
Cc: linkpatrickliu; u...@spark.incubator.apache.org
Subject: Re: SparkSQL 1.1 hang when "DROP" or "LOAD"

I meant it may be a Hive bug since we also call Hive's drop table internally.

On Tue, Sep 16, 2014 at 1:44 PM, Yin Huai
<huaiyin@gmail.com> wrote:
Seems https://issues.apache.org/jira/browse/HIVE-5474 is related?

On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao
<hao.ch...@intel.com> wrote:
Thank you for pasting the steps. I will look at this and hopefully come up with a
solution soon.

-Original Message-
From: linkpatrickliu 
[mailto:linkpatrick...@live.com]
Sent: Tuesday, September 16, 2014 3:17 PM
To: u...@spark.incubator.apache.org
Subject: RE: SparkSQL 1.1 hang when "DROP" or "LOAD"
Hi, Hao Cheng.

I have done other tests. And the result shows the thriftServer can connect to 
Zookeeper.

However, I found some more interesting things. And I think I have found a bug!

Test procedure:
Test1:
(0) Use beeline to connect to thriftServer.
(1) Switch database "use dw_op1"; (OK)
The logs show that the thriftServer connected with Zookeeper and acquired locks.

(2) Drop table: "drop table src;" (Blocked). The logs show that the thriftServer
is in "acquireReadWriteLocks".

My doubt:
The reason I cannot drop table src is that the first SQL statement, "use dw_op1",
left locks in Zookeeper that were never successfully released.
So when the second SQL statement tries to acquire locks in Zookeeper, it blocks.

Test2:
Restart thriftServer.
Instead of switching to another database, I just drop the table in the default 
database;
(0) Restart thriftServer & use beeline to connect to thriftServer.
(1) Drop table "drop table src"; (OK)
Amazing! Succeed!
(2) Drop again: "drop table src2;" (Blocked). Same error: the thriftServer is
blocked in the "acquireReadWriteLocks"
phase.

As you can see, only the first SQL statement that requires locks can succeed.
So I think the reason is that the thriftServer cannot release locks correctly 
in Zookeeper.









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.





how to report documentation bug?

2014-09-16 Thread Andy Davidson

http://spark.apache.org/docs/latest/quick-start.html#standalone-applications

Click on the Java tab. There is a bug in the Maven section:

  <version>1.1.0-SNAPSHOT</version>

Should be:

  <version>1.1.0</version>

Hope this helps

Andy




Re: Memory under-utilization

2014-09-16 Thread francisco
Thanks for the tip.

http://localhost:4040/executors/ is showing 
Executors(1)
Memory: 0.0 B used (294.9 MB Total)
Disk: 0.0 B Used

However, running as standalone cluster does resolve the problem.
I can see a worker process running w/ the allocated memory.

My conclusion (I may be wrong) is that in 'local' mode the 'executor-memory'
parameter is not honored.

Thanks again for the help!






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396p14409.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Sandy Ryza
Hi team didata,

This doesn't directly answer your question, but with Spark 1.1, instead of
using the driver options, it's better to pass your Spark properties using
the "conf" option.

E.g.
pyspark --master yarn-client --conf spark.shuffle.spill=true --conf
spark.yarn.executor.memoryOverhead=512M

Additionally, driver and executor memory have dedicated options:

pyspark --master yarn-client --conf spark.shuffle.spill=true --conf
spark.yarn.executor.memoryOverhead=512M --driver-memory 3G
--executor-memory 5G
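
The same properties can also be set programmatically before the SparkContext
is created; a rough sketch (note that spark.yarn.executor.memoryOverhead takes
a plain number of megabytes, and that driver memory has to be set before the
JVM starts, e.g. via --driver-memory, so it isn't included here):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .set("spark.shuffle.spill", "true")
        .set("spark.yarn.executor.memoryOverhead", "512")   # MB
        .set("spark.executor.memory", "5g"))

sc = SparkContext(conf=conf)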

-Sandy


On Tue, Sep 16, 2014 at 6:22 PM, Dimension Data, LLC. <
subscripti...@didata.us> wrote:

>
>
> Hello friends:
>
> Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN distribution.
> Everything went fine, and everything seems
> to work, but for the following.
>
> Following are two invocations of the 'pyspark' script, one with enclosing
> quotes around the options passed to
> '--driver-java-options', and one without them. I added the following
> one-line in the 'pyspark' script to
> show my problem...
>
> ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx" # Added after the line that
> exports this variable.
>
> =
>
> FIRST:
> [ without enclosing quotes ]:
>
> user@linux$ pyspark --master yarn-client --driver-java-options
> -Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M
> -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3
> -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
> xxx --master yarn-client --driver-java-options
> -Dspark.executor.memory=1Gxxx  <--- echo statement show option truncation.
>
> While this succeeds in getting to a pyspark shell prompt (sc), the context
> isn't setup properly because, as seen
> in red above and below, all but the first option took effect. (Note
> spark.executor.memory is correct but that's only because
> my spark defaults coincide with it.)
>
> 14/09/16 17:35:32 INFO yarn.Client:   command: $JAVA_HOME/bin/java -server
> -Xmx512m -Djava.io.tmpdir=$PWD/tmp
> '-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89'
> '-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.memory=1G'
> '-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars='
> '-Dspark.submit.pyFiles='
> '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
> '-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress=' '-
> Dspark.app.name=PySparkShell' '-Dspark.driver.appUIAddress=dstorm:4040'
> '-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G'
> '-Dspark.fileserver.uri=http://192.168.0.16:60305'
> '-Dspark.driver.port=44616' '-Dspark.master=yarn-client'
> org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar
> null  --arg  'dstorm:44616' --executor-memory 1024 --executor-cores 1 
> --num-executors
> 2 1> /stdout 2> /stderr
>
> (Note: I happen to notice that 'spark.driver.memory' is missing as well).
>
> ===
>
> NEXT:
>
> [ So let's try with enclosing quotes ]
> user@linux$ pyspark --master yarn-client --driver-java-options
> '-Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M
> -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3
> -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
> xxx --master yarn-client --driver-java-options "-Dspark.executor.memory=1G
> -Dspark.ui.port=8468 -Dspark.driver.memory=512M
> -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3
> -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"xxx
>
> While this does have all the options (shown in the red echo output above
> and the command executed below), pyspark invocation fails, indicating
> that the application ended before I got to a shell prompt.
> See below snippet.
>
> 14/09/16 17:44:12 INFO yarn.Client:   command: $JAVA_HOME/bin/java -server
> -Xmx512m -Djava.io.tmpdir=$PWD/tmp
> '-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada'
> '-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M'
> '-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
> '-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.instances=3'
> '-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars='
> '-Dspark.submit.pyFiles=' '-Dspark.ui.port=8468'
> '-Dspark.driver.host=dstorm'
> '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
> '-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell'
> '-Dspark.driver.appUIAddress=dstorm:8468' '-D
> spark.yarn.executor.memoryOverhead=512M'
> '-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G
> -Dspark.ui.port=8468 -Dspark.driver.memory=512M
> -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3
> -Dspark.yarn.jar=hdfs://namenode:8020/user

Re: partitioned groupBy

2014-09-16 Thread Patrick Wendell
If each partition can fit in memory, you can do this using
mapPartitions and then building an inverse mapping within each
partition. You'd need to construct a hash map within each partition
yourself.
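
A minimal sketch of that idea (in PySpark for brevity, using the example data
from the question below):

def invert_partition(iterator):
    inverted = {}                        # per-partition hash map
    for key, values in iterator:
        for v in values:
            inverted.setdefault(v, []).append(key)
    return iter(inverted.items())

rdd = sc.parallelize(
    [("K1", ["V1", "V2"]), ("K2", ["V2"]), ("K3", ["V1"]), ("K4", ["V3"])], 2)

inverted = rdd.mapPartitions(invert_partition, preservesPartitioning=True)
print(inverted.glom().collect())         # one inverted mapping per partition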

On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya  wrote:
> I have a use case where my RDD is set up as follows:
>
> Partition 0:
> K1 -> [V1, V2]
> K2 -> [V2]
>
> Partition 1:
> K3 -> [V1]
> K4 -> [V3]
>
> I want to invert this RDD, but only within a partition, so that the
> operation does not require a shuffle.  It doesn't matter if the partitions
> of the inverted RDD have non unique keys across the partitions, for example:
>
> Partition 0:
> V1 -> [K1]
> V2 -> [K1, K2]
>
> Partition 1:
> V1 -> [K3]
> V3 -> [K4]
>
> Is there a way to do only a per-partition groupBy, instead of shuffling the
> entire data?
>




partitioned groupBy

2014-09-16 Thread Akshat Aranya
I have a use case where my RDD is set up as follows:

Partition 0:
K1 -> [V1, V2]
K2 -> [V2]

Partition 1:
K3 -> [V1]
K4 -> [V3]

I want to invert this RDD, but only within a partition, so that the
operation does not require a shuffle.  It doesn't matter if the partitions
of the inverted RDD have non-unique keys across the partitions, for example:

Partition 0:
V1 -> [K1]
V2 -> [K1, K2]

Partition 1:
V1 -> [K3]
V3 -> [K4]

Is there a way to do only a per-partition groupBy, instead of shuffling the
entire data?


Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.



Hello friends:

Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN distribution.
Everything went fine, and everything seems to work, but for the following.

Following are two invocations of the 'pyspark' script, one with enclosing
quotes around the options passed to '--driver-java-options', and one without
them. I added the following one-liner to the 'pyspark' script to show my
problem...

ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx" # Added after the line that 
exports this variable.


=

FIRST:
[ without enclosing quotes ]:

user@linux$ pyspark --master yarn-client --driver-java-options 
-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
xxx --master yarn-client --driver-java-options
-Dspark.executor.memory=1Gxxx <--- the echo statement shows the option truncation.


While this succeeds in getting to a pyspark shell prompt (sc), the context
isn't set up properly because, as seen in red above and below, all but the
first option took effect. (Note spark.executor.memory is correct, but that's
only because my spark defaults coincide with it.)

14/09/16 17:35:32 INFO yarn.Client:   command: $JAVA_HOME/bin/java 
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp 
'-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89' 
'-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.memory=1G' 
'-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars=' 
'-Dspark.submit.pyFiles=' 
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' 
'-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress=' 
'-Dspark.app.name=PySparkShell' 
'-Dspark.driver.appUIAddress=dstorm:4040' 
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G' 
'-Dspark.fileserver.uri=http://192.168.0.16:60305' 
'-Dspark.driver.port=44616' '-Dspark.master=yarn-client' 
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar  
null  --arg  'dstorm:44616' --executor-memory 1024 --executor-cores 1 
--num-executors  2 1> /stdout 2> /stderr


(Note: I happen to notice that 'spark.driver.memory' is missing as well).

===

NEXT:

[ So let's try with enclosing quotes ]
user@linux$ pyspark --master yarn-client --driver-java-options 
'-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
xxx --master yarn-client --driver-java-options 
"-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"xxx


While this does have all the options (shown in the red echo output above
and the command executed below), the pyspark invocation fails, indicating
that the application ended before I got to a shell prompt.
See the snippet below.

14/09/16 17:44:12 INFO yarn.Client:   command: $JAVA_HOME/bin/java 
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp 
'-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada' 
'-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M' 
'-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' 
'-Dspark.serializer.objectStreamReset=100' 
'-Dspark.executor.instances=3' '-Dspark.rdd.compress=True' 
'-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles=' 
'-Dspark.ui.port=8468' '-Dspark.driver.host=dstorm' 
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' 
'-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell' 
'-Dspark.driver.appUIAddress=dstorm:8468' 
'-Dspark.yarn.executor.memoryOverhead=512M' 
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G 
-Dspark.ui.port=8468 -Dspark.driver.memory=512M 
-Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' 
'-Dspark.fileserver.uri=http://192.168.0.16:54171' 
'-Dspark.master=yarn-client' '-Dspark.driver.port=58542' 
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar  
null  --arg  'dstorm:58542' --executor-memory 1024 --executor-cores 1 
--num-executors  3 1> /stdout 2> /stderr



[ ... SNIP ... ]
4/09/16 17:44:12 INFO cluster.YarnClientSchedulerBackend: Application 
report from ASM:

 appMasterRpcPort: -1
 appStartTime: 1410903852044
 yarnAppState: ACCEPTED

14/09/16 17:44:13 INFO cluster.YarnClientSchedulerBackend: Application 
report from ASM:

 appMasterRpcPort: -1
 appStartTime: 1410903852044
 yarnA

Re: org.apache.spark.SparkException: java.io.FileNotFoundException: does not exist)

2014-09-16 Thread Aris
This should be a really simple problem, but you haven't shared enough code
to determine what's going on here.

On Tue, Sep 16, 2014 at 8:08 AM, Hui Li  wrote:

> Hi,
>
> I am new to SPARK. I just set up a small cluster and wanted to run some
> simple MLLIB examples. By following the instructions of
> https://spark.apache.org/docs/0.9.0/mllib-guide.html#binary-classification-1,
> I could successfully run everything until the SVMWithSGD step, where I got
> the following error message. I don't know why
> file:/root/test/sample_svm_data.txt does not exist, since I already read it
> out, printed it, converted it into labeled data, and passed the parsed
> data to the function SVMWithSGD.
>
> Any one have the same issue with me?
>
> Thanks,
>
> Emily
>
>  val model = SVMWithSGD.train(parsedData, numIterations)
> 14/09/16 10:55:21 INFO SparkContext: Starting job: first at
> GeneralizedLinearAlgorithm.scala:121
> 14/09/16 10:55:21 INFO DAGScheduler: Got job 11 (first at
> GeneralizedLinearAlgorithm.scala:121) with 1 output partitions
> (allowLocal=true)
> 14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 11 (first at
> GeneralizedLinearAlgorithm.scala:121)
> 14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
> 14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
> 14/09/16 10:55:21 INFO DAGScheduler: Computing the requested partition
> locally
> 14/09/16 10:55:21 INFO HadoopRDD: Input split:
> file:/root/test/sample_svm_data.txt:0+19737
> 14/09/16 10:55:21 INFO SparkContext: Job finished: first at
> GeneralizedLinearAlgorithm.scala:121, took 0.002697478 s
> 14/09/16 10:55:21 INFO SparkContext: Starting job: count at
> DataValidators.scala:37
> 14/09/16 10:55:21 INFO DAGScheduler: Got job 12 (count at
> DataValidators.scala:37) with 2 output partitions (allowLocal=false)
> 14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 12 (count at
> DataValidators.scala:37)
> 14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
> 14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
> 14/09/16 10:55:21 INFO DAGScheduler: Submitting Stage 12 (FilteredRDD[26]
> at filter at DataValidators.scala:37), which has no missing parents
> 14/09/16 10:55:21 INFO DAGScheduler: Submitting 2 missing tasks from Stage
> 12 (FilteredRDD[26] at filter at DataValidators.scala:37)
> 14/09/16 10:55:21 INFO TaskSchedulerImpl: Adding task set 12.0 with 2 tasks
> 14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:0 as TID 24 on
> executor 2: eecvm0206.demo.sas.com (PROCESS_LOCAL)
> 14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:0 as 1733
> bytes in 0 ms
> 14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:1 as TID 25 on
> executor 5: eecvm0203.demo.sas.com (PROCESS_LOCAL)
> 14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:1 as 1733
> bytes in 0 ms
> 14/09/16 10:55:21 WARN TaskSetManager: Lost TID 24 (task 12.0:0)
> 14/09/16 10:55:21 WARN TaskSetManager: Loss was due to
> java.io.FileNotFoundException
> java.io.FileNotFoundException: File file:/root/test/sample_svm_data.txt
> does not exist
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:156)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at
> org.apache.spark.scheduler.Re

MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-16 Thread Aris
Hello Spark Community -

I am using the support vector machine / SVM implementation in MLlib with
the standard linear kernel; however, I noticed that the Spark documentation
for StandardScaler *specifically* mentions that SVMs which use the RBF
kernel work really well when you have standardized data...

which begs the question: is there some kind of support for RBF kernels
rather than linear kernels? In small data tests using R, the RBF kernel
worked really well and the linear kernel never converged... so I would really
like to use RBF.

Thank you folks for any help!

Aris


Re: Memory under-utilization

2014-09-16 Thread Boromir Widas
I see, what does http://localhost:4040/executors/ show for memory usage?

I personally find it easier to work with a standalone cluster with a single
worker by using the sbin/start-master.sh and then connecting to the master.

On Tue, Sep 16, 2014 at 6:04 PM, francisco  wrote:

> Thanks for the reply.
>
> I doubt that's the case though ...  the executor kept having to do a file
> dump because memory is full.
>
> ...
> 14/09/16 15:00:18 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
> MB to disk (668 times so far)
> 14/09/16 15:00:21 WARN ExternalAppendOnlyMap: Spilling in-memory map of 66
> MB to disk (669 times so far)
> 14/09/16 15:00:24 WARN ExternalAppendOnlyMap: Spilling in-memory map of 70
> MB to disk (670 times so far)
> 14/09/16 15:00:31 WARN ExternalAppendOnlyMap: Spilling in-memory map of 127
> MB to disk (671 times so far)
> 14/09/16 15:00:43 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
> MB to disk (672 times so far)
> ...
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396p14399.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Categorical Features for K-Means Clustering

2014-09-16 Thread Aris
Yeah - another vote here for what's called one-hot encoding: just convert
the single categorical feature into N columns, where N is the number of
distinct values of that feature, with a single one and all the other
features/columns set to zero.
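
As a concrete sketch of that encoding in plain PySpark (no MLlib utility; the
category values below are made up):

raw = sc.parallelize(["red", "green", "blue", "green"])   # made-up categorical feature

# Build a value -> column-index mapping on the driver.
categories = sorted(raw.distinct().collect())
index = dict((v, i) for i, v in enumerate(categories))
n = len(categories)

def one_hot(value):
    vec = [0.0] * n
    vec[index[value]] = 1.0
    return vec

print(raw.map(one_hot).collect())
# [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]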

On Tue, Sep 16, 2014 at 2:16 PM, Sean Owen  wrote:

> I think it's on the table but not yet merged?
> https://issues.apache.org/jira/browse/SPARK-1216
>
> On Tue, Sep 16, 2014 at 10:04 PM, st553  wrote:
> > Does MLlib provide utility functions to do this kind of encoding?
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How do I manipulate values outside of a GraphX loop?

2014-09-16 Thread crockpotveggies
I'm brand new to Apache Spark and I'm a little confused about how to make
updates to a value that sits outside of a .mapTriplets iteration in GraphX.
I'm aware mapTriplets is really only for modifying values inside the graph.
What about using it in conjunction with other computations? See below:

def mapTripletsMethod(edgeWeights: Graph[Int, Double],
stationaryDistribution: Graph[Double, Double]) = {
  val tempMatrix: SparseDoubleMatrix2D = graphToSparseMatrix(edgeWeights)

  stationaryDistribution.mapTriplets{ e =>
  val row = e.srcId.toInt
  val column = e.dstId.toInt
  var cellValue = -1 * tempMatrix.get(row, column) + e.dstAttr
  tempMatrix.set(row, column, cellValue) // this doesn't do anything to
tempMatrix
  e
}
}

My first guess is that since the input param for mapTriplets is a function,
then somehow the value for tempMatrix is getting lost or duplicated.

If this really isn't the proper design pattern to do this, does anyone have
any insight on the "right way"? Implementing some of the more complex
centrality algorithms is rather difficult until I can wrap my head around
this paradigm.

Thank you!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-manipulate-values-outside-of-a-GraphX-loop-tp14400.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Memory under-utilization

2014-09-16 Thread francisco
Thanks for the reply.

I doubt that's the case though ...  the executor kept having to do a file
dump because memory is full.

...
14/09/16 15:00:18 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
MB to disk (668 times so far)
14/09/16 15:00:21 WARN ExternalAppendOnlyMap: Spilling in-memory map of 66
MB to disk (669 times so far)
14/09/16 15:00:24 WARN ExternalAppendOnlyMap: Spilling in-memory map of 70
MB to disk (670 times so far)
14/09/16 15:00:31 WARN ExternalAppendOnlyMap: Spilling in-memory map of 127
MB to disk (671 times so far)
14/09/16 15:00:43 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
MB to disk (672 times so far)
...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396p14399.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Questions about Spark speculation

2014-09-16 Thread Nicolas Mai
Hi, guys

My current project is using Spark 0.9.1, and after increasing the level of
parallelism and partitions in our RDDs, stages and tasks seem to complete
much faster. However, it also seems that our cluster becomes more "unstable"
after some time:
- stalled stages still showing under "active stages" in the Spark app web
dashboard
- incomplete stages showing under "completed stages"
- stages with failures

I was thinking about reducing/tuning the level of parallelism, but I was
also considering using "spark.speculation", which is currently turned off but
seems promising.

Questions about speculation:
- Just wondering why it is turned off by default?
- Are there any risks using speculation?
- Is it possible that a speculative task itself straggles and triggers
another new speculative task to finish the job... and so on (some kind of
loop until there are no more executors available)?
- What configuration do you guys usually use for spark.speculation?
(interval, quantile, multiplier) I guess it depends on the project, it may
give some ideas about how to use it properly.
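
For reference, this is roughly how those knobs would be set programmatically
(a sketch only; the values shown are just the documented defaults of that era,
not a recommendation):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("speculation-example")
        .set("spark.speculation", "true")
        .set("spark.speculation.interval", "100")    # ms between speculation checks
        .set("spark.speculation.quantile", "0.75")   # fraction of tasks done before checking
        .set("spark.speculation.multiplier", "1.5")) # how much slower than the median counts as a straggler

sc = SparkContext(conf=conf)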

Thank you! :)
Nicolas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Questions-about-Spark-speculation-tp14398.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Memory under-utilization

2014-09-16 Thread Boromir Widas
Perhaps your job does not use more than 9g. Even though the dashboard shows
64g, the process only uses what's needed and grows to 64g max.

On Tue, Sep 16, 2014 at 5:40 PM, francisco  wrote:

> Hi, I'm a Spark newbie.
>
> We had installed spark-1.0.2-bin-cdh4 on a 'super machine' with 256gb
> memory
> and 48 cores.
>
> Tried to allocate a task with 64gb memory but for whatever reason Spark is
> only using around 9gb max.
>
> Submitted spark job with the following command:
> "
> /bin/spark-submit -class SimpleApp --master local[16] --executor-memory 64G
> /var/tmp/simple-project_2.10-1.0.jar /data/lucene/ns.gz
> "
>
> When I run 'top' command I see only 9gb of memory is used by the spark
> process
>
> PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> 3047005 fran  30  10 8785m 703m  18m S 112.9  0.3  48:19.63 java
>
>
> Any idea why this is happening? I've also tried to set the memory
> programmatically using
> " new SparkConf().set("spark.executor.memory", "64g") " but that also
> didn't
> do anything.
>
> Is there some limitation when running in 'local' mode?
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Memory under-utilization

2014-09-16 Thread francisco
Hi, I'm a Spark newbie.

We had installed spark-1.0.2-bin-cdh4 on a 'super machine' with 256gb memory
and 48 cores. 

Tried to allocate a task with 64gb memory but for whatever reason Spark is
only using around 9gb max.

Submitted spark job with the following command:
"
/bin/spark-submit -class SimpleApp --master local[16] --executor-memory 64G
/var/tmp/simple-project_2.10-1.0.jar /data/lucene/ns.gz
"

When I run 'top' command I see only 9gb of memory is used by the spark
process

PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
3047005 fran  30  10 8785m 703m  18m S 112.9  0.3  48:19.63 java


Any idea why this is happening? I've also tried to set the memory
programmatically using
" new SparkConf().set("spark.executor.memory", "64g") " but that also didn't
do anything.

Is there some limitation when running in 'local' mode?

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Categorical Features for K-Means Clustering

2014-09-16 Thread Sean Owen
I think it's on the table but not yet merged?
https://issues.apache.org/jira/browse/SPARK-1216

On Tue, Sep 16, 2014 at 10:04 PM, st553  wrote:
> Does MLlib provide utility functions to do this kind of encoding?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>




Re: Categorical Features for K-Means Clustering

2014-09-16 Thread st553
Does MLlib provide utility functions to do this kind of encoding?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Indexed RDD

2014-09-16 Thread Akshat Aranya
Hi,

I'm trying to implement a custom RDD that essentially works as a
distributed hash table, i.e. the key space is split up into partitions and,
within a partition, an element can be looked up efficiently by its key.
However, the RDD lookup() function (in PairRDDFunctions) is implemented in
a way that iterates through all elements of a partition and finds the matching
ones.  Is there a better way to do what I want to do, short of just
implementing new methods on the custom RDD?
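
For concreteness, the kind of access pattern I have in mind looks roughly like
the PySpark sketch below (partition count, data and the hashing assumption are
placeholders): each partition holds a dict, and a lookup only runs a job on the
partition that owns the key.

num_partitions = 8                                    # placeholder

pairs = sc.parallelize([(i, str(i)) for i in range(1000)])
table = (pairs.partitionBy(num_partitions)
              .mapPartitions(lambda it: [dict(it)],   # one dict per partition
                             preservesPartitioning=True)
              .cache())

def lookup(key):
    # Assumes simple keys where hash(key) matches the default partitioning.
    part = hash(key) % num_partitions
    [value] = sc.runJob(table,
                        lambda dicts: [d.get(key) for d in dicts],
                        partitions=[part])
    return value

print(lookup(42))   # '42'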

Thanks,
Akshat


Spark processing small files.

2014-09-16 Thread cem
Hi all,

Spark is taking too much time to start the first stage with many small
files in HDFS.

I am reading a folder that contains RC files:

sc.hadoopFile("hdfs://hostname :8020/test_data2gb/",
classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
classOf[LongWritable], classOf[BytesRefArrayWritable])

And parse:
 val parsedData = file.map((tuple: (LongWritable, BytesRefArrayWritable))
=>  RCFileUtil.getData(tuple._2))

620 3MB files (2GB total) take considerably more time to start the first
stage than 200 40MB files (8GB total).


Do you have any idea about the reason? Thanks!

Best Regards,
Cem Cayiroglu


Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
Hi Sean,

Great catch! Yes I was including Spark as a dependency and it was
making its way into my uber jar.  Following the advice I just found at
Stackoverflow[1],  I marked Spark as a provided dependency and that
appeared to fix my Hadoop client issue.  Thanks for your help!!!
Perhaps the maintainers might consider setting this in the Quickstart
guide pom.xml ( http://spark.apache.org/docs/latest/quick-start.html ).

In summary, here's what worked:
 * Hadoop 2.3 cdh5
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz
 * Spark 1.1 for Hadoop 2.3
http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.3.tgz

pom.xml snippets: https://gist.github.com/ypwais/ff188611d4806aa05ed9

[1] 
http://stackoverflow.com/questions/24747037/how-to-define-a-dependency-scope-in-maven-to-include-a-library-in-compile-run

Thanks everybody!!
-Paul


On Tue, Sep 16, 2014 at 3:55 AM, Sean Owen  wrote:
> From the caller / application perspective, you don't care what version
> of Hadoop Spark is running on in the cluster. The Spark API you
> compile against is the same. When you spark-submit the app, at
> runtime, Spark is using the Hadoop libraries from the cluster, which
> are the right version.
>
> So when you build your app, you mark Spark as a 'provided' dependency.
> Therefore in general, no, you do not build Spark for yourself if you
> are a Spark app creator.
>
> (Of course, your app would care if it were also using Hadoop libraries
> directly. In that case, you will want to depend on hadoop-client, and
> the right version for your cluster, but still mark it as provided.)
>
> The version Spark is built against only matters when you are deploying
> Spark's artifacts on the cluster to set it up.
>
> Your error suggests there is still a version mismatch. Either you
> deployed a build that was not compatible, or, maybe you are packaging
> a version of Spark with your app which is incompatible and
> interfering.
>
> For example, the artifacts you get via Maven depend on Hadoop 1.0.4. I
> suspect that's what you're doing -- packaging Spark(+Hadoop1.0.4) with
> your app, when it shouldn't be packaged.
>
> Spark works out of the box with just about any modern combo of HDFS and YARN.
>
> On Tue, Sep 16, 2014 at 2:28 AM, Paul Wais  wrote:
>> Dear List,
>>
>> I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
>> reading SequenceFiles.  In particular, I'm seeing:
>>
>> Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
>> Server IPC version 7 cannot communicate with client version 4
>> at org.apache.hadoop.ipc.Client.call(Client.java:1070)
>> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>> at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
>> ...
>>
>> When invoking JavaSparkContext#newAPIHadoopFile().  (With args
>> validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
>> BytesWritable.class, new Job().getConfiguration() -- Pretty close to
>> the unit test here:
>> https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
>> )
>>
>>
>> This error indicates to me that Spark is using an old hadoop client to
>> do reads.  Oddly I'm able to do /writes/ ok, i.e. I'm able to write
>> via JavaPairRdd#saveAsNewAPIHadoopFile() to my hdfs cluster.
>>
>>
>> Do I need to explicitly build spark for modern hadoop??  I previously
>> had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
>> error (server is using version 9, client is using version 4).
>>
>>
>> I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
>> on spark's site:
>>  * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
>>  * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz
>>
>>
>> What distro of hadoop is used at Data Bricks?  Are there distros of
>> Spark 1.1 and hadoop that should work together out-of-the-box?
>> (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)
>>
>> Thanks for any help anybody can give me here!
>> -Paul
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>




R: Spark as a Library

2014-09-16 Thread Paolo Platter
Hi,

Spark Job Server by Ooyala is the right tool for the job. It exposes a REST API, 
so calling it from a web app is suitable.
It's open source; you can find it on GitHub.

Best

Paolo Platter

From: Ruebenacker, Oliver A
Sent: 16/09/2014 21.18
To: Matei Zaharia; 
user@spark.apache.org
Subject: RE: Spark as a Library


 Hello,

  Thanks for the response and great to hear it is possible. But how do I 
connect to Spark without using the submit script?

  I know how to start up a master and some workers and then connect to the 
master by packaging the app that contains the SparkContext and then submitting 
the package with the spark-submit script in standalone-mode. But I don’t want 
to submit the app that contains the SparkContext via the script, because I want 
that app to be running on a web server. So, what are other ways to connect to 
Spark? I can’t find in the docs anything other than using the script. Thanks!

 Best, Oliver

From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Tuesday, September 16, 2014 1:31 PM
To: Ruebenacker, Oliver A; user@spark.apache.org
Subject: Re: Spark as a Library

If you want to run the computation on just one machine (using Spark's local 
mode), it can probably run in a container. Otherwise you can create a 
SparkContext there and connect it to a cluster outside. Note that I haven't 
tried this though, so the security policies of the container might be too 
restrictive. In that case you'd have to run the app outside and expose an RPC 
interface between them.

Matei


On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A 
(oliver.ruebenac...@altisource.com) 
wrote:

 Hello,

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 Best, Oliver

Oliver Ruebenacker | Solutions Architect

Altisource™
290 Congress St, 7th Floor | Boston, Massachusetts 02210
P: (617) 728-5582 | ext: 275585
oliver.ruebenac...@altisource.com | 
www.Altisource.com

***

This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***


Re: Spark as a Library

2014-09-16 Thread Daniel Siegmann
You can create a new SparkContext inside your container pointed to your
master. However, for your script to run you must call addJars to put the
code on your workers' classpaths (except when running locally).

Hopefully your webapp has some lib folder which you can point to as a
source for the jars. In the Play Framework you can use
play.api.Play.application.getFile("lib") to get a path to the lib directory
and get the contents. Of course that only works on the packaged web app.
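As a rough sketch (the master URL, app name, and jar path below are made up for
illustration), the wiring looks something like this:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("webapp-embedded-spark")          // placeholder name
  .setMaster("spark://your-master-host:7077")   // your standalone master
  .setJars(Seq("lib/my-spark-jobs.jar"))        // code the workers need on their classpath
val sc = new SparkContext(conf)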

On Tue, Sep 16, 2014 at 3:17 PM, Ruebenacker, Oliver A <
oliver.ruebenac...@altisource.com> wrote:

>
>
>  Hello,
>
>
>
>   Thanks for the response and great to hear it is possible. But how do I
> connect to Spark without using the submit script?
>
>
>
>   I know how to start up a master and some workers and then connect to the
> master by packaging the app that contains the SparkContext and then
> submitting the package with the spark-submit script in standalone-mode. But
> I don’t want to submit the app that contains the SparkContext via the
> script, because I want that app to be running on a web server. So, what are
> other ways to connect to Spark? I can’t find in the docs anything other
> than using the script. Thanks!
>
>
>
>  Best, Oliver
>
>
>
> *From:* Matei Zaharia [mailto:matei.zaha...@gmail.com]
> *Sent:* Tuesday, September 16, 2014 1:31 PM
> *To:* Ruebenacker, Oliver A; user@spark.apache.org
> *Subject:* Re: Spark as a Library
>
>
>
> If you want to run the computation on just one machine (using Spark's
> local mode), it can probably run in a container. Otherwise you can create a
> SparkContext there and connect it to a cluster outside. Note that I haven't
> tried this though, so the security policies of the container might be too
> restrictive. In that case you'd have to run the app outside and expose an
> RPC interface between them.
>
>
>
> Matei
>
>
>
> On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A (
> oliver.ruebenac...@altisource.com) wrote:
>
>
>
>  Hello,
>
>
>
>   Suppose I want to use Spark from an application that I already submit to
> run in another container (e.g. Tomcat). Is this at all possible? Or do I
> have to split the app into two components, and submit one to Spark and one
> to the other container? In that case, what is the preferred way for the two
> components to communicate with each other? Thanks!
>
>
>
>  Best, Oliver
>
>
>
> Oliver Ruebenacker | Solutions Architect
>
>
>
> Altisource™
>
> 290 Congress St, 7th Floor | Boston, Massachusetts 02210
>
> P: (617) 728-5582 | ext: 275585
>
> oliver.ruebenac...@altisource.com | www.Altisource.com
>
>
>
> ***
>
>
> This email message and any attachments are intended solely for the use of
> the addressee. If you are not the intended recipient, you are prohibited
> from reading, disclosing, reproducing, distributing, disseminating or
> otherwise using this transmission. If you have received this message in
> error, please promptly notify the sender by reply email and immediately
> delete this message from your system. This message and any attachments
> may contain information that is confidential, privileged or exempt from
> disclosure. Delivery of this message to any person other than the intended
> recipient is not intended to waive any right or privilege. Message
> transmission is not guaranteed to be secure or free of software viruses.
>
> ***



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: NullWritable not serializable

2014-09-16 Thread Du Li
Hi,

The test case is separated out as follows. The call to rdd2.first() breaks when 
spark version is changed to 1.1.0, reporting exception NullWritable not 
serializable. However, the same test passed with spark 1.0.2. The pom.xml file 
is attached. The test data README.md was copied from spark.

Thanks,
Du
-

package com.company.project.test

import org.scalatest._

class WritableTestSuite extends FunSuite {
  test("generated sequence file should be readable from spark") {
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

val conf = new SparkConf(false).setMaster("local").setAppName("test data 
exchange with spark")
val sc = new SparkContext(conf)

val rdd = sc.textFile("README.md")
val res = rdd.map(x => (NullWritable.get(), new Text(x)))
res.saveAsSequenceFile("./test_data")

val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], 
classOf[Text])

assert(rdd.first == rdd2.first._2.toString)
  }
}



From: Matei Zaharia mailto:matei.zaha...@gmail.com>>
Date: Monday, September 15, 2014 at 10:52 PM
To: Du Li mailto:l...@yahoo-inc.com>>
Cc: "user@spark.apache.org" 
mailto:user@spark.apache.org>>, 
"d...@spark.apache.org" 
mailto:d...@spark.apache.org>>
Subject: Re: NullWritable not serializable

Can you post the exact code for the test that worked in 1.0? I can't think of 
much that could've changed. The one possibility is if  we had some operations 
that were computed locally on the driver (this happens with things like first() 
and take(), which will try to do the first partition locally). But generally 
speaking these operations should *not* work over a network, so you'll have to 
make sure that you only send serializable types through shuffles or collects, 
or use a serialization framework like Kryo that might be okay with Writables.

Matei


On September 15, 2014 at 9:13:13 PM, Du Li 
(l...@yahoo-inc.com) wrote:

Hi Matei,

Thanks for your reply.

The Writable classes have never been serializable and this is why it is weird. 
I did try as you suggested to map the Writables to integers and strings. It 
didn’t pass, either. Similar exceptions were thrown, except that the messages said 
IntWritable and Text are not serializable. The reason lies in the implicits 
defined in the SparkContext object that convert those values into their 
corresponding Writable classes before saving the data in a sequence file.

My original code was actually some test cases to try out SequenceFile-related 
APIs. The tests all passed when the spark version was specified as 1.0.2. But 
this one failed after I changed the spark version to 1.1.0 the new release, 
nothing else changed. In addition, it failed when I called rdd2.collect(), 
take(1), and first(). But it worked fine when calling rdd2.count(). As you can 
see, count() does not need to serialize and ship data while the other three 
methods do.

Do you recall any difference between spark 1.0 and 1.1 that might cause this 
problem?

Thanks,
Du


From: Matei Zaharia mailto:matei.zaha...@gmail.com>>
Date: Friday, September 12, 2014 at 9:10 PM
To: Du Li mailto:l...@yahoo-inc.com.invalid>>, 
"user@spark.apache.org" 
mailto:user@spark.apache.org>>, 
"d...@spark.apache.org" 
mailto:d...@spark.apache.org>>
Subject: Re: NullWritable not serializable

Hi Du,

I don't think NullWritable has ever been serializable, so you must be doing 
something differently from your previous program. In this case though, just use 
a map() to turn your Writables to serializable types (e.g. null and String).
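
For example, a rough sketch of that workaround against the snippet below (untested here):

import org.apache.hadoop.io.{NullWritable, Text}

// Convert the Writables to plain serializable types before collect()/first().
val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])
val strings = rdd2.map { case (_, text) => text.toString }  // Text -> String copy
strings.first()  // only serializable Strings get shipped back to the driver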

Matei


On September 12, 2014 at 8:48:36 PM, Du Li 
(l...@yahoo-inc.com.invalid) wrote:

Hi,

I was trying the following on spark-shell (built with apache master and hadoop 
2.4.0). Both calling rdd2.collect and calling rdd3.collect threw 
java.io.NotSerializableException: org.apache.hadoop.io.NullWritable.

I got the same problem in similar code of my app which uses the newly released 
Spark 1.1.0 under hadoop 2.4.0. Previously it worked fine with spark 1.0.2 
under either hadoop 2.4.0 or 0.23.10.

Anybody knows what caused the problem?

Thanks,
Du


import org.apache.hadoop.io.{NullWritable, Text}
val rdd = sc.textFile("README.md")
val res = rdd.map(x => (NullWritable.get(), new Text(x)))
res.saveAsSequenceFile("./test_data")
val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])
rdd2.collect
val rdd3 = sc.sequenceFile[NullWritable,Text]("./test_data")
rdd3.collect




pom.xml
Description: pom.xml

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

RE: Spark as a Library

2014-09-16 Thread Ruebenacker, Oliver A

 Hello,

  Thanks for the response and great to hear it is possible. But how do I 
connect to Spark without using the submit script?

  I know how to start up a master and some workers and then connect to the 
master by packaging the app that contains the SparkContext and then submitting 
the package with the spark-submit script in standalone-mode. But I don’t want 
to submit the app that contains the SparkContext via the script, because I want 
that app to be running on a web server. So, what are other ways to connect to 
Spark? I can’t find in the docs anything other than using the script. Thanks!

 Best, Oliver

From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Tuesday, September 16, 2014 1:31 PM
To: Ruebenacker, Oliver A; user@spark.apache.org
Subject: Re: Spark as a Library

If you want to run the computation on just one machine (using Spark's local 
mode), it can probably run in a container. Otherwise you can create a 
SparkContext there and connect it to a cluster outside. Note that I haven't 
tried this though, so the security policies of the container might be too 
restrictive. In that case you'd have to run the app outside and expose an RPC 
interface between them.

Matei


On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A 
(oliver.ruebenac...@altisource.com) 
wrote:

 Hello,

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 Best, Oliver

Oliver Ruebenacker | Solutions Architect

Altisource™
290 Congress St, 7th Floor | Boston, Massachusetts 02210
P: (617) 728-5582 | ext: 275585
oliver.ruebenac...@altisource.com | 
www.Altisource.com

***

This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***


RDD projection and sorting

2014-09-16 Thread Sameer Tilak

Hi All,
I have data in the following format:
The 1st column is a user id and the second column onward are class ids for various 
products. I want to save this in LibSVM format, and an intermediate step is to 
sort the class ids in ascending order. For example:

Input:
uid1   1243 3580 2670 122 2593 1782 1526
uid2   121 285 2447 1516 343 385 1200 912 1430 5824 1451 8931 1271 1088 2584 1664 5481

Desired output:
uid1   122 1243 1526 1782 2593 2670 3580
uid2   121 285 343 385 912 1088 1200 1271 1430 1451 1516 1664 2447 2584 5481 5824 8931

Can someone please point me in the right direction? If I read the file with 
val data = sc.textFile(...), how do I project from column 1 to the end (not 
including column 0) and then sort those projected columns?
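
To make the question concrete, is something along these lines the right approach?
(A rough sketch with placeholder paths, assuming whitespace-separated fields.)

// Split each line, keep the user id, sort the remaining class ids ascending.
val data = sc.textFile("hdfs:///path/to/input.txt")        // placeholder path
val sortedPerUser = data.map { line =>
  val fields = line.split("\\s+")
  (fields.head, fields.tail.map(_.toInt).sorted)           // (uid, ascending class ids)
}
sortedPerUser
  .map { case (uid, ids) => (uid +: ids.map(_.toString)).mkString(" ") }
  .saveAsTextFile("hdfs:///path/to/output")                // placeholder path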
  

Re: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread Yin Huai
I meant it may be a Hive bug since we also call Hive's drop table
internally.

On Tue, Sep 16, 2014 at 1:44 PM, Yin Huai  wrote:

> Seems https://issues.apache.org/jira/browse/HIVE-5474 is related?
>
> On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao  wrote:
>
>> Thank you for pasting the steps, I will look at this, hopefully come out
>> with a solution soon.
>>
>> -Original Message-
>> From: linkpatrickliu [mailto:linkpatrick...@live.com]
>> Sent: Tuesday, September 16, 2014 3:17 PM
>> To: u...@spark.incubator.apache.org
>> Subject: RE: SparkSQL 1.1 hang when "DROP" or "LOAD"
>>
>> Hi, Hao Cheng.
>>
>> I have done other tests. And the result shows the thriftServer can
>> connect to Zookeeper.
>>
>> However, I found some more interesting things. And I think I have found a
>> bug!
>>
>> Test procedure:
>> Test1:
>> (0) Use beeline to connect to thriftServer.
>> (1) Switch database "use dw_op1"; (OK)
>> The logs show that the thriftServer connected with Zookeeper and acquired
>> locks.
>>
>> (2) Drop table "drop table src;" (Blocked) The logs show that the
>> thriftServer is "acquireReadWriteLocks".
>>
>> Doubt:
>> The reason why I cannot drop table src is because the first SQL "use
>> dw_op1"
>> has left locks in Zookeeper that were not successfully released.
>> So when the second SQL is acquiring locks in Zookeeper, it will block.
>>
>> Test2:
>> Restart thriftServer.
>> Instead of switching to another database, I just drop the table in the
>> default database;
>> (0) Restart thriftServer & use beeline to connect to thriftServer.
>> (1) Drop table "drop table src"; (OK)
>> Amazing! Succeed!
>> (2) Drop again!  "drop table src2;" (Blocked) Same error: the
>> thriftServer is blocked in the "acquireReadWriteLocks"
>> phrase.
>>
>> As you can see.
>> Only the first SQL requiring locks can succeed.
>> So I think the reason is that the thriftServer cannot release locks
>> correctly in Zookeeper.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
>> commands, e-mail: user-h...@spark.apache.org
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread Yin Huai
Seems https://issues.apache.org/jira/browse/HIVE-5474 is related?

On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao  wrote:

> Thank you for pasting the steps, I will look at this, hopefully come out
> with a solution soon.
>
> -Original Message-
> From: linkpatrickliu [mailto:linkpatrick...@live.com]
> Sent: Tuesday, September 16, 2014 3:17 PM
> To: u...@spark.incubator.apache.org
> Subject: RE: SparkSQL 1.1 hang when "DROP" or "LOAD"
>
> Hi, Hao Cheng.
>
> I have done other tests. And the result shows the thriftServer can connect
> to Zookeeper.
>
> However, I found some more interesting things. And I think I have found a
> bug!
>
> Test procedure:
> Test1:
> (0) Use beeline to connect to thriftServer.
> (1) Switch database "use dw_op1"; (OK)
> The logs show that the thriftServer connected with Zookeeper and acquired
> locks.
>
> (2) Drop table "drop table src;" (Blocked) The logs show that the
> thriftServer is "acquireReadWriteLocks".
>
> Doubt:
> The reason why I cannot drop table src is because the first SQL "use
> dw_op1"
> has left locks in Zookeeper that were not successfully released.
> So when the second SQL is acquiring locks in Zookeeper, it will block.
>
> Test2:
> Restart thriftServer.
> Instead of switching to another database, I just drop the table in the
> default database;
> (0) Restart thriftServer & use beeline to connect to thriftServer.
> (1) Drop table "drop table src"; (OK)
> Amazing! Succeed!
> (2) Drop again!  "drop table src2;" (Blocked) Same error: the thriftServer
> is blocked in the "acquireReadWriteLocks"
> phase.
>
> As you can see.
> Only the first SQL requiring locks can succeed.
> So I think the reason is that the thriftServer cannot release locks
> correctly in Zookeeper.
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RE: HBase and non-existent TableInputFormat

2014-09-16 Thread abraham.jacob
Yes that was very helpful… ☺

Here are a few more I found on my quest to get HBase working with Spark –

This one details about Hbase dependencies and spark classpaths

http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html

This one has a code overview –

http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html
http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase

All of them were very helpful…



From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
Sent: Tuesday, September 16, 2014 10:30 AM
To: Jacob, Abraham (Financial&Risk)
Cc: tq00...@gmail.com; user
Subject: Re: HBase and non-existent TableInputFormat

Btw, there are some examples in the Spark GitHub repo that you may find 
helpful. Here's 
one
 related to HBase.

On Tue, Sep 16, 2014 at 1:22 PM, 
mailto:abraham.ja...@thomsonreuters.com>> 
wrote:
Hi,

I had a similar situation in which I needed to read data from HBase and work 
with the data inside of a Spark context. After much googling, I finally got 
mine to work. There are a bunch of steps that you need to do to get this working –

The problem is that the spark context does not know anything about hbase, so 
you have to provide all the information about hbase classes to both the driver 
code and executor code…


SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);

sparkConf.set("spark.executor.extraClassPath", "$(hbase classpath)");  
//<=== you will need to add this to tell the executor about the classpath 
for HBase.


Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");


JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf, 
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);


The when you submit the spark job –


spark-submit --driver-class-path $(hbase classpath) --jars 
/usr/lib/hbase/hbase-server.jar,/usr/lib/hbase/hbase-client.jar,/usr/lib/hbase/hbase-common.jar,/usr/lib/hbase/hbase-protocol.jar,/usr/lib/hbase/lib/protobuf-java-2.5.0.jar,/usr/lib/hbase/lib/htrace-core.jar
 --class YourClassName --master local App.jar


Try this and see if it works for you.


From: Y. Dong [mailto:tq00...@gmail.com]
Sent: Tuesday, September 16, 2014 8:18 AM
To: user@spark.apache.org
Subject: HBase and non-existent TableInputFormat

Hello,

I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply read 
from hbase. The Java code is attached. However the problem is TableInputFormat 
does not even exist in hbase-client API, is there any other way I can read from
hbase? Thanks

SparkConf sconf = new SparkConf().setAppName(“App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);


Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");


JavaPairRDD hBaseRDD = sc.newAPIHadoopRDD(conf, 
TableInputFormat.class,org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);






Re: Spark as a Library

2014-09-16 Thread Soumya Simanta
It depends on what you want to do with Spark. The following has worked for
me.
Let the container handle the HTTP request and then talk to Spark using
another HTTP/REST interface. You can use the Spark Job Server for this.

Embedding Spark inside the container is not a great long term solution IMO
because you may see issues when you want to connect with a Spark cluster.



On Tue, Sep 16, 2014 at 11:16 AM, Ruebenacker, Oliver A <
oliver.ruebenac...@altisource.com> wrote:

>
>
>  Hello,
>
>
>
>   Suppose I want to use Spark from an application that I already submit to
> run in another container (e.g. Tomcat). Is this at all possible? Or do I
> have to split the app into two components, and submit one to Spark and one
> to the other container? In that case, what is the preferred way for the two
> components to communicate with each other? Thanks!
>
>
>
>  Best, Oliver
>
>
>
> Oliver Ruebenacker | Solutions Architect
>
>
>
> Altisource™
>
> 290 Congress St, 7th Floor | Boston, Massachusetts 02210
>
> P: (617) 728-5582 | ext: 275585
>
> oliver.ruebenac...@altisource.com | www.Altisource.com
>
>
>
>
> ***
>
> This email message and any attachments are intended solely for the use of
> the addressee. If you are not the intended recipient, you are prohibited
> from reading, disclosing, reproducing, distributing, disseminating or
> otherwise using this transmission. If you have received this message in
> error, please promptly notify the sender by reply email and immediately
> delete this message from your system.
> This message and any attachments may contain information that is
> confidential, privileged or exempt from disclosure. Delivery of this
> message to any person other than the intended recipient is not intended to
> waive any right or privilege. Message transmission is not guaranteed to be
> secure or free of software viruses.
>
> ***
>


Re: Spark as a Library

2014-09-16 Thread Matei Zaharia
If you want to run the computation on just one machine (using Spark's local 
mode), it can probably run in a container. Otherwise you can create a 
SparkContext there and connect it to a cluster outside. Note that I haven't 
tried this though, so the security policies of the container might be too 
restrictive. In that case you'd have to run the app outside and expose an RPC 
interface between them.

Matei

On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A 
(oliver.ruebenac...@altisource.com) wrote:

 

 Hello,

 

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 

 Best, Oliver

 

Oliver Ruebenacker | Solutions Architect

 

Altisource™

290 Congress St, 7th Floor | Boston, Massachusetts 02210

P: (617) 728-5582 | ext: 275585

oliver.ruebenac...@altisource.com | www.Altisource.com

 

***
This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***

Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Nicholas Chammas
Btw, there are some examples in the Spark GitHub repo that you may find
helpful. Here's one

related to HBase.

On Tue, Sep 16, 2014 at 1:22 PM,  wrote:

>  *Hi, *
>
>
>
> *I had a similar situation in which I needed to read data from HBase and
> work with the data inside of a spark context. After much ggling, I
> finally got mine to work. There are a bunch of steps that you need to do
> get this working – *
>
>
>
> *The problem is that the spark context does not know anything about hbase,
> so you have to provide all the information about hbase classes to both the
> driver code and executor code…*
>
>
>
>
>
> SparkConf sconf = *new* SparkConf().setAppName(“App").setMaster("local");
>
> JavaSparkContext sc = *new* JavaSparkContext(sconf);
>
>
>
> sparkConf.set("spark.executor.extraClassPath", "$(hbase classpath)");  
> //<===
> you will need to add this to tell the executor about the classpath for
> HBase.
>
>
>
> Configuration conf = HBaseConfiguration.*create*();
>
> conf.set(*TableInputFormat*.INPUT_TABLE, "Article");
>
>
>
> JavaPairRDD hBaseRDD = sc.
> *newAPIHadoopRDD*(conf, *TableInputFormat*.*class*
> ,org.apache.hadoop.hbase.io.ImmutableBytesWritable.*class*,
>
> org.apache.hadoop.hbase.client.Result.*class*);
>
>
>
>
>
> *The when you submit the spark job – *
>
>
>
>
>
> *spark-submit --driver-class-path $(hbase classpath) --jars
> /usr/lib/hbase/hbase-server.jar,/usr/lib/hbase/hbase-client.jar,/usr/lib/hbase/hbase-common.jar,/usr/lib/hbase/hbase-protocol.jar,/usr/lib/hbase/lib/protobuf-java-2.5.0.jar,/usr/lib/hbase/lib/htrace-core.jar
> --class YourClassName --master local App.jar *
>
>
>
>
>
> Try this and see if it works for you.
>
>
>
>
>
> *From:* Y. Dong [mailto:tq00...@gmail.com]
> *Sent:* Tuesday, September 16, 2014 8:18 AM
> *To:* user@spark.apache.org
> *Subject:* HBase and non-existent TableInputFormat
>
>
>
> Hello,
>
>
>
> I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply
> read from hbase. The Java code is attached. However the problem is
> TableInputFormat does not even exist in hbase-client API, is there any
> other way I can read from
>
> hbase? Thanks
>
>
>
> SparkConf sconf = *new* SparkConf().setAppName(“App").setMaster("local");
>
> JavaSparkContext sc = *new* JavaSparkContext(sconf);
>
>
>
> Configuration conf = HBaseConfiguration.*create*();
>
> conf.set(*TableInputFormat*.INPUT_TABLE, "Article");
>
>
>
> JavaPairRDD hBaseRDD = sc.
> *newAPIHadoopRDD*(conf, *TableInputFormat*.*class*
> ,org.apache.hadoop.hbase.io.ImmutableBytesWritable.*class*,
>
> org.apache.hadoop.hbase.client.Result.*class*);
>
>
>
>
>
>
>


RE: HBase and non-existent TableInputFormat

2014-09-16 Thread abraham.jacob
Hi,

I had a similar situation in which I needed to read data from HBase and work 
with the data inside of a Spark context. After much googling, I finally got 
mine to work. There are a bunch of steps that you need to do to get this working -

The problem is that the spark context does not know anything about hbase, so 
you have to provide all the information about hbase classes to both the driver 
code and executor code...


SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);

sparkConf.set("spark.executor.extraClassPath", "$(hbase classpath)");  
//<=== you will need to add this to tell the executor about the classpath 
for HBase.


Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");


JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf, 
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);


The when you submit the spark job -


spark-submit --driver-class-path $(hbase classpath) --jars 
/usr/lib/hbase/hbase-server.jar,/usr/lib/hbase/hbase-client.jar,/usr/lib/hbase/hbase-common.jar,/usr/lib/hbase/hbase-protocol.jar,/usr/lib/hbase/lib/protobuf-java-2.5.0.jar,/usr/lib/hbase/lib/htrace-core.jar
 --class YourClassName --master local App.jar


Try this and see if it works for you.


From: Y. Dong [mailto:tq00...@gmail.com]
Sent: Tuesday, September 16, 2014 8:18 AM
To: user@spark.apache.org
Subject: HBase and non-existent TableInputFormat

Hello,

I'm currently using spark-core 1.1 and hbase 0.98.5 and I want to simply read 
from hbase. The Java code is attached. However the problem is TableInputFormat 
does not even exist in hbase-client API, is there any other way I can read from
hbase? Thanks

SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);


Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");


JavaPairRDD hBaseRDD = sc.newAPIHadoopRDD(conf, 
TableInputFormat.class,org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);





Re: scala 2.11?

2014-09-16 Thread Mohit Jaggi
Can I load that plugin in spark-shell? Or perhaps due the 2-phase
compilation quasiquotes won't work in shell?

On Mon, Sep 15, 2014 at 7:15 PM, Mark Hamstra 
wrote:

> Okay, that's consistent with what I was expecting.  Thanks, Matei.
>
> On Mon, Sep 15, 2014 at 5:20 PM, Matei Zaharia 
> wrote:
>
>> I think the current plan is to put it in 1.2.0, so that's what I meant by
>> "soon". It might be possible to backport it too, but I'd be hesitant to do
>> that as a maintenance release on 1.1.x and 1.0.x since it would require
>> nontrivial changes to the build that could break things on Scala 2.10.
>>
>> Matei
>>
>> On September 15, 2014 at 12:19:04 PM, Mark Hamstra (
>> m...@clearstorydata.com) wrote:
>>
>> Are we going to put 2.11 support into 1.1 or 1.0?  Else "will be in
>> soon" applies to the master development branch, but actual inclusion in the Spark
>> 1.2.0 release won't occur until the second half of November at the earliest.
>>
>> On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia 
>> wrote:
>>
>>>  Scala 2.11 work is under way in open pull requests though, so
>>> hopefully it will be in soon.
>>>
>>>  Matei
>>>
>>> On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com)
>>> wrote:
>>>
>>>  ah...thanks!
>>>
>>> On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra 
>>> wrote:
>>>
 No, not yet.  Spark SQL is using org.scalamacros:quasiquotes_2.10.

 On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi 
 wrote:

> Folks,
> I understand Spark SQL uses quasiquotes. Does that mean Spark has now
> moved to Scala 2.11?
>
> Mohit.
>


>>>
>>
>


Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Ted Yu
hbase-client module serves client facing APIs.
hbase-server module is supposed to host classes used on server side.

There is still some work to be done so that the above goal is achieved.

On Tue, Sep 16, 2014 at 9:06 AM, Y. Dong  wrote:

> Thanks Ted. It is indeed in hbase-server. Just curious, what’s the
> difference between hbase-client and hbase-server?
>
> On 16 Sep 2014, at 17:01, Ted Yu  wrote:
>
> bq. TableInputFormat does not even exist in hbase-client API
>
> It is in hbase-server module.
>
> Take a look at http://hbase.apache.org/book.html#mapreduce.example.read
>
> On Tue, Sep 16, 2014 at 8:18 AM, Y. Dong  wrote:
>
>> Hello,
>>
>> I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply
>> read from hbase. The Java code is attached. However the problem is
>> TableInputFormat does not even exist in hbase-client API, is there any
>> other way I can read from
>> hbase? Thanks
>>
>> SparkConf sconf = *new* SparkConf().setAppName(“App").setMaster("local");
>> JavaSparkContext sc = *new* JavaSparkContext(sconf);
>>
>> Configuration conf = HBaseConfiguration.*create*();
>> conf.set(TableInputFormat.INPUT_TABLE, "Article");
>>
>> JavaPairRDD hBaseRDD = 
>> sc.newAPIHadoopRDD(conf,
>> TableInputFormat.*class*
>> ,org.apache.hadoop.hbase.io.ImmutableBytesWritable.*class*,
>> org.apache.hadoop.hbase.client.Result.*class*);
>>
>>
>>
>>
>
>


Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Ted Yu
bq. TableInputFormat does not even exist in hbase-client API

It is in hbase-server module.

Take a look at http://hbase.apache.org/book.html#mapreduce.example.read

On Tue, Sep 16, 2014 at 8:18 AM, Y. Dong  wrote:

> Hello,
>
> I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply
> read from hbase. The Java code is attached. However the problem is
> TableInputFormat does not even exist in hbase-client API, is there any
> other way I can read from
> hbase? Thanks
>
> SparkConf sconf = *new* SparkConf().setAppName(“App").setMaster("local");
> JavaSparkContext sc = *new* JavaSparkContext(sconf);
>
>
> Configuration conf = HBaseConfiguration.*create*();
> conf.set(TableInputFormat.INPUT_TABLE, "Article");
>
>
> JavaPairRDD hBaseRDD = 
> sc.newAPIHadoopRDD(conf,
> TableInputFormat.*class*
> ,org.apache.hadoop.hbase.io.ImmutableBytesWritable.*class*,
> org.apache.hadoop.hbase.client.Result.*class*);
>
>
>
>


Re: combineByKey throws ClassCastException

2014-09-16 Thread Tao Xiao
This problem was caused by the fact that I used a package jar with a Spark
version (0.9.1) different from that of the cluster (0.9.0). When I used the
correct package jar
(spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar) instead the
application can run as expected.



2014-09-15 14:57 GMT+08:00 x :

> How about this.
>
> scala> val rdd2 = rdd.combineByKey(
>  | (v: Int) => v.toLong,
>  | (c: Long, v: Int) => c + v,
>  | (c1: Long, c2: Long) => c1 + c2)
> rdd2: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[9] at
> combineB
> yKey at :14
>
> xj @ Tokyo
>
> On Mon, Sep 15, 2014 at 3:06 PM, Tao Xiao 
> wrote:
>
>> I followed an example presented in the tutorial Learning Spark
>> 
>> to compute the per-key average as follows:
>>
>>
>> val Array(appName) = args
>> val sparkConf = new SparkConf()
>> .setAppName(appName)
>> val sc = new SparkContext(sparkConf)
>> /*
>>  * compute the per-key average of values
>>  * results should be:
>>  *A : 5.8
>>  *B : 14
>>  *C : 60.6
>>  */
>> val rdd = sc.parallelize(List(
>> ("A", 3), ("A", 9), ("A", 12), ("A", 0), ("A", 5),
>> ("B", 4), ("B", 10), ("B", 11), ("B", 20), ("B", 25),
>> ("C", 32), ("C", 91), ("C", 122), ("C", 3), ("C", 55)), 2)
>> val avg = rdd.combineByKey(
>> (x:Int) => (x, 1),  // java.lang.ClassCastException: scala.Tuple2$mcII$sp
>> cannot be cast to java.lang.Integer
>> (acc:(Int, Int), x) => (acc._1 + x, acc._2 + 1),
>> (acc1:(Int, Int), acc2:(Int, Int)) => (acc1._1 + acc2._1, acc1._2 +
>> acc2._2))
>> .map{case (s, t) => (s, t._1/t._2.toFloat)}
>>  avg.collect.foreach(t => println(t._1 + " ->" + t._2))
>>
>>
>>
>> When I submitted the application, an exception of 
>> "*java.lang.ClassCastException:
>> scala.Tuple2$mcII$sp cannot be cast to java.lang.Integer*" was thrown
>> out. The tutorial said that the first function of *combineByKey*, *(x:Int)
>> => (x, 1)*, should take a single element in the source RDD and return an
>> element of the desired type in the resulting RDD. In my application, we
>> take a single element of type *Int *from the source RDD and return a
>> tuple of type (*Int*, *Int*), which meets the requirements quite well.
>> But why would such an exception be thrown?
>>
>> I'm using CDH 5.0 and Spark 0.9
>>
>> Thanks.
>>
>>
>>
>


HBase and non-existent TableInputFormat

2014-09-16 Thread Y. Dong
Hello,

I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply read 
from hbase. The Java code is attached. However the problem is TableInputFormat 
does not even exist in hbase-client API, is there any other way I can read from
hbase? Thanks

SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf); 

Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");

JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf, 
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);





collect on hadoopFile RDD returns wrong results

2014-09-16 Thread vasiliy
Hello. I have a hadoopFile RDD and I tried to collect its items to the driver
program, but it returns an array of identical records (all equal to the last
record of my file). My code is like this:

val rdd = sc.hadoopFile(
  "hdfs:///data.avro",
  classOf[org.apache.avro.mapred.AvroInputFormat[MyAvroRecord]],
  classOf[org.apache.avro.mapred.AvroWrapper[MyAvroRecord]],
  classOf[org.apache.hadoop.io.NullWritable],
  10)


val collectedData = rdd.collect()

for (s <- collectedData){
   println(s)
}

it prints wrong data. But rdd.foreach(println) works as expected. 

What is wrong with my code, and how can I collect the hadoop RDD (actually
I want to collect only parts of it) to the driver program?
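
One guess about the cause (I have not verified this): Hadoop input formats reuse the
same Writable/AvroWrapper object for every record, so collect() may end up holding
many references to that single reused object, i.e. the last record. Copying each
record into an immutable value before collecting seems to be the usual way around it,
e.g. (rough sketch):

// Copy each record out of the reused AvroWrapper before bringing it to the driver.
val copied = rdd.map { case (wrapper, _) => wrapper.datum().toString }
copied.take(10).foreach(println)   // take() also lets me grab only a part of the RDD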




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/collect-on-hadoopFile-RDD-returns-wrong-results-tp14368.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark as a Library

2014-09-16 Thread Ruebenacker, Oliver A

 Hello,

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 Best, Oliver

Oliver Ruebenacker | Solutions Architect

Altisource(tm)
290 Congress St, 7th Floor | Boston, Massachusetts 02210
P: (617) 728-5582 | ext: 275585
oliver.ruebenac...@altisource.com | 
www.Altisource.com

***

This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***


org.apache.spark.SparkException: java.io.FileNotFoundException: does not exist)

2014-09-16 Thread Hui Li
Hi,

I am new to SPARK. I just set up a small cluster and wanted to run some
simple MLLIB examples. By following the instructions of
https://spark.apache.org/docs/0.9.0/mllib-guide.html#binary-classification-1,
I could successfully run everything until the step of SVMWithSGD, I got
error the following message. I don't know why the
 file:/root/test/sample_svm_data.txt does not exist since I already read it
out, printed it and converted into the labeled data and passed the parsed
data to the function SvmWithSGD.

Any one have the same issue with me?

Thanks,

Emily

 val model = SVMWithSGD.train(parsedData, numIterations)
14/09/16 10:55:21 INFO SparkContext: Starting job: first at
GeneralizedLinearAlgorithm.scala:121
14/09/16 10:55:21 INFO DAGScheduler: Got job 11 (first at
GeneralizedLinearAlgorithm.scala:121) with 1 output partitions
(allowLocal=true)
14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 11 (first at
GeneralizedLinearAlgorithm.scala:121)
14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
14/09/16 10:55:21 INFO DAGScheduler: Computing the requested partition
locally
14/09/16 10:55:21 INFO HadoopRDD: Input split:
file:/root/test/sample_svm_data.txt:0+19737
14/09/16 10:55:21 INFO SparkContext: Job finished: first at
GeneralizedLinearAlgorithm.scala:121, took 0.002697478 s
14/09/16 10:55:21 INFO SparkContext: Starting job: count at
DataValidators.scala:37
14/09/16 10:55:21 INFO DAGScheduler: Got job 12 (count at
DataValidators.scala:37) with 2 output partitions (allowLocal=false)
14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 12 (count at
DataValidators.scala:37)
14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
14/09/16 10:55:21 INFO DAGScheduler: Submitting Stage 12 (FilteredRDD[26]
at filter at DataValidators.scala:37), which has no missing parents
14/09/16 10:55:21 INFO DAGScheduler: Submitting 2 missing tasks from Stage
12 (FilteredRDD[26] at filter at DataValidators.scala:37)
14/09/16 10:55:21 INFO TaskSchedulerImpl: Adding task set 12.0 with 2 tasks
14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:0 as TID 24 on
executor 2: eecvm0206.demo.sas.com (PROCESS_LOCAL)
14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:0 as 1733 bytes
in 0 ms
14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:1 as TID 25 on
executor 5: eecvm0203.demo.sas.com (PROCESS_LOCAL)
14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:1 as 1733 bytes
in 0 ms
14/09/16 10:55:21 WARN TaskSetManager: Lost TID 24 (task 12.0:0)
14/09/16 10:55:21 WARN TaskSetManager: Loss was due to
java.io.FileNotFoundException
java.io.FileNotFoundException: File file:/root/test/sample_svm_data.txt
does not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at
org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:156)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at
org.apache.spark.deploy.SparkHadoop

java.util.NoSuchElementException: key not found

2014-09-16 Thread Brad Miller
Hi All,

I suspect I am experiencing a bug. I've noticed that while running
larger jobs, they occasionally die with the exception
"java.util.NoSuchElementException: key not found xyz", where "xyz"
denotes the ID of some particular task.  I've excerpted the log from
one job that died in this way below and attached the full log for
reference.

I suspect that my bug is the same as SPARK-2002 (linked below).  Is
there any reason to suspect otherwise?  Is there any known workaround
other than not coalescing?
https://issues.apache.org/jira/browse/SPARK-2002
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3CCAMwrk0=d1dww5fdbtpkefwokyozltosbbjqamsqqjowlzng...@mail.gmail.com%3E

Note that I have been coalescing SchemaRDDs using "srdd =
SchemaRDD(srdd._jschema_rdd.coalesce(partitions, False, None),
sqlCtx)", the workaround described in this thread.
http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3ccanr-kkciei17m43-yz5z-pj00zwpw3ka_u7zhve2y7ejw1v...@mail.gmail.com%3E

...
14/09/15 21:43:14 INFO scheduler.TaskSetManager: Starting task 78.0 in
stage 551.0 (TID 78738, bennett.research.intel-research.net,
PROCESS_LOCAL, 1056 bytes)
...
14/09/15 21:43:15 INFO storage.BlockManagerInfo: Added
taskresult_78738 in memory on
bennett.research.intel-research.net:38074 (size: 13.0 MB, free: 1560.8
MB)
...
14/09/15 21:43:15 ERROR scheduler.TaskResultGetter: Exception while
getting task result
java.util.NoSuchElementException: key not found: 78738
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.scheduler.TaskSetManager.handleTaskGettingResult(TaskSetManager.scala:500)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.handleTaskGettingResult(TaskSchedulerImpl.scala:348)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:52)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)


I am running the pre-compiled 1.1.0 binaries.

best,
-Brad

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Serving data

2014-09-16 Thread Yana Kadiyska
If your dashboard is doing ajax/pull requests against say a REST API you
can always create a Spark context in your rest service and use SparkSQL to
query over the parquet files. The parquet files are already on disk so it
seems silly to write both to parquet and to a DB...unless I'm missing
something in your setup.
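
As a rough sketch (the paths, table and column names here are made up), the REST
service could do something like:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Placeholder setup inside the REST service; point the master at your cluster.
val sc = new SparkContext(new SparkConf()
  .setAppName("dashboard-api")
  .setMaster("local[*]"))                 // or spark://your-master:7077
val sqlContext = new SQLContext(sc)

val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
events.registerTempTable("events")

val top = sqlContext.sql(
  "SELECT userId, COUNT(*) AS cnt FROM events GROUP BY userId ORDER BY cnt DESC LIMIT 10")
top.collect()   // small result, easy to serialize back to the dashboard as JSON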

On Tue, Sep 16, 2014 at 4:18 AM, Marius Soutier  wrote:

> Writing to Parquet and querying the result via SparkSQL works great
> (except for some strange SQL parser errors). However the problem remains,
> how do I get that data back to a dashboard. So I guess I’ll have to use a
> database after all.
>
>
>>> You can batch up data & store it into parquet partitions as well, & query
>>> it using another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I
>>> believe.

>>>


Re: Reduce Tuple2<Integer, Integer> to Tuple2<Integer, List<Integer>>

2014-09-16 Thread Sean Owen
If you mean you have (key, value) pairs and want, for each key, that key paired
with all of the values for that key, then you're looking for groupByKey.
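
For example (a Scala sketch of the idea, assuming an existing SparkContext sc such as
the spark-shell; the Java API's JavaPairRDD has an equivalent groupByKey() that
returns each key with an Iterable of its values):

import org.apache.spark.SparkContext._   // pair RDD functions in Spark 1.x

val pairs = sc.parallelize(Seq((1, 10), (1, 20), (2, 30)))
val grouped = pairs.groupByKey()          // RDD[(Int, Iterable[Int])]
grouped.mapValues(_.toList).collect()     // e.g. Array((1, List(10, 20)), (2, List(30)))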

On Tue, Sep 16, 2014 at 2:42 PM, Tom  wrote:
> From my map function I create Tuple2<Integer, Integer> pairs. Now I want to
> reduce them, and get something like Tuple2<Integer, List<Integer>>.
>
> The only way I found to do this was by treating all variables as String, and
> in the reduceByKey do
> /return a._2 + "," + b._2/ //in which both are numeric values saved in a
> String
> After which I do a Arrays.asList(string.split(",")) in mapValues. This
> leaves me with <Integer, List<String>>. So now I am looking for either
> - A function with which I can transform <Integer, List<String>> to
> <Integer, List<Integer>>
> or
> - A way to reduce Tuple2<Integer, Integer> into a Tuple2<Integer,
> List<Integer>> in the reduceByKey function so that I can use Integers all
> the way
>
> Of course option two would have preferences.
>
> Thanks!
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Reduce-Tuple2-Integer-Integer-to-Tuple2-Integer-List-Integer-tp14361.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Reduce Tuple2<Integer, Integer> to Tuple2<Integer, List<Integer>>

2014-09-16 Thread Tom
From my map function I create Tuple2<Integer, Integer> pairs. Now I want to
reduce them, and get something like Tuple2<Integer, List<Integer>>. 

The only way I found to do this was by treating all variables as String, and
in the reduceByKey do
/return a._2 + "," + b._2/ //in which both are numeric values saved in a
String
After which I do a Arrays.asList(string.split(",")) in mapValues. This
leaves me with <Integer, List<String>>. So now I am looking for either
- A function with which I can transform <Integer, List<String>> to
<Integer, List<Integer>>
or
- A way to reduce Tuple2<Integer, Integer> into a Tuple2<Integer,
List<Integer>> in the reduceByKey function so that I can use Integers all
the way

Of course option two would have preferences.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reduce-Tuple2-Integer-Integer-to-Tuple2-Integer-List-Integer-tp14361.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
I dug a bit more and the executor ID is a number, so it seems there is no
possible workaround.

Looking at the code of the CoarseGrainedSchedulerBackend.scala:

https://github.com/apache/spark/blob/6324eb7b5b0ae005cb2e913e36b1508bd6f1b9b8/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L86

It seems there is only one DriverActor and, as the RegisterExecutor message
only contains the executorID, there is a collision.

I wonder if what I'm doing is wrong... basically, from the same scala
application I...

1. create one actor per job.
2. send a message to each actor with configuration details to create a
SparkContext/StreamingContext.
3. send a message to each actor to start the job and streaming context.

2014-09-16 13:29 GMT+01:00 Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com>:

> When I said scheduler I meant executor backend.
>
> 2014-09-16 13:26 GMT+01:00 Luis Ángel Vicente Sánchez <
> langel.gro...@gmail.com>:
>
>> It seems that, as I have a single scala application, the scheduler is the
>> same and there is a collision between executors of both spark context. Is
>> there a way to change how the executor ID is generated (maybe an uuid
>> instead of a sequential number..?)
>>
>> 2014-09-16 13:07 GMT+01:00 Luis Ángel Vicente Sánchez <
>> langel.gro...@gmail.com>:
>>
>>> I have a standalone spark cluster and from within the same scala
>>> application I'm creating 2 different spark context to run two different
>>> spark streaming jobs as SparkConf is different for each of them.
>>>
>>> I'm getting this error that... I don't really understand:
>>>
>>> 14/09/16 11:51:35 ERROR OneForOneStrategy: spark.httpBroadcast.uri
 java.util.NoSuchElementException: spark.httpBroadcast.uri
at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:58)
at org.apache.spark.SparkConf.get(SparkConf.scala:149)
at 
 org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:130)
at 
 org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
at 
 org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
at 
 org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
at org.apache.spark.executor.Executor.(Executor.scala:85)
at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:59)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/09/16 11:51:35 ERROR CoarseGrainedExecutorBackend: Slave registration 
 failed: Duplicate executor ID: 10


>>>  Both apps has different names so I don't get how the executor ID is not
>>> unique :-S
>>>
>>> This happens on startup and most of the time only one job dies, but
>>> sometimes both of them die.
>>>
>>> Regards,
>>>
>>> Luis
>>>
>>
>>
>


Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
When I said scheduler I meant executor backend.

2014-09-16 13:26 GMT+01:00 Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com>:

> It seems that, as I have a single scala application, the scheduler is the
> same and there is a collision between executors of both spark context. Is
> there a way to change how the executor ID is generated (maybe an uuid
> instead of a sequential number..?)
>
> 2014-09-16 13:07 GMT+01:00 Luis Ángel Vicente Sánchez <
> langel.gro...@gmail.com>:
>
>> I have a standalone spark cluster and from within the same scala
>> application I'm creating 2 different spark context to run two different
>> spark streaming jobs as SparkConf is different for each of them.
>>
>> I'm getting this error that... I don't really understand:
>>
>> 14/09/16 11:51:35 ERROR OneForOneStrategy: spark.httpBroadcast.uri
>>> java.util.NoSuchElementException: spark.httpBroadcast.uri
>>> at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
>>> at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
>>> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>>> at scala.collection.AbstractMap.getOrElse(Map.scala:58)
>>> at org.apache.spark.SparkConf.get(SparkConf.scala:149)
>>> at 
>>> org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:130)
>>> at 
>>> org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
>>> at 
>>> org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
>>> at 
>>> org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35)
>>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
>>> at org.apache.spark.executor.Executor.(Executor.scala:85)
>>> at 
>>> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:59)
>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>> at 
>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at 
>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> at 
>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> at 
>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> 14/09/16 11:51:35 ERROR CoarseGrainedExecutorBackend: Slave registration 
>>> failed: Duplicate executor ID: 10
>>>
>>>
>>  Both apps has different names so I don't get how the executor ID is not
>> unique :-S
>>
>> This happens on startup and most of the time only one job dies, but
>> sometimes both of them die.
>>
>> Regards,
>>
>> Luis
>>
>
>


Re: PySpark on Yarn - how group by data properly

2014-09-16 Thread Oleg Ruchovets
I am expanding my data set and executing pyspark on yarn:
   I noticed that only 2 processes processed the data:

14210 yarn  20   0 2463m 2.0g 9708 R 100.0  4.3   8:22.63 python2.7
32467 yarn  20   0 2519m 2.1g 9720 R 99.3  4.4   7:16.97   python2.7


*Question:*
   *how to configure pyspark to have more processes to process the data?*


Here is my command :

  [hdfs@UCS-MASTER cad]$
/usr/lib/spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563/bin/spark-submit
--master yarn  --num-executors 12  --driver-memory 4g --executor-memory 2g
--py-files tad.zip --executor-cores 4   /usr/lib/cad/PrepareDataSetYarn.py
/input/tad/data.csv /output/cad_model_500_1

I tried to play with num-executors and executor-cores, but there are still only 2
python processes doing the job. I have a 5-machine cluster with 32 GB RAM.

console output:


14/09/16 20:07:34 INFO spark.SecurityManager: Changing view acls to: hdfs
14/09/16 20:07:34 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(hdfs)
14/09/16 20:07:34 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/09/16 20:07:35 INFO Remoting: Starting remoting
14/09/16 20:07:35 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@UCS-MASTER.sms1.local:39379]
14/09/16 20:07:35 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@UCS-MASTER.sms1.local:39379]
14/09/16 20:07:35 INFO spark.SparkEnv: Registering MapOutputTracker
14/09/16 20:07:35 INFO spark.SparkEnv: Registering BlockManagerMaster
14/09/16 20:07:35 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-local-20140916200735-53f6
14/09/16 20:07:35 INFO storage.MemoryStore: MemoryStore started with
capacity 2.3 GB.
14/09/16 20:07:35 INFO network.ConnectionManager: Bound socket to port
37255 with id = ConnectionManagerId(UCS-MASTER.sms1.local,37255)
14/09/16 20:07:35 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/09/16 20:07:35 INFO storage.BlockManagerInfo: Registering block manager
UCS-MASTER.sms1.local:37255 with 2.3 GB RAM
14/09/16 20:07:35 INFO storage.BlockManagerMaster: Registered BlockManager
14/09/16 20:07:35 INFO spark.HttpServer: Starting HTTP Server
14/09/16 20:07:35 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/16 20:07:35 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:55286
14/09/16 20:07:35 INFO broadcast.HttpBroadcast: Broadcast server started at
http://10.193.218.2:55286
14/09/16 20:07:35 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-ca8193be-9148-4e7e-a0cc-4b6e7cb72172
14/09/16 20:07:35 INFO spark.HttpServer: Starting HTTP Server
14/09/16 20:07:35 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/16 20:07:35 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:38065
14/09/16 20:07:35 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/16 20:07:35 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/09/16 20:07:35 INFO ui.SparkUI: Started SparkUI at
http://UCS-MASTER.sms1.local:4040
14/09/16 20:07:36 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
--args is deprecated. Use --arg instead.
14/09/16 20:07:36 INFO impl.TimelineClientImpl: Timeline service address:
http://UCS-NODE1.sms1.local:8188/ws/v1/timeline/
14/09/16 20:07:37 INFO client.RMProxy: Connecting to ResourceManager at
UCS-NODE1.sms1.local/10.193.218.3:8050
14/09/16 20:07:37 INFO yarn.Client: Got Cluster metric info from
ApplicationsManager (ASM), number of NodeManagers: 5
14/09/16 20:07:37 INFO yarn.Client: Queue info ... queueName: default,
queueCurrentCapacity: 0.0, queueMaxCapacity: 1.0,
  queueApplicationCount = 0, queueChildQueueCount = 0
14/09/16 20:07:37 INFO yarn.Client: Max mem capabililty of a single
resource in this cluster 53248
14/09/16 20:07:37 INFO yarn.Client: Preparing Local resources
14/09/16 20:07:37 WARN hdfs.BlockReaderLocal: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
14/09/16 20:07:37 INFO yarn.Client: Uploading
file:/usr/lib/spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563/lib/spark-assembly-1.0.1.2.1.3.0-563-hadoop2.4.0.2.1.3.0-563.jar
to
hdfs://UCS-MASTER.sms1.local:8020/user/hdfs/.sparkStaging/application_1409564765875_0046/spark-assembly-1.0.1.2.1.3.0-563-hadoop2.4.0.2.1.3.0-563.jar
14/09/16 20:07:39 INFO yarn.Client: Uploading
file:/usr/lib/cad/PrepareDataSetYarn.py to
hdfs://UCS-MASTER.sms1.local:8020/user/hdfs/.sparkStaging/application_1409564765875_0046/PrepareDataSetYarn.py
14/09/16 20:07:39 INFO yarn.Client: Uploading file:/usr/lib/cad/tad.zip to
hdfs://UCS-MASTER.sms1.local:8020/user/hdfs/.sparkStaging/application_1409564765875_0046/tad.zip
14/09/16 20:07:39 INFO yarn.Client: Setting up the launch environment
14/09/16 20:07:39 INFO yarn.Client: Setting up container launch context
14/09/16 20:07:39 INFO yarn.Client: Command for starting the Spark
ApplicationMaster: List($JAVA_HOME/bin/j
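
One thing worth checking besides num-executors: the number of tasks that run
concurrently (and hence the number of python worker processes) is capped by the
number of partitions of the RDD being processed, so an input file that only
yields two splits will keep only two cores busy no matter how many executors
are requested. A minimal Scala sketch of the usual knobs, with arbitrary
partition counts; PySpark has equivalents (sc.textFile(path, minPartitions)
and, in recent releases, rdd.repartition(n)):

import org.apache.spark.{SparkConf, SparkContext}

object PartitionCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-count"))

    // Ask for more input splits up front (hypothetical path and counts).
    val lines = sc.textFile("/input/tad/data.csv", minPartitions = 48)
    println(s"partitions after load: ${lines.partitions.length}")

    // Or reshuffle an existing RDD into more partitions before the heavy work.
    val spread = lines.repartition(48)
    println(s"partitions after repartition: ${spread.partitions.length}")

    sc.stop()
  }
}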

Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
It seems that, as I have a single scala application, the scheduler is the
same and there is a collision between executors of both spark contexts. Is
there a way to change how the executor ID is generated (maybe a uuid
instead of a sequential number..?)

2014-09-16 13:07 GMT+01:00 Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com>:

> I have a standalone spark cluster and from within the same scala
> application I'm creating 2 different spark context to run two different
> spark streaming jobs as SparkConf is different for each of them.
>
> I'm getting this error that... I don't really understand:
>
> 14/09/16 11:51:35 ERROR OneForOneStrategy: spark.httpBroadcast.uri
>> java.util.NoSuchElementException: spark.httpBroadcast.uri
>>  at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
>>  at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
>>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>>  at scala.collection.AbstractMap.getOrElse(Map.scala:58)
>>  at org.apache.spark.SparkConf.get(SparkConf.scala:149)
>>  at 
>> org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:130)
>>  at 
>> org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
>>  at 
>> org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
>>  at 
>> org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35)
>>  at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
>>  at org.apache.spark.executor.Executor.(Executor.scala:85)
>>  at 
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:59)
>>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>  at 
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>  at 
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>  at 
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>  at 
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> 14/09/16 11:51:35 ERROR CoarseGrainedExecutorBackend: Slave registration 
>> failed: Duplicate executor ID: 10
>>
>>
>  Both apps has different names so I don't get how the executor ID is not
> unique :-S
>
> This happens on startup and most of the time only one job dies, but
> sometimes both of them die.
>
> Regards,
>
> Luis
>


Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
I have a standalone spark cluster, and from within the same scala
application I'm creating 2 different spark contexts to run two different
spark streaming jobs, as the SparkConf is different for each of them.

I'm getting this error that... I don't really understand:

14/09/16 11:51:35 ERROR OneForOneStrategy: spark.httpBroadcast.uri
> java.util.NoSuchElementException: spark.httpBroadcast.uri
>   at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
>   at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:58)
>   at org.apache.spark.SparkConf.get(SparkConf.scala:149)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:130)
>   at 
> org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
>   at 
> org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
>   at 
> org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
>   at org.apache.spark.executor.Executor.(Executor.scala:85)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:59)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 14/09/16 11:51:35 ERROR CoarseGrainedExecutorBackend: Slave registration 
> failed: Duplicate executor ID: 10
>
>
 Both apps have different names, so I don't get how the executor ID is not
unique :-S

This happens on startup and most of the time only one job dies, but
sometimes both of them die.

Regards,

Luis


Re: How to set executor num on spark on yarn

2014-09-16 Thread Sean Owen
How many cores do your machines have? --executor-cores should be the
number of cores each executor uses. Fewer cores means more executors
in general. From your data, it sounds like, for example, there are 7
nodes with 4+ cores available to YARN, and 2 more nodes with 2-3 cores
available. Hence when you ask for fewer cores per executor, more can
fit.

You may need to increase the number of cores you are letting YARN
manage if this doesn't match your expectation. For example, I'm
guessing your machines have more than 4 cores in reality.

On Tue, Sep 16, 2014 at 6:08 AM, hequn cheng  wrote:
> hi~I want to set the executor number to 16, but it is very strange that
> executor cores may affect executor num on spark on yarn, i don't know why
> and how to set executor number.
> =
> ./bin/spark-submit --class com.hequn.spark.SparkJoins \
> --master yarn-cluster \
> --num-executors 16 \
> --driver-memory 2g \
> --executor-memory 10g \
> --executor-cores 4 \
> /home/sparkjoins-1.0-SNAPSHOT.jar
>
> The UI shows there are 7 executors
> =
> ./bin/spark-submit --class com.hequn.spark.SparkJoins \
> --master yarn-cluster \
> --num-executors 16 \
> --driver-memory 2g \
> --executor-memory 10g \
> --executor-cores 2 \
> /home/sparkjoins-1.0-SNAPSHOT.jar
>
> The UI shows there are 9 executors
> =
> ./bin/spark-submit --class com.hequn.spark.SparkJoins \
> --master yarn-cluster \
> --num-executors 16 \
> --driver-memory 2g \
> --executor-memory 10g \
> --executor-cores 1 \
> /home/sparkjoins-1.0-SNAPSHOT.jar
>
> The UI shows there are 9 executors
> ==
> The cluster contains 16 nodes. Each node 64G RAM.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Sean Owen
From the caller / application perspective, you don't care what version
of Hadoop Spark is running on on the cluster. The Spark API you
compile against is the same. When you spark-submit the app, at
runtime, Spark is using the Hadoop libraries from the cluster, which
are the right version.

So when you build your app, you mark Spark as a 'provided' dependency.
Therefore in general, no, you do not build Spark for yourself if you
are a Spark app creator.

(Of course, your app would care if it were also using Hadoop libraries
directly. In that case, you will want to depend on hadoop-client, and
the right version for your cluster, but still mark it as provided.)

The version Spark is built against only matters when you are deploying
Spark's artifacts on the cluster to set it up.

Your error suggests there is still a version mismatch. Either you
deployed a build that was not compatible, or, maybe you are packaging
a version of Spark with your app which is incompatible and
interfering.

For example, the artifacts you get via Maven depend on Hadoop 1.0.4. I
suspect that's what you're doing -- packaging Spark(+Hadoop1.0.4) with
your app, when it shouldn't be packaged.

Spark works out of the box with just about any modern combo of HDFS and YARN.
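
For example, an sbt build for such an app looks roughly like this; the version
numbers and the CDH repository below are only an illustration, adjust them to
whatever your cluster actually runs:

// build.sbt (sketch)
name := "my-spark-app"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Spark is already on the cluster, so do not bundle it into the app jar.
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  // Only needed if the app uses Hadoop APIs directly; match the cluster's version.
  "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.0.0" % "provided"
)

// CDH artifacts live in Cloudera's repository.
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"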

On Tue, Sep 16, 2014 at 2:28 AM, Paul Wais  wrote:
> Dear List,
>
> I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
> reading SequenceFiles.  In particular, I'm seeing:
>
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
> Server IPC version 7 cannot communicate with client version 4
> at org.apache.hadoop.ipc.Client.call(Client.java:1070)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
> at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
> ...
>
> When invoking JavaSparkContext#newAPIHadoopFile().  (With args
> validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
> BytesWritable.class, new Job().getConfiguration() -- Pretty close to
> the unit test here:
> https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
> )
>
>
> This error indicates to me that Spark is using an old hadoop client to
> do reads.  Oddly I'm able to do /writes/ ok, i.e. I'm able to write
> via JavaPairRdd#saveAsNewAPIHadoopFile() to my hdfs cluster.
>
>
> Do I need to explicitly build spark for modern hadoop??  I previously
> had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
> error (server is using version 9, client is using version 4).
>
>
> I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
> on spark's site:
>  * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
>  * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz
>
>
> What distro of hadoop is used at Data Bricks?  Are there distros of
> Spark 1.1 and hadoop that should work together out-of-the-box?
> (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)
>
> Thanks for any help anybody can give me here!
> -Paul
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 12:23:10 +0200, Yifan LI  wrote:
> but I am wondering if there is a message(none?) sent to the target vertex(the 
> rank change is less than tolerance) in below dynamic page rank implementation,
>
>  def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
>   if (edge.srcAttr._2 > tol) {
> Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
>   } else {
> Iterator.empty
>   }
> }
>
> so, in this case, there is a message, even is none, is still sent? or not?

No, in that case no message is sent, and if all in-edges of a particular vertex 
return Iterator.empty, then the vertex will become inactive in the next 
iteration.

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Yifan LI
Thanks, :)

but I am wondering whether a message (an empty one?) is sent to the target vertex (whose
rank change is less than the tolerance) in the dynamic PageRank implementation below,

 def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
  if (edge.srcAttr._2 > tol) {
Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
  } else {
Iterator.empty
  }
}

so, in this case, is a message, even an empty one, still sent? or not?


Best,
Yifan

On 16 Sep 2014, at 11:48, Ankur Dave  wrote:

> At 2014-09-16 10:55:37 +0200, Yifan LI  wrote:
>> - from [1], and my understanding, the existing inactive feature in graphx 
>> pregel api is “if there is no in-edges, from active vertex, to this vertex, 
>> then we will say this one is inactive”, right?
> 
> Well, that's true when messages are only sent forward along edges (from the 
> source to the destination) and the activeDirection is EdgeDirection.Out. If 
> both of these conditions are true, then a vertex without in-edges cannot 
> receive a message, and therefore its vertex program will never run and a 
> message will never be sent along its out-edges. PageRank is an application 
> that satisfies both the conditions.
> 
>> For instance, there is a graph in which every vertex has at least one 
>> in-edges, then we run static Pagerank on it for 10 iterations. During this 
>> calculation, is there any vertex would be set inactive?
> 
> No: since every vertex always sends a message in static PageRank, if every 
> vertex has an in-edge, it will always receive a message and will remain 
> active.
> 
> In fact, this is why I recently rewrote static PageRank not to use Pregel 
> [3]. Assuming that most vertices do have in-edges, it's unnecessary to track 
> active vertices, which can provide a big savings.
> 
>> - for more “explicit active vertex tracking”, e.g. vote to halt, how to 
>> achieve it in existing api?
>> (I am not sure I got the point of [2], that “vote” function has already been 
>> introduced in graphx pregel api? )
> 
> The current Pregel API effectively makes every vertex vote to halt in every 
> superstep. Therefore only vertices that receive messages get awoken in the 
> next superstep.
> 
> Instead, [2] proposes to make every vertex run by default in every superstep 
> unless it votes to halt *and* receives no messages. This allows a vertex to 
> have more control over whether or not it will run, rather than leaving that 
> entirely up to its neighbors.
> 
> Ankur
> 
>>> [1] 
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Pregel$
>>> [2] https://github.com/apache/spark/pull/1217
> [3] https://github.com/apache/spark/pull/2308



Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 10:55:37 +0200, Yifan LI  wrote:
> - from [1], and my understanding, the existing inactive feature in graphx 
> pregel api is “if there is no in-edges, from active vertex, to this vertex, 
> then we will say this one is inactive”, right?

Well, that's true when messages are only sent forward along edges (from the 
source to the destination) and the activeDirection is EdgeDirection.Out. If 
both of these conditions are true, then a vertex without in-edges cannot 
receive a message, and therefore its vertex program will never run and a 
message will never be sent along its out-edges. PageRank is an application that 
satisfies both the conditions.

> For instance, there is a graph in which every vertex has at least one 
> in-edges, then we run static Pagerank on it for 10 iterations. During this 
> calculation, is there any vertex would be set inactive?

No: since every vertex always sends a message in static PageRank, if every 
vertex has an in-edge, it will always receive a message and will remain active.

In fact, this is why I recently rewrote static PageRank not to use Pregel [3]. 
Assuming that most vertices do have in-edges, it's unnecessary to track active 
vertices, which can provide a big savings.

> - for more “explicit active vertex tracking”, e.g. vote to halt, how to 
> achieve it in existing api?
> (I am not sure I got the point of [2], that “vote” function has already been 
> introduced in graphx pregel api? )

The current Pregel API effectively makes every vertex vote to halt in every 
superstep. Therefore only vertices that receive messages get awoken in the next 
superstep.

Instead, [2] proposes to make every vertex run by default in every superstep 
unless it votes to halt *and* receives no messages. This allows a vertex to 
have more control over whether or not it will run, rather than leaving that 
entirely up to its neighbors.

Ankur

>> [1] 
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Pregel$
>> [2] https://github.com/apache/spark/pull/1217
[3] https://github.com/apache/spark/pull/2308
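
To make the mechanics concrete, here is a minimal single-source shortest paths
sketch (the toy graph is made up for illustration): the activeDirection
argument together with sendMsg returning Iterator.empty is exactly what lets
vertices drop out of later supersteps, as described above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object PregelActiveness {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pregel-activeness").setMaster("local[2]"))

    // Tiny made-up graph: 1 -> 2 -> 3 plus a heavier direct edge 1 -> 3.
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0), Edge(1L, 3L, 5.0)))
    val sourceId: VertexId = 1L

    // Vertex value = current best-known distance from the source.
    val init = Graph.fromEdges(edges, Double.PositiveInfinity)
      .mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val shortest = Pregel(init, Double.PositiveInfinity, activeDirection = EdgeDirection.Out)(
      // vprog: runs only on vertices that received a message (and on all vertices in superstep 0)
      (id, dist, newDist) => math.min(dist, newDist),
      // sendMsg: returning Iterator.empty is what lets the destination vertex go inactive
      triplet =>
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else
          Iterator.empty,
      // mergeMsg: keep the smallest candidate distance
      (a, b) => math.min(a, b)
    )

    shortest.vertices.collect().foreach(println)
    sc.stop()
  }
}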

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Yifan LI
Dear Ankur,

Thanks! :)

- from [1], and my understanding, the existing inactive feature in graphx 
pregel api is “if there is no in-edges, from active vertex, to this vertex, 
then we will say this one is inactive”, right?

For instance, there is a graph in which every vertex has at least one in-edges, 
then we run static Pagerank on it for 10 iterations. During this calculation, 
is there any vertex would be set inactive?


- for more “explicit active vertex tracking”, e.g. vote to halt, how to achieve 
it in existing api?
(I am not sure I got the point of [2], that “vote” function has already been 
introduced in graphx pregel api? )


Best,
Yifan LI

On 15 Sep 2014, at 23:07, Ankur Dave  wrote:

> At 2014-09-15 16:25:04 +0200, Yifan LI  wrote:
>> I am wondering if the vertex active/inactive(corresponding the change of its 
>> value between two supersteps) feature is introduced in Pregel API of GraphX?
> 
> Vertex activeness in Pregel is controlled by messages: if a vertex did not 
> receive a message in the previous iteration, its vertex program will not run 
> in the current iteration. Also, inactive vertices will not be able to send 
> messages because by default the sendMsg function will only be run on edges 
> where at least one of the adjacent vertices received a message. You can 
> change this behavior -- see the documentation for the activeDirection 
> parameter to Pregel.apply [1].
> 
> There is also an open pull request to make active vertex tracking more 
> explicit by allowing vertices to vote to halt directly [2].
> 
> Ankur
> 
> [1] 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Pregel$
> [2] https://github.com/apache/spark/pull/1217



RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread Cheng, Hao
Thank you for pasting the steps. I will look at this and hopefully come up with a
solution soon.

-Original Message-
From: linkpatrickliu [mailto:linkpatrick...@live.com] 
Sent: Tuesday, September 16, 2014 3:17 PM
To: u...@spark.incubator.apache.org
Subject: RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

Hi, Hao Cheng.

I have done other tests. And the result shows the thriftServer can connect to 
Zookeeper.

However, I found some more interesting things. And I think I have found a bug!

Test procedure:
Test1:
(0) Use beeline to connect to thriftServer.
(1) Switch database "use dw_op1"; (OK)
The logs show that the thriftServer connected with Zookeeper and acquired locks.

(2) Drop table "drop table src;" (Blocked) The logs show that the thriftServer 
is "acquireReadWriteLocks".

Doubt:
The reason why I cannot drop table src is that the first SQL "use dw_op1"
has left locks in Zookeeper that were not successfully released.
So when the second SQL tries to acquire locks in Zookeeper, it will block.

Test2:
Restart thriftServer.
Instead of switching to another database, I just drop the table in the default 
database;
(0) Restart thriftServer & use beeline to connect to thriftServer.
(1) Drop table "drop table src"; (OK)
Amazing! Succeed!
(2) Drop again!  "drop table src2;" (Blocked) Same error: the thriftServer is 
blocked in the "acquireReadWriteLocks"
phase.

As you can see, only the first SQL requiring locks can succeed.
So I think the reason is that the thriftServer cannot release locks correctly 
in Zookeeper.









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Serving data

2014-09-16 Thread Marius Soutier
Writing to Parquet and querying the result via SparkSQL works great (except for
some strange SQL parser errors). However, the problem remains: how do I get that
data back to a dashboard? So I guess I'll have to use a database after all.


You can batch up data and store it into Parquet partitions as well, and query it using
another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I believe.
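
A rough sketch of that pattern with the Spark 1.1 API (paths and the table name
are made up): one job writes its results out as Parquet, and a separate
SQLContext can later load those files and register them as a table to query.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetServing {
  case class Event(id: Long, score: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-serving").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    // Batch side: write results out as Parquet (hypothetical path).
    val events = sc.parallelize(Seq(Event(1L, 0.5), Event(2L, 1.5)))
    events.saveAsParquetFile("/tmp/events.parquet")

    // Serving side: load the Parquet files and expose them as a queryable table.
    val loaded = sqlContext.parquetFile("/tmp/events.parquet")
    loaded.registerTempTable("events")
    sqlContext.sql("SELECT id, score FROM events WHERE score > 1.0").collect().foreach(println)

    sc.stop()
  }
}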


Re: SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
sorry for the disturbance, please ignore this mail.
In the end, I found it was slow because of a lack of memory on my machine..

sorry again.

On Tue, Sep 16, 2014 at 3:26 PM, 诺铁  wrote:

> I connect my sample project to a hosted CI service, it only takes 3
> seconds to run there...while the same tests takes 2minutes on my macbook
> pro.  so maybe this is a mac os specific problem?
>
> On Tue, Sep 16, 2014 at 3:06 PM, 诺铁  wrote:
>
>> hi,
>>
>> I am trying to write some unit test, following spark programming guide
>> 
>> .
>> but I observed unit test runs very slow(the code is just a SparkPi),  so
>> I turn log level to trace and look through the log output.  and found
>> creation of SparkContext seems to take most time.
>>
>> there are some action take a lot of time:
>> 1, seems starting jetty
>> 14:04:55.038 [ScalaTest-run-running-MySparkPiSpec] DEBUG
>> o.e.jetty.servlet.ServletHandler -
>> servletNameMap={org.apache.spark.ui.JettyUtils$$anon$1-672f11c2=org.apache.spark.ui.JettyUtils$$anon$1-672f11c2}
>> 14:05:25.121 [ScalaTest-run-running-MySparkPiSpec] DEBUG
>> o.e.jetty.util.component.Container - Container
>> org.eclipse.jetty.server.Server@e3cee7b +
>> SelectChannelConnector@0.0.0.0:4040 as connector
>>
>> 2, I don't know what's this
>> 14:05:25.202 [ScalaTest-run-running-MySparkPiSpec] DEBUG
>> org.apache.hadoop.security.Groups - Group mapping
>> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
>> cacheTimeout=30
>> 14:05:54.594 [spark-akka.actor.default-dispatcher-4] TRACE
>> o.a.s.s.BlockManagerMasterActor - Checking for hosts with no recent heart
>> beats in BlockManagerMaster.
>>
>>
>> are there any way to make this faster for unit test?
>> I also notice in spark's unit test, that there exists a
>> SharedSparkContext
>> ,
>> is this the suggested way to work around this problem? if so, I would
>> suggest to document this in programming guide.
>>
>
>


Re: Spark SQL Thrift JDBC server deployment for production

2014-09-16 Thread vasiliy
it works, thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thrift-JDBC-server-deployment-for-production-tp13947p14345.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Christian Chua
Is 1.0.8 working for you?

You indicated your last known good version is 1.0.0

Maybe we can track down where it broke. 



> On Sep 16, 2014, at 12:25 AM, Paul Wais  wrote:
> 
> Thanks Christian!  I tried compiling from source but am still getting the 
> same hadoop client version error when reading from HDFS.  Will have to poke 
> deeper... perhaps I've got some classpath issues.  FWIW I compiled using:
> 
> $ MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" mvn 
> -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package 
> 
> and hadoop 2.3 / cdh5 from 
> http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz
> 
> 
> 
> 
> 
>> On Mon, Sep 15, 2014 at 6:49 PM, Christian Chua  wrote:
>> Hi Paul.
>> 
>> I would recommend building your own 1.1.0 distribution.
>> 
>>  ./make-distribution.sh --name hadoop-personal-build-2.4 --tgz -Pyarn 
>> -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
>> 
>> 
>> 
>> I downloaded the "Pre-build for Hadoop 2.4" binary, and it had this strange 
>> behavior where
>> 
>>  spark-submit --master yarn-cluster ...
>> 
>> will work, but
>> 
>>  spark-submit --master yarn-client ...
>> 
>> will fail.
>> 
>> 
>> But on the personal build obtained from the command above, both will then 
>> work.
>> 
>> 
>> -Christian
>> 
>> 
>> 
>> 
>>> On Sep 15, 2014, at 6:28 PM, Paul Wais  wrote:
>>> 
>>> Dear List,
>>> 
>>> I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
>>> reading SequenceFiles.  In particular, I'm seeing:
>>> 
>>> Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
>>> Server IPC version 7 cannot communicate with client version 4
>>>at org.apache.hadoop.ipc.Client.call(Client.java:1070)
>>>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>>>at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
>>>...
>>> 
>>> When invoking JavaSparkContext#newAPIHadoopFile().  (With args
>>> validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
>>> BytesWritable.class, new Job().getConfiguration() -- Pretty close to
>>> the unit test here:
>>> https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
>>> )
>>> 
>>> 
>>> This error indicates to me that Spark is using an old hadoop client to
>>> do reads.  Oddly I'm able to do /writes/ ok, i.e. I'm able to write
>>> via JavaPairRdd#saveAsNewAPIHadoopFile() to my hdfs cluster.
>>> 
>>> 
>>> Do I need to explicitly build spark for modern hadoop??  I previously
>>> had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
>>> error (server is using version 9, client is using version 4).
>>> 
>>> 
>>> I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
>>> on spark's site:
>>> * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
>>> * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz
>>> 
>>> 
>>> What distro of hadoop is used at Data Bricks?  Are there distros of
>>> Spark 1.1 and hadoop that should work together out-of-the-box?
>>> (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)
>>> 
>>> Thanks for any help anybody can give me here!
>>> -Paul
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
> 


Re: Broadcast error

2014-09-16 Thread Chengi Liu
Cool.. let me try that. In the meantime, any other suggestion(s) on things I can try?

On Mon, Sep 15, 2014 at 9:59 AM, Davies Liu  wrote:

> I think the 1.1 will be really helpful for you, it's all compatitble
> with 1.0, so it's
> not hard to upgrade to 1.1.
>
> On Mon, Sep 15, 2014 at 2:35 AM, Chengi Liu 
> wrote:
> > So.. same result with parallelize (matrix,1000)
> > with broadcast.. seems like I got jvm core dump :-/
> > 4/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager
> host:47978
> > with 19.2 GB RAM
> > 14/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager
> > host:43360 with 19.2 GB RAM
> > Unhandled exception
> > Unhandled exception
> > Type=Segmentation error vmState=0x
> > J9Generic_Signal_Number=0004 Signal_Number=000b
> Error_Value=
> > Signal_Code=0001
> > Handler1=2BF53760 Handler2=2C3069D0
> > InaccessibleAddress=
> > RDI=2AB9505F2698 RSI=2AABAE2C54D8 RAX=2AB7CE6009A0
> > RBX=2AB7CE6009C0
> > RCX=FFC7FFE0 RDX=2AB8509726A8 R8=7FE41FF0
> > R9=2000
> > R10=2DA318A0 R11=2AB850959520 R12=2AB5EF97DD88
> > R13=2AB5EF97BD88
> > R14=2C0CE940 R15=2AB5EF97BD88
> > RIP= GS= FS= RSP=007367A0
> > EFlags=00210282 CS=0033 RBP=00BCDB00 ERR=0014
> > TRAPNO=000E OLDMASK= CR2=
> > xmm0 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
> > xmm1 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
> > xmm2 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
> > xmm3 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
> > xmm4 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
> > xmm5 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
> > xmm6 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
> > xmm7 f180c714f8e2a139 (f: 4175601920.00, d: -5.462583e+238)
> > xmm8 428e8000 (f: 1116635136.00, d: 5.516911e-315)
> > xmm9  (f: 0.00, d: 0.00e+00)
> > xmm10  (f: 0.00, d: 0.00e+00)
> > xmm11  (f: 0.00, d: 0.00e+00)
> > xmm12  (f: 0.00, d: 0.00e+00)
> > xmm13  (f: 0.00, d: 0.00e+00)
> > xmm14  (f: 0.00, d: 0.00e+00)
> > xmm15  (f: 0.00, d: 0.00e+00)
> > Target=2_60_20140106_181350 (Linux 3.0.93-0.8.2_1.0502.8048-cray_ari_c)
> > CPU=amd64 (48 logical CPUs) (0xfc0c5b000 RAM)
> > --- Stack Backtrace ---
> > (0x2C2FA122 [libj9prt26.so+0x13122])
> > (0x2C30779F [libj9prt26.so+0x2079f])
> > (0x2C2F9E6B [libj9prt26.so+0x12e6b])
> > (0x2C2F9F67 [libj9prt26.so+0x12f67])
> > (0x2C30779F [libj9prt26.so+0x2079f])
> > (0x2C2F9A8B [libj9prt26.so+0x12a8b])
> > (0x2BF52C9D [libj9vm26.so+0x1ac9d])
> > (0x2C30779F [libj9prt26.so+0x2079f])
> > (0x2BF52F56 [libj9vm26.so+0x1af56])
> > (0x2BF96CA0 [libj9vm26.so+0x5eca0])
> > ---
> > JVMDUMP039I
> > JVMDUMP032I
> >
> >
> > Note, this still is with the framesize I modified in the last email
> thread?
> >
> > On Mon, Sep 15, 2014 at 2:12 AM, Akhil Das 
> > wrote:
> >>
> >> Try:
> >>
> >> rdd = sc.broadcast(matrix)
> >>
> >> Or
> >>
> >> rdd = sc.parallelize(matrix,100) // Just increase the number of slices,
> >> give it a try.
> >>
> >>
> >>
> >> Thanks
> >> Best Regards
> >>
> >> On Mon, Sep 15, 2014 at 2:18 PM, Chengi Liu 
> >> wrote:
> >>>
> >>> Hi Akhil,
> >>>   So with your config (specifically with set("spark.akka.frameSize ",
> >>> "1000")) , I see the error:
> >>> org.apache.spark.SparkException: Job aborted due to stage failure:
> >>> Serialized task 0:0 was 401970046 bytes which exceeds
> spark.akka.frameSize
> >>> (10485760 bytes). Consider using broadcast variables for large values.
> >>> at
> >>> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
> >>> at org.apache.spark
> >>>
> >>> So, I changed
> >>> set("spark.akka.frameSize ", "1000") to set("spark.akka.frameSize
> ",
> >>> "10")
> >>> but now I get the same error?
> >>>
> >>> y4j.protocol.Py4JJavaError: An error occurred while calling
> >>> o28.trainKMeansModel.
> >>> : org.apache.spark.SparkException: Job aborted due to stage failure:
> All
> >>> masters are unresponsive! Giving up.
> >>> at
> >>> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
> >>> at
> >>>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
> >>> at
> >>>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
> >>> at
> >>>
> scala.collection.mutable.ResizableArray$class.forea
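
For what it's worth, the pattern the frameSize error message is pointing at is
to keep the large object out of parallelize() and out of the task closure,
broadcast it once, and only ship small data through the scheduler. A rough
Scala sketch with a made-up matrix; PySpark's sc.broadcast and .value behave
the same way.

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]"))

    // A large read-only structure we do NOT want serialized into every task.
    val matrix: Array[Array[Double]] = Array.fill(1000, 1000)(1.0)
    val matrixBc = sc.broadcast(matrix)

    // Only the small row indices travel through the scheduler; executors read the
    // matrix from the broadcast instead of from the task closure.
    val rowSums = sc.parallelize(0 until 1000, numSlices = 100).map { i =>
      matrixBc.value(i).sum
    }

    println(rowSums.reduce(_ + _))
    sc.stop()
  }
}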

Re: SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
I connected my sample project to a hosted CI service; it only takes 3 seconds
to run there... while the same tests take 2 minutes on my MacBook Pro. So
maybe this is a Mac OS specific problem?

On Tue, Sep 16, 2014 at 3:06 PM, 诺铁  wrote:

> hi,
>
> I am trying to write some unit test, following spark programming guide
> 
> .
> but I observed unit test runs very slow(the code is just a SparkPi),  so I
> turn log level to trace and look through the log output.  and found
> creation of SparkContext seems to take most time.
>
> there are some action take a lot of time:
> 1, seems starting jetty
> 14:04:55.038 [ScalaTest-run-running-MySparkPiSpec] DEBUG
> o.e.jetty.servlet.ServletHandler -
> servletNameMap={org.apache.spark.ui.JettyUtils$$anon$1-672f11c2=org.apache.spark.ui.JettyUtils$$anon$1-672f11c2}
> 14:05:25.121 [ScalaTest-run-running-MySparkPiSpec] DEBUG
> o.e.jetty.util.component.Container - Container
> org.eclipse.jetty.server.Server@e3cee7b +
> SelectChannelConnector@0.0.0.0:4040 as connector
>
> 2, I don't know what's this
> 14:05:25.202 [ScalaTest-run-running-MySparkPiSpec] DEBUG
> org.apache.hadoop.security.Groups - Group mapping
> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
> cacheTimeout=30
> 14:05:54.594 [spark-akka.actor.default-dispatcher-4] TRACE
> o.a.s.s.BlockManagerMasterActor - Checking for hosts with no recent heart
> beats in BlockManagerMaster.
>
>
> are there any way to make this faster for unit test?
> I also notice in spark's unit test, that there exists a SharedSparkContext
> ,
> is this the suggested way to work around this problem? if so, I would
> suggest to document this in programming guide.
>


Re: SchemaRDD saveToCassandra

2014-09-16 Thread lmk
Hi Michael,
Please correct me if I am wrong, but the error seems to originate from Spark
itself. Please have a look at the stack trace of the error, which is as
follows:

[error] (run-main-0) java.lang.NoSuchMethodException: Cannot resolve any
suitable constructor for class org.apache.spark.sql.catalyst.expressions.Row
java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for
class org.apache.spark.sql.catalyst.expressions.Row
at
com.datastax.spark.connector.rdd.reader.AnyObjectFactory$$anonfun$resolveConstructor$2.apply(AnyObjectFactory.scala:134)
at
com.datastax.spark.connector.rdd.reader.AnyObjectFactory$$anonfun$resolveConstructor$2.apply(AnyObjectFactory.scala:134)
at scala.util.Try.getOrElse(Try.scala:77)
at
com.datastax.spark.connector.rdd.reader.AnyObjectFactory$.resolveConstructor(AnyObjectFactory.scala:133)
at
com.datastax.spark.connector.mapper.ReflectionColumnMapper.columnMap(ReflectionColumnMapper.scala:46)
at
com.datastax.spark.connector.writer.DefaultRowWriter.(DefaultRowWriter.scala:20)
at
com.datastax.spark.connector.writer.DefaultRowWriter$$anon$2.rowWriter(DefaultRowWriter.scala:109)
at
com.datastax.spark.connector.writer.DefaultRowWriter$$anon$2.rowWriter(DefaultRowWriter.scala:107)
at
com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:171)
at
com.datastax.spark.connector.RDDFunctions.tableWriter(RDDFunctions.scala:76)
at
com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:56)
at ScalaSparkCassandra$.main(ScalaSparkCassandra.scala:66)
at ScalaSparkCassandra.main(ScalaSparkCassandra.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)

Please clarify.
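
One workaround to try while the connector catches up with SchemaRDD: map the
catalyst Rows to plain tuples (or a case class) before calling saveToCassandra,
so the connector sees a type it can build a column mapping for. A rough sketch;
the registered table, keyspace, table and column names are all made up:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.datastax.spark.connector._

object SchemaRddToCassandra {
  case class Score(id: Int, name: String, score: Double)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("schemardd-to-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical host
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    // Hypothetical input registered as a table so the SELECT below has something to read.
    sc.parallelize(Seq(Score(1, "a", 1.0), Score(2, "b", 2.0))).registerTempTable("scores")

    // schemaRdd is an RDD of catalyst Rows, which the connector cannot reflect on directly.
    val schemaRdd = sqlContext.sql("SELECT id, name, score FROM scores")

    // Workaround: convert each Row into a plain tuple before saving.
    val plain = schemaRdd.map(row => (row.getInt(0), row.getString(1), row.getDouble(2)))

    // Keyspace, table and column names are made up for the example.
    plain.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "name", "score"))

    sc.stop()
  }
}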



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-saveToCassandra-tp13951p14340.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
Thanks Christian!  I tried compiling from source but am still getting the
same hadoop client version error when reading from HDFS.  Will have to poke
deeper... perhaps I've got some classpath issues.  FWIW I compiled using:

$ MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

and hadoop 2.3 / cdh5 from
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz





On Mon, Sep 15, 2014 at 6:49 PM, Christian Chua  wrote:

> Hi Paul.
>
> I would recommend building your own 1.1.0 distribution.
>
> ./make-distribution.sh --name hadoop-personal-build-2.4 --tgz -Pyarn
> -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
>
>
>
> I downloaded the "Pre-build for Hadoop 2.4" binary, and it had this
> strange behavior where
>
> spark-submit --master yarn-cluster ...
>
> will work, but
>
> spark-submit --master yarn-client ...
>
> will fail.
>
>
> But on the personal build obtained from the command above, both will then
> work.
>
>
> -Christian
>
>
>
>
> On Sep 15, 2014, at 6:28 PM, Paul Wais  wrote:
>
> Dear List,
>
> I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
> reading SequenceFiles.  In particular, I'm seeing:
>
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
> Server IPC version 7 cannot communicate with client version 4
>at org.apache.hadoop.ipc.Client.call(Client.java:1070)
>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
>...
>
> When invoking JavaSparkContext#newAPIHadoopFile().  (With args
> validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
> BytesWritable.class, new Job().getConfiguration() -- Pretty close to
> the unit test here:
>
> https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
> )
>
>
> This error indicates to me that Spark is using an old hadoop client to
> do reads.  Oddly I'm able to do /writes/ ok, i.e. I'm able to write
> via JavaPairRdd#saveAsNewAPIHadoopFile() to my hdfs cluster.
>
>
> Do I need to explicitly build spark for modern hadoop??  I previously
> had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
> error (server is using version 9, client is using version 4).
>
>
> I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
> on spark's site:
> * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
> * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz
>
>
> What distro of hadoop is used at Data Bricks?  Are there distros of
> Spark 1.1 and hadoop that should work together out-of-the-box?
> (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)
>
> Thanks for any help anybody can give me here!
> -Paul
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>


RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread linkpatrickliu
Hi, Hao Cheng.

I have done other tests. And the result shows the thriftServer can connect
to Zookeeper.

However, I found some more interesting things. And I think I have found a
bug!

Test procedure:
Test1:
(0) Use beeline to connect to thriftServer.
(1) Switch database "use dw_op1"; (OK)
The logs show that the thriftServer connected with Zookeeper and acquired
locks.

(2) Drop table "drop table src;" (Blocked)
The logs show that the thriftServer is "acquireReadWriteLocks".

Doubt:
The reason why I cannot drop table src is that the first SQL "use dw_op1"
has left locks in Zookeeper that were not successfully released.
So when the second SQL tries to acquire locks in Zookeeper, it will block.

Test2:
Restart thriftServer.
Instead of switching to another database, I just drop the table in the
default database;
(0) Restart thriftServer & use beeline to connect to thriftServer.
(1) Drop table "drop table src"; (OK)
Amazing! Succeed!
(2) Drop again!  "drop table src2;" (Blocked)
Same error: the thriftServer is blocked in the "acquireReadWriteLocks"
phase.

As you can see, only the first SQL requiring locks can succeed.
So I think the reason is that the thriftServer cannot release locks
correctly in Zookeeper.









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
hi,

I am trying to write some unit tests, following the spark programming guide.
But I observed that the unit tests run very slow (the code is just a SparkPi), so I
turned the log level to trace and looked through the log output, and found that
creation of the SparkContext seems to take most of the time.

there are some actions that take a lot of time:
1, seems starting jetty
14:04:55.038 [ScalaTest-run-running-MySparkPiSpec] DEBUG
o.e.jetty.servlet.ServletHandler -
servletNameMap={org.apache.spark.ui.JettyUtils$$anon$1-672f11c2=org.apache.spark.ui.JettyUtils$$anon$1-672f11c2}
14:05:25.121 [ScalaTest-run-running-MySparkPiSpec] DEBUG
o.e.jetty.util.component.Container - Container
org.eclipse.jetty.server.Server@e3cee7b +
SelectChannelConnector@0.0.0.0:4040 as connector

2, I don't know what this is
14:05:25.202 [ScalaTest-run-running-MySparkPiSpec] DEBUG
org.apache.hadoop.security.Groups - Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
14:05:54.594 [spark-akka.actor.default-dispatcher-4] TRACE
o.a.s.s.BlockManagerMasterActor - Checking for hosts with no recent heart
beats in BlockManagerMaster.


Are there any ways to make this faster for unit tests?
I also noticed that in spark's unit tests there exists a SharedSparkContext;
is this the suggested way to work around this problem? If so, I would
suggest documenting this in the programming guide.
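
Sharing one SparkContext across a whole suite, the way Spark's own
SharedSparkContext does, means the jetty/akka startup cost is paid once per
suite instead of once per test. A minimal ScalaTest sketch along those lines
(trait and suite names are only examples):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Suite}

// Create the SparkContext once per suite instead of once per test.
trait SharedLocalSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient private var _sc: SparkContext = _
  def sc: SparkContext = _sc

  override def beforeAll(): Unit = {
    super.beforeAll()
    _sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName(suiteName))
  }

  override def afterAll(): Unit = {
    if (_sc != null) _sc.stop()
    super.afterAll()
  }
}

class MySparkPiSpec extends FunSuite with SharedLocalSparkContext {
  test("count a small RDD") {
    assert(sc.parallelize(1 to 100).count() == 100)
  }
}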