Broadcasting Large Objects Fails?

2015-02-22 Thread TJ Klein
Hi,

I am trying to broadcast large objects (on the order of a few hundred MB).
However, I keep getting errors when trying to do so:

Traceback (most recent call last):
  File "/LORM_experiment.py", line 510, in <module>
    broadcast_gradient_function = sc.broadcast(gradient_function)
  File "/scratch/users/213444/spark/python/pyspark/context.py", line 643, in broadcast
    return Broadcast(self, value, self._pickled_broadcast_vars)
  File "/scratch/users/213444/spark/python/pyspark/broadcast.py", line 65, in __init__
    self._path = self.dump(value, f)
  File "/scratch/users/213444/spark/python/pyspark/broadcast.py", line 82, in dump
    cPickle.dump(value, f, 2)
SystemError: error return without exception set
15/02/22 04:52:14 ERROR Utils: Uncaught exception in thread delete Spark local dirs
java.lang.IllegalStateException: Shutdown in progress

Any idea how to prevent this? I have plenty of RAM, so memory shouldn't be the
problem.

Thanks,
 Tassilo



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Broadcasting-Large-Objects-Fails-tp21752.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Running Example Spark Program

2015-02-22 Thread Surendran Duraisamy

Hello All,

I am new to Apache Spark. I am trying to run JavaKMeans.java from the Spark 
examples on my Ubuntu system.


I downloaded spark-1.2.1-bin-hadoop2.4.tgz 
http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.4.tgz 
and started sbin/start-master.sh


After starting Spark, I accessed http://localhost:8080/ to look at the 
status of my Spark instance, and it shows the following.


 * URL: spark://vm:7077
 * Workers: 0
 * Cores: 0 Total, 0 Used
 * Memory: 0.0 B Total, 0.0 B Used
 * Applications: 0 Running, 4 Completed
 * Drivers: 0 Running, 0 Completed
 * Status: ALIVE

The number of cores is 0 and memory is 0.0 B. I think this is why I am 
getting the following error when I try to run JavaKMeans.java:


Initial job has not accepted any resources; check your cluster UI to 
ensure that workers are registered and have sufficient memory


Am I missing any configuration before running sbin/start-master.sh?

Regards,
Surendran


Re: Running Example Spark Program

2015-02-22 Thread Jason Bell
If you would like a more detailed walkthrough, I wrote one recently.

https://dataissexy.wordpress.com/2015/02/03/apache-spark-standalone-clusters-bigdata-hadoop-spark/

Regards
Jason Bell
 On 22 Feb 2015 14:16, VISHNU SUBRAMANIAN johnfedrickena...@gmail.com
wrote:

 Try restarting your Spark cluster .
 ./sbin/stop-all.sh
 ./sbin/start-all.sh

 Thanks,
 Vishnu

 On Sun, Feb 22, 2015 at 7:30 PM, Surendran Duraisamy 
 2013ht12...@wilp.bits-pilani.ac.in wrote:

  Hello All,

 I am new to Apache Spark, I am trying to run JavaKMeans.java from Spark
 Examples in my Ubuntu System.

 I downloaded spark-1.2.1-bin-hadoop2.4.tgz
 http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.4.tgz
 and started sbin/start-master.sh

 After starting Spark and access http://localhost:8080/ to look at the
 status of my Spark Instance, and it shows as follows.


- *URL:* spark://vm:7077
- *Workers:* 0
- *Cores:* 0 Total, 0 Used
- *Memory:* 0.0 B Total, 0.0 B Used
- *Applications:* 0 Running, 4 Completed
- *Drivers:* 0 Running, 0 Completed
- *Status:* ALIVE

 Number of Cores is 0 and Memory is 0.0B. I think because of this I am
 getting following error when I try to run JavaKMeans.java

 Initial job has not accepted any resources; check your cluster UI to
 ensure that workers are registered and have sufficient memory

 Am I missing any configuration before running sbin/start-master.sh?
  Regards,
 Surendran





RE: Spark SQL odbc on Windows

2015-02-22 Thread Ashic Mahtab
Hi Francisco,

While I haven't tried this, have a look at the contents of 
start-thriftserver.sh - all it's doing is setting up a few variables and 
calling:

/bin/spark-submit --class 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

and passing some additional parameters. Perhaps doing the same would work?
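
For what it's worth, the Windows equivalent would presumably be something along
these lines (an untested sketch: the master URL is a placeholder, and
spark-internal is the dummy application resource that the Linux script passes to
spark-submit):

bin\spark-submit.cmd --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 ^
  --master spark://your-master:7077 spark-internal
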
I also believe that this hosts a JDBC server (not ODBC), but there's a free 
ODBC connector from Databricks, built by Simba, with which I've been able to 
connect to a Spark cluster hosted on Linux.

-Ashic.

To: user@spark.apache.org
From: forch...@gmail.com
Subject: Spark SQL odbc on Windows
Date: Sun, 22 Feb 2015 09:45:03 +0100

Hello,
I work at an MS consulting company and we are evaluating including Spark in our 
big data offering. We are particularly interested in testing Spark as a ROLAP 
engine for SSAS, but we cannot find a way to activate the ODBC server (Thrift) 
on a Windows cluster. There is no start-thriftserver.sh command available for 
Windows.

Does anybody know if there is a way to make this work?

Thanks in advance!!
Francisco 

Re: Running Example Spark Program

2015-02-22 Thread VISHNU SUBRAMANIAN
Try restarting your Spark cluster .
./sbin/stop-all.sh
./sbin/start-all.sh

Thanks,
Vishnu

On Sun, Feb 22, 2015 at 7:30 PM, Surendran Duraisamy 
2013ht12...@wilp.bits-pilani.ac.in wrote:

  Hello All,

 I am new to Apache Spark, I am trying to run JavaKMeans.java from Spark
 Examples in my Ubuntu System.

 I downloaded spark-1.2.1-bin-hadoop2.4.tgz
 http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.4.tgz
 and started sbin/start-master.sh

 After starting Spark and access http://localhost:8080/ to look at the
 status of my Spark Instance, and it shows as follows.


- *URL:* spark://vm:7077
- *Workers:* 0
- *Cores:* 0 Total, 0 Used
- *Memory:* 0.0 B Total, 0.0 B Used
- *Applications:* 0 Running, 4 Completed
- *Drivers:* 0 Running, 0 Completed
- *Status:* ALIVE

 Number of Cores is 0 and Memory is 0.0B. I think because of this I am
 getting following error when I try to run JavaKMeans.java

 Initial job has not accepted any resources; check your cluster UI to
 ensure that workers are registered and have sufficient memory

 Am I missing any configuration before running sbin/start-master.sh?
  Regards,
 Surendran



Re: Broadcasting Large Objects Fails?

2015-02-22 Thread Tassilo Klein
Hi Akhil,

thanks for your reply. I am using the latest version of Spark, 1.2.1 (I also
tried the 1.3 development branch). If I am not mistaken, TorrentBroadcast is
the default there, isn't it?

Thanks,
 Tassilo

On Sun, Feb 22, 2015 at 10:59 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Did you try with torrent broadcast factory?

 Thanks
 Best Regards

 On Sun, Feb 22, 2015 at 3:29 PM, TJ Klein tjkl...@gmail.com wrote:

 Hi,

 I am trying to broadcast large objects (order of a couple of 100 MBs).
 However, I keep getting errors when trying to do so:

 Traceback (most recent call last):
   File /LORM_experiment.py, line 510, in module
 broadcast_gradient_function = sc.broadcast(gradient_function)
   File /scratch/users/213444/spark/python/pyspark/context.py, line 643,
 in
 broadcast
 return Broadcast(self, value, self._pickled_broadcast_vars)
   File /scratch/users/213444/spark/python/pyspark/broadcast.py, line 65,
 in __init__
 self._path = self.dump(value, f)
   File /scratch/users/213444/spark/python/pyspark/broadcast.py, line 82,
 in dump
 cPickle.dump(value, f, 2)
 SystemError: error return without exception set
 15/02/22 04:52:14 ERROR Utils: Uncaught exception in thread delete Spark
 local dirs
 java.lang.IllegalStateException: Shutdown in progress

 Any idea how to prevent that? I got plenty of RAM, so there shouldn't be
 any
 problem with that.

 Thanks,
  Tassilo



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Broadcasting-Large-Objects-Fails-tp21752.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Running Example Spark Program

2015-02-22 Thread Surendran Duraisamy

Thank you, Jason.
I got the program working after setting:

SPARK_WORKER_CORES
SPARK_WORKER_MEMORY
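
(For reference, these are typically set in conf/spark-env.sh on each worker
node, for example:

export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=2g

with values chosen for your machine, followed by a restart of the workers.)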

While running the program from Eclipse, I got a strange ClassNotFoundException.
In JavaKMeans.java, ParsePoint is a static inner class, and when running the 
program I got ClassNotFoundException for ParsePoint.

To get the program working, I had to either set

String[] jars = { "/home/surendran/workspace/SparkKMeans/target/SparkKMeans-0.0.1-SNAPSHOT.jar" };
sparkConf.setJars(jars);

or set SPARK_CLASSPATH to point to my jar.

What if I modify ParsePoint? Do I need to rebuild the jar to pick up the changes 
before running from Eclipse?


Regards,
Surendran


On Sunday 22 February 2015 07:55 PM, Jason Bell wrote:


If you would like a more detailed walkthrough, I wrote one recently.

https://dataissexy.wordpress.com/2015/02/03/apache-spark-standalone-clusters-bigdata-hadoop-spark/

Regards
Jason Bell

On 22 Feb 2015 14:16, VISHNU SUBRAMANIAN 
johnfedrickena...@gmail.com mailto:johnfedrickena...@gmail.com wrote:


Try restarting your Spark cluster .
./sbin/stop-all.sh
./sbin/start-all.sh

Thanks,
Vishnu

On Sun, Feb 22, 2015 at 7:30 PM, Surendran Duraisamy
2013ht12...@wilp.bits-pilani.ac.in
mailto:2013ht12...@wilp.bits-pilani.ac.in wrote:

Hello All,

I am new to Apache Spark, I am trying to run JavaKMeans.java
from Spark Examples in my Ubuntu System.

I downloaded spark-1.2.1-bin-hadoop2.4.tgz

http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.4.tgz
and started sbin/start-master.sh

After starting Spark and access http://localhost:8080/ to look
at the status of my Spark Instance, and it shows as follows.

  * *URL:*spark://vm:7077
  * *Workers:*0
  * *Cores:*0 Total, 0 Used
  * *Memory:*0.0 B Total, 0.0 B Used
  * *Applications:*0 Running, 4 Completed
  * *Drivers:*0 Running, 0 Completed
  * *Status:*ALIVE

Number of Cores is 0 and Memory is 0.0B. I think because of
this I am getting following error when I try to run
JavaKMeans.java

Initial job has not accepted any resources; check your
cluster UI to ensure that workers are registered and have
sufficient memory

Am I missing any configuration before running
sbin/start-master.sh?

Regards,
Surendran






Re: Spark SQL odbc on Windows

2015-02-22 Thread Denny Lee
Hi Francisco,

Out of curiosity - why ROLAP mode using multi-dimensional mode (vs tabular)
from SSAS to Spark? As a past SSAS guy you've definitely piqued my
interest.

The one thing that you may run into is that the SQL generated by SSAS can
be quite convoluted. When we were doing the same thing to try to get SSAS
to connect to Hive (ref paper at
http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/MOLAP2HIVE_KLOUT.docx)
that was definitely a blocker. Note that Spark SQL is different from HiveQL,
but you may run into the same issue. If so, the trick you may want to use
is similar to the paper - use a SQL Server linked server connection and
have SQL Server be your translator for the SQL generated by SSAS.

HTH!
Denny

On Sun, Feb 22, 2015 at 01:44 Ashic Mahtab as...@live.com wrote:

 Hi Francisco,
 While I haven't tried this, have a look at the contents of
 start-thriftserver.sh - all it's doing is setting up a few variables and
 calling:

 /bin/spark-submit --class
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

 and passing some additional parameters. Perhaps doing the same would work?

 I also believe that this hosts a jdbc server (not odbc), but there's a
 free odbc connector from databricks built by Simba, with which I've been
 able to connect to a spark cluster hosted on linux.

 -Ashic.

 --
 To: user@spark.apache.org
 From: forch...@gmail.com
 Subject: Spark SQL odbc on Windows
 Date: Sun, 22 Feb 2015 09:45:03 +0100


 Hello,
 I work on a MS consulting company and we are evaluating including SPARK on
 our BigData offer. We are particulary interested into testing SPARK as
 rolap engine for SSAS but we cannot find a way to activate the odbc server
 (thrift) on a Windows custer. There is no start-thriftserver.sh command
 available for windows.

 Somebody knows if there is a way to make this work?

 Thanks in advance!!
 Francisco



Re: Spark performance tuning

2015-02-22 Thread Akhil Das
You can simply follow these http://spark.apache.org/docs/1.2.0/tuning.html

Thanks
Best Regards

On Sun, Feb 22, 2015 at 1:14 AM, java8964 java8...@hotmail.com wrote:

 Can someone share some ideas about how to tune the GC time?

 Thanks

 --
 From: java8...@hotmail.com
 To: user@spark.apache.org
 Subject: Spark performance tuning
 Date: Fri, 20 Feb 2015 16:04:23 -0500


 Hi,

 I am new to Spark, and I am trying to test Spark SQL performance vs.
 Hive. I set up a standalone box with 24 cores and 64G of memory.

 We have one SQL query in mind to test. Here is the basic setup on this one
 box for the SQL we are trying to run:

 1) Dataset 1: a 6.6G Avro file with snappy compression, which contains a nested
 structure of 3 arrays of structs in Avro
 2) Dataset 2: a 5G Avro file with snappy compression
 3) Dataset 3: a 2.3M Avro file with snappy compression.

 The basic structure of the query is like this:


 (select
 xxx
 from
 dataset1 lateral view outer explode(struct1) lateral view outer
 explode(struct2)
 where x )
 left outer join
 (
 select  from dataset2 lateral view explode(xxx) where 
 )
 on 
 left outer join
 (
 select xxx from dataset3 where )
 on x

 So overall what it does is 2 outer explode on dataset1, left outer join
 with explode of dataset2, then finally left outer join with dataset 3.

 On this standalone box, I installed Hadoop 2.2 and Hive 0.12, and Spark
 1.2.0.

 As a baseline, the above query finishes in around 50 minutes in Hive 0.12, with 6
 mappers and 3 reducers, each with a 1G max heap, in 3 rounds of MR jobs.

 This is a very expensive query running in our production, of course with
 much bigger data set, every day. Now I want to see how fast Spark can do
 for the same query.

 I am using the following settings, based on my understanding of Spark, for
 a fair test between it and Hive:

 export SPARK_WORKER_MEMORY=32g
 export SPARK_DRIVER_MEMORY=2g
 --executor-memory 9g
 --total-executor-cores 9

 I am trying to run one executor with 9 cores and a max 9G heap, to make
 Spark use almost the same resources we gave to MapReduce.
 Here is the result without any additional configuration changes, running
 under Spark 1.2.0, using HiveContext in Spark SQL, to run the exact same
 query:

 Spark SQL generated 5 stages of tasks, shown below:
 Stage 4: collect at SparkPlan.scala:84        2015/02/20 10:48:46  26 s     200/200
 Stage 3: mapPartitions at Exchange.scala:64   2015/02/20 10:32:07  16 min   200/200  1112.3 MB
 Stage 2: mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  9 min    40/40    4.7 GB   22.2 GB
 Stage 1: mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  1.9 min  50/50    6.2 GB   2.8 GB
 Stage 0: mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  6 s      2/2      2.3 MB   156.6 KB

 So the wall time of the whole query is 26s + 16m + 9m + 2m + 6s, around 28
 minutes.

 That is about 56% of the original time, not bad. But I want to know whether any
 tuning of Spark can make it even faster.

 For stages 2 and 3, I observed that GC time becomes more and more expensive,
 especially in stage 3, shown below:

 For stage 3:
 Metric         Min     25th percentile  Median  75th percentile  Max
 Duration       20 s    30 s             35 s    39 s             2.4 min
 GC Time        9 s     17 s             20 s    25 s             2.2 min
 Shuffle Write  4.7 MB  4.9 MB           5.2 MB  6.1 MB           8.3 MB

 So at the median, GC took 20s/35s = 57% of the task time.

 The first change I made was to add the following line to spark-defaults.conf:
 spark.serializer org.apache.spark.serializer.KryoSerializer

 My assumption is that using KryoSerializer, instead of the default Java
 serializer, will lower the memory footprint and should lower the GC pressure
 at runtime. I know I changed the correct spark-defaults.conf, because if I
 add spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails
 -XX:+PrintGCTimeStamps to the same file, I see the GC output in the stdout
 file. Of course, in this test, I didn't add that, as I want to make only one
 change at a time.
 The result is almost the same as using the standard Java serializer. The wall
 time is still 28 minutes, and in stage 3 the GC still took around 50 to
 60% of the time, with almost the same min, median and max as before,
 without any noticeable performance gain.

 Next, based on my understanding, and for this test, I think the
 default spark.storage.memoryFraction is too high for this query, as there
 is no reason to reserve so much memory for caching data, because we don't
 reuse any dataset in this one query. So I added --conf
 spark.storage.memoryFraction=0.3 at the end of the spark-shell command, to
 reserve half as much memory for caching data as the first time. Of course,
 this time I rolled back the earlier KryoSerializer change.

 The result looks almost the same. The whole query finished around 28s
 + 14m 
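
For reference, the settings experimented with in the quoted message, collected
in spark-defaults.conf form (values exactly as quoted above, offered as a
starting point rather than a recommendation):

spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.storage.memoryFraction     0.3
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps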

Re: Broadcasting Large Objects Fails?

2015-02-22 Thread Akhil Das
Did you try with torrent broadcast factory?

Thanks
Best Regards

On Sun, Feb 22, 2015 at 3:29 PM, TJ Klein tjkl...@gmail.com wrote:

 Hi,

 I am trying to broadcast large objects (order of a couple of 100 MBs).
 However, I keep getting errors when trying to do so:

 Traceback (most recent call last):
   File /LORM_experiment.py, line 510, in module
 broadcast_gradient_function = sc.broadcast(gradient_function)
   File /scratch/users/213444/spark/python/pyspark/context.py, line 643,
 in
 broadcast
 return Broadcast(self, value, self._pickled_broadcast_vars)
   File /scratch/users/213444/spark/python/pyspark/broadcast.py, line 65,
 in __init__
 self._path = self.dump(value, f)
   File /scratch/users/213444/spark/python/pyspark/broadcast.py, line 82,
 in dump
 cPickle.dump(value, f, 2)
 SystemError: error return without exception set
 15/02/22 04:52:14 ERROR Utils: Uncaught exception in thread delete Spark
 local dirs
 java.lang.IllegalStateException: Shutdown in progress

 Any idea how to prevent that? I got plenty of RAM, so there shouldn't be
 any
 problem with that.

 Thanks,
  Tassilo



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Broadcasting-Large-Objects-Fails-tp21752.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Broadcasting Large Objects Fails?

2015-02-22 Thread Tassilo Klein
I see, thanks. Yes, I have already tried all sorts of changes to these
parameters. Unfortunately, none of them seemed to have any impact.

Thanks,
 Tassilo

On Sun, Feb 22, 2015 at 1:24 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Yes it is, you have some more customizable options over here
 http://spark.apache.org/docs/1.2.0/configuration.html#compression-and-serialization


 Thanks
 Best Regards

 On Sun, Feb 22, 2015 at 11:47 PM, Tassilo Klein tjkl...@gmail.com wrote:

 Hi Akhil,

 thanks for your reply. I am using the latest version of Spark 1.2.1 (also
 tried 1.3 developer branch). If I am not mistaken the TorrentBroadcast is
 the default there, isn't it?

 Thanks,
  Tassilo

 On Sun, Feb 22, 2015 at 10:59 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Did you try with torrent broadcast factory?

 Thanks
 Best Regards

 On Sun, Feb 22, 2015 at 3:29 PM, TJ Klein tjkl...@gmail.com wrote:

 Hi,

 I am trying to broadcast large objects (order of a couple of 100 MBs).
 However, I keep getting errors when trying to do so:

 Traceback (most recent call last):
   File /LORM_experiment.py, line 510, in module
 broadcast_gradient_function = sc.broadcast(gradient_function)
   File /scratch/users/213444/spark/python/pyspark/context.py, line
 643, in
 broadcast
 return Broadcast(self, value, self._pickled_broadcast_vars)
   File /scratch/users/213444/spark/python/pyspark/broadcast.py, line
 65,
 in __init__
 self._path = self.dump(value, f)
   File /scratch/users/213444/spark/python/pyspark/broadcast.py, line
 82,
 in dump
 cPickle.dump(value, f, 2)
 SystemError: error return without exception set
 15/02/22 04:52:14 ERROR Utils: Uncaught exception in thread delete Spark
 local dirs
 java.lang.IllegalStateException: Shutdown in progress

 Any idea how to prevent that? I got plenty of RAM, so there shouldn't
 be any
 problem with that.

 Thanks,
  Tassilo



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Broadcasting-Large-Objects-Fails-tp21752.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: Spark SQL odbc on Windows

2015-02-22 Thread Denny Lee
Back to thrift, there was an earlier thread on this topic at
http://mail-archives.apache.org/mod_mbox/spark-user/201411.mbox/%3CCABPQxsvXA-ROPeXN=wjcev_n9gv-drqxujukbp_goutvnyx...@mail.gmail.com%3E
that may be useful as well.

On Sun Feb 22 2015 at 8:42:29 AM Denny Lee denny.g@gmail.com wrote:

 Hi Francisco,

 Out of curiosity - why ROLAP mode using multi-dimensional mode (vs
 tabular) from SSAS to Spark? As a past SSAS guy you've definitely piqued my
 interest.

 The one thing that you may run into is that the SQL generated by SSAS can
 be quite convoluted. When we were doing the same thing to try to get SSAS
 to connect to Hive (ref paper at
 http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/MOLAP2HIVE_KLOUT.docx)
 that was definitely a blocker. Note that Spark SQL is different than HIVEQL
 but you may run into the same issue. If so, the trick you may want to use
 is similar to the paper - use a SQL Server linked server connection and
 have SQL Server be your translator for the SQL generated by SSAS.

 HTH!
 Denny

 On Sun, Feb 22, 2015 at 01:44 Ashic Mahtab as...@live.com wrote:

 Hi Francisco,
 While I haven't tried this, have a look at the contents of
 start-thriftserver.sh - all it's doing is setting up a few variables and
 calling:

 /bin/spark-submit --class org.apache.spark.sql.hive.
 thriftserver.HiveThriftServer2

 and passing some additional parameters. Perhaps doing the same would work?

 I also believe that this hosts a jdbc server (not odbc), but there's a
 free odbc connector from databricks built by Simba, with which I've been
 able to connect to a spark cluster hosted on linux.

 -Ashic.

 --
 To: user@spark.apache.org
 From: forch...@gmail.com
 Subject: Spark SQL odbc on Windows
 Date: Sun, 22 Feb 2015 09:45:03 +0100


 Hello,
 I work on a MS consulting company and we are evaluating including SPARK
 on our BigData offer. We are particulary interested into testing SPARK as
 rolap engine for SSAS but we cannot find a way to activate the odbc server
 (thrift) on a Windows custer. There is no start-thriftserver.sh command
 available for windows.

 Somebody knows if there is a way to make this work?

 Thanks in advance!!
 Francisco




Re: Broadcasting Large Objects Fails?

2015-02-22 Thread Akhil Das
Yes it is, you have some more customizable options over here
http://spark.apache.org/docs/1.2.0/configuration.html#compression-and-serialization


Thanks
Best Regards
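
For reference, the broadcast-related knobs from that page, in spark-defaults.conf
form (1.2.x property names; spark.broadcast.blockSize is in kilobytes, and the
values shown are only examples, not recommendations):

spark.broadcast.factory    org.apache.spark.broadcast.TorrentBroadcastFactory
spark.broadcast.blockSize  8192
spark.broadcast.compress   true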

On Sun, Feb 22, 2015 at 11:47 PM, Tassilo Klein tjkl...@gmail.com wrote:

 Hi Akhil,

 thanks for your reply. I am using the latest version of Spark 1.2.1 (also
 tried 1.3 developer branch). If I am not mistaken the TorrentBroadcast is
 the default there, isn't it?

 Thanks,
  Tassilo

 On Sun, Feb 22, 2015 at 10:59 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Did you try with torrent broadcast factory?

 Thanks
 Best Regards

 On Sun, Feb 22, 2015 at 3:29 PM, TJ Klein tjkl...@gmail.com wrote:

 Hi,

 I am trying to broadcast large objects (order of a couple of 100 MBs).
 However, I keep getting errors when trying to do so:

 Traceback (most recent call last):
   File /LORM_experiment.py, line 510, in module
 broadcast_gradient_function = sc.broadcast(gradient_function)
   File /scratch/users/213444/spark/python/pyspark/context.py, line
 643, in
 broadcast
 return Broadcast(self, value, self._pickled_broadcast_vars)
   File /scratch/users/213444/spark/python/pyspark/broadcast.py, line
 65,
 in __init__
 self._path = self.dump(value, f)
   File /scratch/users/213444/spark/python/pyspark/broadcast.py, line
 82,
 in dump
 cPickle.dump(value, f, 2)
 SystemError: error return without exception set
 15/02/22 04:52:14 ERROR Utils: Uncaught exception in thread delete Spark
 local dirs
 java.lang.IllegalStateException: Shutdown in progress

 Any idea how to prevent that? I got plenty of RAM, so there shouldn't be
 any
 problem with that.

 Thanks,
  Tassilo



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Broadcasting-Large-Objects-Fails-tp21752.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Spark SQL odbc on Windows

2015-02-22 Thread Francisco Orchard
Hello,
I work at an MS consulting company and we are evaluating including Spark in our 
big data offering. We are particularly interested in testing Spark as a ROLAP 
engine for SSAS, but we cannot find a way to activate the ODBC server (Thrift) 
on a Windows cluster. There is no start-thriftserver.sh command available for 
Windows.

Does anybody know if there is a way to make this work?

Thanks in advance!!
Francisco

Re: Missing shuffle files

2015-02-22 Thread Sameer Farooqui
Do you guys have dynamic allocation turned on for YARN?

Anders, was Task 450 in your job acting like a Reducer and fetching the Map
spill output data from a different node?

If a Reducer task can't read the remote data it needs, that could cause the
stage to fail. Sometimes this forces the previous stage to also be
re-computed if it's a wide dependency.

But like Petar said, if you turn the external shuffle service on, the YARN
NodeManager process on the slave machines will serve out the map spill
data instead of the Executor JVMs (by default, unless you turn the external
shuffle service on, the Executor JVM itself serves out the shuffle data, which
causes problems if an Executor dies).

Corey, how often are Executors crashing in your app? How many Executors do
you have in total? And what is the memory size for each? You can change what
fraction of the Executor heap is used for your user code vs. the shuffle vs.
RDD caching with the spark.storage.memoryFraction setting.
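
For reference, a sketch of the relevant properties in spark-defaults.conf form
(property names as of Spark 1.2; the memory fraction is only an example value):

spark.shuffle.service.enabled  true
spark.storage.memoryFraction   0.4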

On Sat, Feb 21, 2015 at 2:58 PM, Petar Zecevic petar.zece...@gmail.com
wrote:


 Could you try to turn on the external shuffle service?

 spark.shuffle.service.enabled = true


 On 21.2.2015. 17:50, Corey Nolet wrote:

 I'm experiencing the same issue. Upon closer inspection I'm noticing that
 executors are being lost as well. Thing is, I can't figure out how they are
 dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3 TB of memory
 allocated for the application. I was thinking perhaps it was possible that
 a single executor was getting a single or a couple of large partitions, but
 shouldn't the disk persistence kick in at that point?

 On Sat, Feb 21, 2015 at 11:20 AM, Anders Arpteg arp...@spotify.com
 wrote:

 For large jobs, the following error message is shown that seems to
 indicate that shuffle files for some reason are missing. It's a rather
 large job with many partitions. If the data size is reduced, the problem
 disappears. I'm running a build from Spark master post 1.2 (build at
 2015-01-16) and running on Yarn 2.2. Any idea of how to resolve this
 problem?

  User class threw exception: Job aborted due to stage failure: Task 450
 in stage 450.1 failed 4 times, most recent failure: Lost task 450.3 in
 stage 450.1 (TID 167370, lon4-hadoopslave-b77.lon4.spotify.net):
 java.io.FileNotFoundException:
 /disk/hd06/yarn/local/usercache/arpteg/appcache/application_1424333823218_21217/spark-local-20150221154811-998c/03/rdd_675_450
 (No such file or directory)
  at java.io.FileOutputStream.open(Native Method)
  at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
  at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
  at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:76)
  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:786)
  at
 org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637)
  at
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:149)
  at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:74)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
  at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192)
  at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

  at java.lang.Thread.run(Thread.java:745)

  TIA,
 Anders






Re: How to send user variables from Spark client to custom InputFormat or RecordReader ?

2015-02-22 Thread Tom Vacek
The SparkConf doesn't allow you to set arbitrary variables.  You can use
SparkContext's HadoopRDD and create a JobConf (with whatever variables you
want), and then grab them out of the JobConf in your RecordReader.
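
For example, a rough Scala sketch of that approach, adapted to the new-API
(mapreduce) InputFormat used in the question; the input path, key/value classes
and MyInputFormat are placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("custom-input-format"))

// Put the user variable into the Hadoop job configuration rather than into
// SparkConf, so the InputFormat/RecordReader can read it via context.getConfiguration().
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("developer", "MyName")

val rdd = sc.newAPIHadoopFile(
  "hdfs:///path/to/input",   // placeholder input path
  classOf[MyInputFormat],    // stands in for the custom InputFormat from the question
  classOf[LongWritable],     // key class -- adjust to the custom format
  classOf[Text],             // value class -- adjust to the custom format
  hadoopConf)

println(rdd.count())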

On Sun, Feb 22, 2015 at 4:28 PM, hnahak harihar1...@gmail.com wrote:

 Hi,

 I have written custom InputFormat and RecordReader for Spark, I need  to
 use
 user variables from spark client program.

 I added them in SparkConf

  val sparkConf = new
 SparkConf().setAppName(args(0)).set("developer", "MyName")

 *and in InputFormat class*

 protected boolean isSplitable(JobContext context, Path filename) {
     System.out.println("# Developer " + context.getConfiguration().get("developer"));
     return false;
 }

 but its return me *null* , is there any way I can pass user variables to my
 custom code?

 Thanks !!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-send-user-variables-from-Spark-client-to-custom-InputFormat-or-RecordReader-tp21755.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Posting to the list

2015-02-22 Thread hnahak
I'm also facing the same issue. This is the third time: whenever I post anything,
it is never accepted by the community, and at the same time I get a failure mail
at my registered mail id.

And when I click the subscribe to this mailing list link, I don't get any new
subscription mail in my inbox.

Can anyone please suggest the best way to subscribe my email ID?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Posting-to-the-list-tp21750p21756.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [Spark SQL]: Convert SchemaRDD back to RDD

2015-02-22 Thread Ted Yu
Haven't found the method in
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD

The new DataFrame has this method:

  /**
   * Returns the content of the [[DataFrame]] as an [[RDD]] of [[Row]]s.
   * @group rdd
   */
  def rdd: RDD[Row] = {

FYI

On Sun, Feb 22, 2015 at 11:51 AM, stephane.collot stephane.col...@gmail.com
 wrote:

 Hi Michael,

 I think that the feature (convert a SchemaRDD to a structured class RDD) is
 now available. But I didn't understand in the PR how exactly to do this.
 Can
 you give an example or doc links?

 Best regards



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-tp9071p21753.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




How to send user variables from Spark client to custom InputFormat or RecordReader ?

2015-02-22 Thread hnahak
Hi,

I have written a custom InputFormat and RecordReader for Spark, and I need to use
user variables from the Spark client program.

I added them in SparkConf:

 val sparkConf = new
SparkConf().setAppName(args(0)).set("developer", "MyName")

*and in the InputFormat class*

protected boolean isSplitable(JobContext context, Path filename) {
    System.out.println("# Developer " + context.getConfiguration().get("developer"));
    return false;
}

but it returns *null*. Is there any way I can pass user variables to my
custom code?

Thanks !!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-send-user-variables-from-Spark-client-to-custom-InputFormat-or-RecordReader-tp21755.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [Spark SQL]: Convert SchemaRDD back to RDD

2015-02-22 Thread stephane.collot
Hi Michael,

I think that the feature (convert a SchemaRDD to a structured class RDD) is
now available. But I didn't understand in the PR how exactly to do this. Can
you give an example or doc links?

Best regards



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-tp9071p21753.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Posting to the list

2015-02-22 Thread Ted Yu
bq. i didnt get any new subscription mail in my inbox.

Have you checked your Spam folder ?

Cheers

On Sun, Feb 22, 2015 at 2:36 PM, hnahak harihar1...@gmail.com wrote:

 I'm also facing the same issue, this is third time whenever I post anything
 it never accept by the community and at the same time got a failure mail in
 my register mail id.

 and when click to subscribe to this mailing list link, i didnt get any
 new
 subscription mail in my inbox.

 Please anyone suggest a best way to subscribed the email ID



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Posting-to-the-list-tp21750p21756.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




How to integrate HBASE on Spark

2015-02-22 Thread sandeep vura
Hi,

I have installed Spark on a 3-node cluster. The Spark services are up and
running, but I want to integrate HBase with Spark.

Do I need to install HBase on the Hadoop cluster or on the Spark cluster?

Please let me know ASAP.

Regards,
Sandeep.v


Submitting jobs to Spark EC2 cluster remotely

2015-02-22 Thread olegshirokikh
I've set up the EC2 cluster with Spark. Everything works, all master/slaves
are up and running.

I'm trying to submit a sample job (SparkPi). When I SSH to the cluster and
submit it from there, everything works fine. However, when the driver is created
on a remote host (my laptop), it doesn't work. I've tried both modes for
`--deploy-mode`:

**`--deploy-mode=client`:**

From my laptop:

./bin/spark-submit --master
spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --class
SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

Results in the following indefinite warnings/errors:

  WARN TaskSchedulerImpl: Initial job has not accepted any resources;
 check your cluster UI to ensure that workers are registered and have
 sufficient memory 15/02/22 18:30:45 

 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
 15/02/22 18:30:45 

 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1

...and failed drivers appear in the Spark Web UI under Completed Drivers with
State=ERROR.

I've tried passing limits for cores and memory to the submit script, but it
didn't help...

**`--deploy-mode=cluster`:**

From my laptop:

./bin/spark-submit --master
spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode
cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

The result is:

  Driver successfully submitted as driver-20150223023734-0007 ...
 waiting before polling master for driver state ... polling master for
 driver state State of driver-20150223023734-0007 is ERROR Exception
 from cluster was: java.io.FileNotFoundException: File
 file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
 does not exist. java.io.FileNotFoundException: File
 file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
 does not exist.   at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
   at
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)at
 org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
   at
 org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)

 So, I'd appreciate any pointers on what is going wrong and some guidance on
how to deploy jobs from a remote client. Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs-to-Spark-EC2-cluster-remotely-tp21762.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: cannot run spark shell in yarn-client mode

2015-02-22 Thread quangnguyenbh
Has anyone fixed this error?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/cannot-run-spark-shell-in-yarn-client-mode-tp4013p21761.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to integrate HBASE on Spark

2015-02-22 Thread Akhil Das
If both clusters are on the same network, then I'd suggest installing it on
the Hadoop cluster. If you install it on the Spark cluster itself, HBase
might take up a few CPU cycles and there's a chance the job will lag.

Thanks
Best Regards

On Mon, Feb 23, 2015 at 12:48 PM, sandeep vura sandeepv...@gmail.com
wrote:

 Hi

 I had installed spark on 3 node cluster. Spark services are up and
 running.But i want to integrate hbase on spark

 Do i need to install HBASE on hadoop cluster or spark cluster.

 Please let me know asap.

 Regards,
 Sandeep.v



Re: Posting to the list

2015-02-22 Thread haihar nahak
I checked it but I didn't see any mail from user list. Let me do it one
more time.


--Harihar

On Mon, Feb 23, 2015 at 11:50 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. i didnt get any new subscription mail in my inbox.

 Have you checked your Spam folder ?

 Cheers

 On Sun, Feb 22, 2015 at 2:36 PM, hnahak harihar1...@gmail.com wrote:

 I'm also facing the same issue, this is third time whenever I post
 anything
 it never accept by the community and at the same time got a failure mail
 in
 my register mail id.

 and when click to subscribe to this mailing list link, i didnt get any
 new
 subscription mail in my inbox.

 Please anyone suggest a best way to subscribed the email ID



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Posting-to-the-list-tp21750p21756.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





-- 
{{{H2N}}}-(@:


Re: Use Spark Streaming for Batch?

2015-02-22 Thread Tobias Pfeiffer
Hi,

On Sat, Feb 21, 2015 at 1:05 AM, craigv craigvanderbo...@gmail.com wrote:
  /Might it be possible to perform large batches processing on HDFS time
  series data using Spark Streaming?/
 
  1.I understand that there is not currently an InputDStream that could do
  what's needed.  I would have to create such a thing.
  2. Time is a problem.  I would have to use the timestamps on our events
 for
  any time-based logic and state management
  3. The batch duration would become meaningless in this scenario.
 Could I
  just set it to something really small (say 1 second) and then let it
 fall
  behind, processing the data as quickly as it could?


So, if it is not an issue for you that everything is processed in one batch,
you can use streamingContext.textFileStream(hdfsDirectory). This will
create a DStream that has one huge RDD with all the data in the first batch and
then empty batches afterwards. You can use a small batch size; that should not be
a problem.
An alternative would be to write some code that creates one RDD per file in
your HDFS directory, create a Queue of those RDDs and then use
streamingContext.queueStream(), possibly with the oneAtATime=true parameter
(which will process only one RDD per batch).
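
A rough Scala sketch of that queue-based approach (the HDFS paths and batch
interval are placeholders):

import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext(new SparkConf().setAppName("batch-as-stream"))
val ssc = new StreamingContext(sc, Seconds(1))

// One RDD per HDFS file; with oneAtATime = true, queueStream hands exactly
// one of these RDDs to each batch interval.
val paths = Seq("hdfs:///events/part-00000", "hdfs:///events/part-00001")
val rddQueue = mutable.Queue(paths.map(p => sc.textFile(p)): _*)

val stream = ssc.queueStream(rddQueue, oneAtATime = true)
stream.foreachRDD(rdd => println(s"processed ${rdd.count()} records"))

ssc.start()
ssc.awaitTermination()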

However, to do window computations etc with the timestamps embedded *in*
your data will be a major effort, as in: You cannot use the existing
windowing functionality from Spark Streaming. If you want to read more
about that, there have been a number of discussions about that topic on
this list; maybe you can look them up.

Tobias


Re: Any sample code for Kafka consumer

2015-02-22 Thread Tathagata Das
Spark Streaming already directly supports Kafka
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources

Is there any reason why that is not sufficient?

TD
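
For reference, a minimal Scala sketch of the receiver-based consumer described
there (requires the spark-streaming-kafka artifact; the ZooKeeper quorum, group
id and topic name are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-example"), Seconds(2))

val kafkaStream = KafkaUtils.createStream(
  ssc,
  "zk-host:2181",         // ZooKeeper quorum
  "my-consumer-group",    // consumer group id
  Map("my-topic" -> 1))   // topic -> number of receiver threads

// createStream yields (key, message) pairs; keep just the message payload.
val messages = kafkaStream.map(_._2)
messages.print()

ssc.start()
ssc.awaitTermination()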

On Sun, Feb 22, 2015 at 5:18 PM, mykidong mykid...@gmail.com wrote:

 In java, you can see this example:
 https://github.com/mykidong/spark-kafka-simple-consumer-receiver

 - Kidong.

 -- Original Message --
 From: icecreamlc [via Apache Spark User List] [hidden email]
 To: mykidong [hidden email]
 Sent: 2015-02-21 11:16:37 AM
 Subject: Any sample code for Kafka consumer


 Dear all,

 Do you have any sample code that consume data from Kafka server using
 spark streaming? Thanks a lot!!!

  --
  If you reply to this email, your message will be added to the discussion
 below:

 http://apache-spark-user-list.1001560.n3.nabble.com/Any-sample-code-for-Kafka-consumer-tp21746.html


 --
 View this message in context: Re: Any sample code for Kafka consumer
 http://apache-spark-user-list.1001560.n3.nabble.com/Any-sample-code-for-Kafka-consumer-tp21746p21758.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



Re: Launching Spark cluster on EC2 with Ubuntu AMI

2015-02-22 Thread Ted Yu
bq. bash: git: command not found

Looks like the AMI doesn't have git pre-installed.

Cheers

On Sun, Feb 22, 2015 at 4:29 PM, olegshirokikh o...@solver.com wrote:

 I'm trying to launch Spark cluster on AWS EC2 with custom AMI (Ubuntu)
 using
 the following:

 ./ec2/spark-ec2 --key-pair=*** --identity-file='/home/***.pem'
 --region=us-west-2 --zone=us-west-2b --spark-version=1.2.1 --slaves=2
 --instance-type=t2.micro --ami=ami-29ebb519 --user=ubuntu launch
 spark-ubuntu-cluster

 Everything starts OK and instances are launched:

 Found 1 master(s), 2 slaves
 Waiting for all instances in cluster to enter 'ssh-ready' state.
 Generating cluster's SSH key on master.

 But then I'm getting the following SSH errors until it stops trying and
 quits:

 bash: git: command not found
 Connection to ***.us-west-2.compute.amazonaws.com closed.
 Error executing remote command, retrying after 30 seconds: Command '['ssh',
 '-o', 'StrictHostKeyChecking=no', '-i', '/home/***t.pem', '-o',
 'UserKnownHostsFile=/dev/null', '-t', '-t',
 u'ubuntu@***.us-west-2.compute.amazonaws.com', 'rm -rf spark-ec2 && git
 clone https://github.com/mesos/spark-ec2.git -b v4']' returned non-zero
 exit
 status 127

 I know that Spark EC2 scripts are not guaranteed to work with custom AMIs
 but still, it should work... Any advice would be greatly appreciated!




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Launching-Spark-cluster-on-EC2-with-Ubuntu-AMI-tp21757.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to send user variables from Spark client to custom InputFormat or RecordReader ?

2015-02-22 Thread haihar nahak
Thanks. I extracted the Hadoop configuration, set my arbitrary variable, and
was able to read it inside the InputFormat from JobContext.getConfiguration().

On Mon, Feb 23, 2015 at 12:04 PM, Tom Vacek minnesota...@gmail.com wrote:

 The SparkConf doesn't allow you to set arbitrary variables.  You can use
 SparkContext's HadoopRDD and create a JobConf (with whatever variables you
 want), and then grab them out of the JobConf in your RecordReader.

 On Sun, Feb 22, 2015 at 4:28 PM, hnahak harihar1...@gmail.com wrote:

 Hi,

 I have written custom InputFormat and RecordReader for Spark, I need  to
 use
 user variables from spark client program.

 I added them in SparkConf

  val sparkConf = new
 SparkConf().setAppName(args(0)).set(developer,MyName)

 *and in InputFormat class*

 protected boolean isSplitable(JobContext context, Path filename) {


 System.out.println(# Developer 
 + context.getConfiguration().get(developer) );
 return false;
 }

 but its return me *null* , is there any way I can pass user variables to
 my
 custom code?

 Thanks !!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-send-user-variables-from-Spark-client-to-custom-InputFormat-or-RecordReader-tp21755.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





-- 
{{{H2N}}}-(@:


Re: How to send user variables from Spark client to custom InputFormat or RecordReader ?

2015-02-22 Thread hnahak
Instead of setting it in SparkConf, set it via
SparkContext.hadoopConfiguration.set(key, value),

and extract the same key from the JobContext.
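
A two-line sketch of this, assuming an existing SparkContext sc and the
developer key from the original question:

// Driver side: set the value on the Hadoop configuration, not on SparkConf.
sc.hadoopConfiguration.set("developer", "MyName")

// InputFormat/RecordReader side (Java): context.getConfiguration().get("developer")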

--Harihar



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-send-user-variables-from-Spark-client-to-custom-InputFormat-or-RecordReader-tp21755p21759.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Launching Spark cluster on EC2 with Ubuntu AMI

2015-02-22 Thread olegshirokikh
I'm trying to launch Spark cluster on AWS EC2 with custom AMI (Ubuntu) using
the following:

./ec2/spark-ec2 --key-pair=*** --identity-file='/home/***.pem'
--region=us-west-2 --zone=us-west-2b --spark-version=1.2.1 --slaves=2
--instance-type=t2.micro --ami=ami-29ebb519 --user=ubuntu launch
spark-ubuntu-cluster

Everything starts OK and instances are launched:

Found 1 master(s), 2 slaves
Waiting for all instances in cluster to enter 'ssh-ready' state.
Generating cluster's SSH key on master.

But then I'm getting the following SSH errors until it stops trying and
quits:

bash: git: command not found
Connection to ***.us-west-2.compute.amazonaws.com closed.
Error executing remote command, retrying after 30 seconds: Command '['ssh',
'-o', 'StrictHostKeyChecking=no', '-i', '/home/***t.pem', '-o',
'UserKnownHostsFile=/dev/null', '-t', '-t',
u'ubuntu@***.us-west-2.compute.amazonaws.com', 'rm -rf spark-ec2 && git
clone https://github.com/mesos/spark-ec2.git -b v4']' returned non-zero exit
status 127

I know that Spark EC2 scripts are not guaranteed to work with custom AMIs
but still, it should work... Any advice would be greatly appreciated!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Launching-Spark-cluster-on-EC2-with-Ubuntu-AMI-tp21757.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Any sample code for Kafka consumer

2015-02-22 Thread mykidong
In java, you can see this example:
https://github.com/mykidong/spark-kafka-simple-consumer-receiver

- Kidong.

-- Original Message --
From: icecreamlc [via Apache Spark User List] 
ml-node+s1001560n21746...@n3.nabble.com
To: mykidong mykid...@gmail.com
Sent: 2015-02-21 11:16:37 AM
Subject: Any sample code for Kafka consumer

Dear all,

Do you have any sample code that consume data from Kafka server using 
spark streaming? Thanks a lot!!!


If you reply to this email, your message will be added to the 
discussion below:
http://apache-spark-user-list.1001560.n3.nabble.com/Any-sample-code-for-Kafka-consumer-tp21746.html



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Any-sample-code-for-Kafka-consumer-tp21746p21758.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Use Spark Streaming for Batch?

2015-02-22 Thread Soumitra Kumar
See if https://issues.apache.org/jira/browse/SPARK-3660 helps you. My patch
has been accepted, and this enhancement is scheduled for 1.3.0.

This lets you specify an initial RDD for the updateStateByKey operation. Let me
know if you need any more information.
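
A rough Scala sketch of how that overload is expected to look in 1.3.0 (the
socket source, checkpoint directory and seed values are placeholders;
updateStateByKey requires a checkpoint directory to be set):

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("initial-state"), Seconds(10))
ssc.checkpoint("hdfs:///tmp/checkpoints")   // placeholder checkpoint directory

// State carried over from a previous run, used to seed the streaming state.
val initialState: RDD[(String, Int)] =
  ssc.sparkContext.parallelize(Seq(("a", 100), ("b", 32)))

// Hypothetical input: words arriving on a socket, counted per key.
val events = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))

def updateCount(values: Seq[Int], state: Option[Int]): Option[Int] =
  Some(values.sum + state.getOrElse(0))

val counts = events.updateStateByKey[Int](
  updateCount _,
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  initialState)

counts.print()
ssc.start()
ssc.awaitTermination()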

On Sun, Feb 22, 2015 at 5:21 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Sat, Feb 21, 2015 at 1:05 AM, craigv craigvanderbo...@gmail.com
 wrote:
  /Might it be possible to perform large batches processing on HDFS time
  series data using Spark Streaming?/
 
  1.I understand that there is not currently an InputDStream that could do
  what's needed.  I would have to create such a thing.
  2. Time is a problem.  I would have to use the timestamps on our events
 for
  any time-based logic and state management
  3. The batch duration would become meaningless in this scenario.
 Could I
  just set it to something really small (say 1 second) and then let it
 fall
  behind, processing the data as quickly as it could?


 So, if it is not an issue for you if everything is processed in one batch,
 you can use streamingContext.textFileStream(hdfsDirectory). This will
 create a DStream that has a huge RDD with all data in the first batch and
 then empty batches afterwards. You can have small batch size, should not be
 a problem.
 An alternative would be to write some code that creates one RDD per file
 in your HDFS directory, create a Queue of those RDDs and then use
 streamingContext.queueStream(), possibly with the oneAtATime=true parameter
 (which will process only one RDD per batch).

 However, to do window computations etc with the timestamps embedded *in*
 your data will be a major effort, as in: You cannot use the existing
 windowing functionality from Spark Streaming. If you want to read more
 about that, there have been a number of discussions about that topic on
 this list; maybe you can look them up.

 Tobias