Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError
Try adding all the jars in your $HIVE/lib directory. If you want the specific jar, you could look for jackson or the json serde in it. Thanks Best Regards

On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist tsind...@gmail.com wrote: I have a feeling I'm missing a Jar that provides the support, or this may be related to https://issues.apache.org/jira/browse/SPARK-5792. If it is a Jar, where would I find it? I would have thought in the $HIVE/lib folder, but I am not sure which jar contains it. Error:

Create Metric Temporary Table for querying
15/04/01 14:41:44 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/04/01 14:41:44 INFO ObjectStore: ObjectStore, initialize called
15/04/01 14:41:45 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/04/01 14:41:45 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/04/01 14:41:45 INFO BlockManager: Removing broadcast 0
15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0
15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0 of size 1272 dropped from memory (free 278018571)
15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0_piece0
15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0_piece0 of size 869 dropped from memory (free 278019440)
15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63230 in memory (size: 869.0 B, free: 265.1 MB)
15/04/01 14:41:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63278 in memory (size: 869.0 B, free: 530.0 MB)
15/04/01 14:41:45 INFO ContextCleaner: Cleaned broadcast 0
15/04/01 14:41:46 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
15/04/01 14:41:46 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table.
15/04/01 14:41:46 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table.
15/04/01 14:41:47 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table.
15/04/01 14:41:47 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table.
15/04/01 14:41:47 INFO Query: Reading in results for query org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is closing
15/04/01 14:41:47 INFO ObjectStore: Initialized ObjectStore
15/04/01 14:41:47 INFO HiveMetaStore: Added admin role in metastore
15/04/01 14:41:47 INFO HiveMetaStore: Added public role in metastore
15/04/01 14:41:48 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/04/01 14:41:48 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/04/01 14:41:49 INFO ParseDriver: Parsing command: SELECT path, name, value, v1.peValue, v1.peName FROM metric lateral view json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
15/04/01 14:41:49 INFO ParseDriver: Parse Completed
Exception in thread "main" java.lang.ClassNotFoundException: json_tuple
        at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:141)
        at org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:261)
        at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:261)
        at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector$lzycompute(hiveUdfs.scala:267)
        at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector(hiveUdfs.scala:267)
        at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:272)
        at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes(hiveUdfs.scala:272)
        at org.apache.spark.sql.hive.HiveGenericUdtf.makeOutput(hiveUdfs.scala:278)
        at org.apache.spark.sql.catalyst.expressions.Generator.output(generators.scala:60)
        at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50)
        at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50)
        at scala.Option.map(Option.scala:145)
        at
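For what it's worth, json_tuple is implemented by Hive's GenericUDTFJSONTuple, which ships in the hive-exec jar, so a minimal sketch of the suggested fix looks like the following (the jar path and version are assumptions for illustration; since the lookup above fails on the driver side, the jar may also need to go on the driver classpath, e.g. via --driver-class-path):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// substitute the hive-exec jar from your own $HIVE/lib here
val conf = new SparkConf()
  .setAppName("json-tuple-check")
  .setJars(Seq("/usr/lib/hive/lib/hive-exec-0.13.1.jar"))
val sc = new SparkContext(conf)
val hc = new HiveContext(sc)

hc.sql("""SELECT path, name, value, v1.peName, v1.peValue
          FROM metric
          LATERAL VIEW json_tuple(pathElements, 'name', 'value') v1 AS peName, peValue""")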
Re: StackOverflow Problem with 1.3 mllib ALS
I think before 1.3 you also get stackoverflow problem in ~35 iterations. In 1.3.x, please use setCheckpointInterval to solve this problem, which is available in the current master and 1.3.1 (to be released soon). Btw, do you find 80 iterations are needed for convergence? -Xiangrui On Wed, Apr 1, 2015 at 11:54 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have been using Mllib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and I encountered stackoverflow problem. After some digging, I realized that when the iteration ~35, I will get overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there any change to the ALS algorithm? And are there any ways to achieve more iterations? Thanks. Justin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
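A minimal sketch of the suggested fix (data loading and parameter values are illustrative assumptions); note that setCheckpointInterval only takes effect once a checkpoint directory is set on the SparkContext:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  // required for checkpointing

val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(u, p, r) = line.split(",")
  Rating(u.toInt, p.toInt, r.toDouble)
}

val model = new ALS()
  .setRank(10)
  .setIterations(80)          // deep iteration counts are safe once checkpointing kicks in
  .setCheckpointInterval(10)  // truncate the lineage every 10 iterations (1.3.1+)
  .run(ratings)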
Re: Spark, snappy and HDFS
Yes, any Hadoop-related process that asks for Snappy compression or needs to read it will have to have the Snappy libs available on the library path. That's usually set up for you in a distro, or you can do it manually like this. This is not Spark-specific. The second question also isn't Spark-specific; you do not have a SequenceFile of byte[] / String, but of byte[] / byte[]. Review what you are writing, since it is not BytesWritable / Text.

On Thu, Apr 2, 2015 at 3:40 AM, Nick Travers n.e.trav...@gmail.com wrote: I'm actually running this in a separate environment to our HDFS cluster. I think I've been able to sort out the issue by copying /opt/cloudera/parcels/CDH/lib to the machine I'm running this on (I'm just using a one-worker setup at present) and adding the following to spark-env.sh:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar

I can get past the previous error. The issue now seems to be with what is being returned.

import org.apache.hadoop.io._
val hdfsPath = "hdfs://nost.name/path/to/folder"
val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
file.count()

returns the following error: java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text

On Wed, Apr 1, 2015 at 7:34 PM, Xianjin YE advance...@gmail.com wrote: Do you have the same hadoop config for all nodes in your cluster (you run it in a cluster, right?)? Check the node (usually the executor) which gives the java.lang.UnsatisfiedLinkError to see whether the libsnappy.so is in the hadoop native lib path.

On Thursday, April 2, 2015 at 10:22 AM, Nick Travers wrote: Thanks for the super quick response! I can read the file just fine in hadoop; it's just when I point Spark at this file that it can't seem to read it, due to the missing snappy jars / .so's. I'm playing around with adding some things to the spark-env.sh file, but still nothing.

On Wed, Apr 1, 2015 at 7:19 PM, Xianjin YE advance...@gmail.com wrote: Can you read snappy compressed files in hdfs? Looks like the libsnappy.so is not in the hadoop native lib path.

On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote: Has anyone else encountered the following error when trying to read a snappy compressed sequence file from HDFS? *java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z* The following works for me when the file is uncompressed:

import org.apache.hadoop.io._
val hdfsPath = "hdfs://nost.name/path/to/folder"
val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
file.count()

but fails when the encoding is Snappy. I've seen some stuff floating around on the web about having to explicitly enable support for Snappy in Spark, but it doesn't seem to work for me: http://www.ericlin.me/enabling-snappy-support-for-sharkspark

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-snappy-and-HDFS-tp22349.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
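A minimal sketch of the fix Sean describes, reading both sides as BytesWritable and converting explicitly afterwards (the UTF-8 decode of the value bytes is an assumption about the payload):

import org.apache.hadoop.io.BytesWritable

val hdfsPath = "hdfs://nost.name/path/to/folder"
val file = sc.sequenceFile[BytesWritable, BytesWritable](hdfsPath)
// copyBytes returns a copy trimmed to the writable's actual length,
// which also sidesteps Hadoop's Writable object reuse
val decoded = file.map { case (k, v) => (k.copyBytes, new String(v.copyBytes, "UTF-8")) }
decoded.count()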
How to learn Spark ?
Hi all, I am new here. Could you give me some suggestions on how to learn Spark? Thanks. Best Regards, Star Guo
Support for Data flow graphs and not DAG only
Hey, I didn't find any documentation regarding support for cycles in a Spark topology, although Storm supports this using manual configuration in the acker function logic (setting it to a particular count). By cycles I don't mean infinite loops. -- Thanks Regards, Anshu Shukla
Re: StackOverflow Problem with 1.3 mllib ALS
Thanks Xiangrui, I used 80 iterations to demonstrates the marginal diminishing return in prediction quality :) Justin On Apr 2, 2015 00:16, Xiangrui Meng men...@gmail.com wrote: I think before 1.3 you also get stackoverflow problem in ~35 iterations. In 1.3.x, please use setCheckpointInterval to solve this problem, which is available in the current master and 1.3.1 (to be released soon). Btw, do you find 80 iterations are needed for convergence? -Xiangrui On Wed, Apr 1, 2015 at 11:54 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have been using Mllib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and I encountered stackoverflow problem. After some digging, I realized that when the iteration ~35, I will get overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there any change to the ALS algorithm? And are there any ways to achieve more iterations? Thanks. Justin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
how to find near duplicate items from given dataset using spark
Hi All, I want to find near duplicate items from a given dataset. For e.g. consider the data set:

1. Cricket,bat,ball,stumps
2. Cricket,bowler,ball,stumps
3. Football,goalie,midfielder,goal
4. Football,refree,midfielder,goal

Here 1 and 2 are near duplicates (only field 2 is different) and 3 and 4 are near duplicates (only field 2 is different). This is what I did: I created an Article class and implemented the equals and hashCode methods (my hashCode method returns a constant (1) for all objects), and in Spark I am using Article as a key, doing a group by on the article. Is this approach correct, or is there a better approach? This is how my code looks.

Article class:

public class Article implements Serializable {
    private static final long serialVersionUID = 1L;
    private String first;
    private String second;
    private String third;
    private String fourth;

    public Article() {
        set("", "", "", "");
    }

    public Article(String first, String second, String third, String fourth) {
        // super();
        set(first, second, third, fourth);
    }

    @Override
    public int hashCode() {
        int result = 1;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj == null) return false;
        if (getClass() != obj.getClass()) return false;
        Article other = (Article) obj;
        if ((first.equals(other.first) || second.equals(other.second)
                || third.equals(other.third) || fourth.equals(other.fourth))) {
            return true;
        } else {
            return false;
        }
    }

    private void set(String first, String second, String third, String fourth) {
        this.first = first;
        this.second = second;
        this.third = third;
        this.fourth = fourth;
    }
}

Spark code:

public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    JavaRDD<String> lines = ctx.textFile("data1/*");
    JavaRDD<Article> articles = lines.map(new Function<String, Article>() {
        private static final long serialVersionUID = 1L;

        public Article call(String line) throws Exception {
            String[] words = line.split(",");
            // System.out.println(line);
            Article article = new Article(words[0], words[1], words[2], words[3]);
            return article;
        }
    });
    JavaPairRDD<Article, String> articlePair = lines
            .mapToPair(new PairFunction<String, Article, String>() {
                public Tuple2<Article, String> call(String line) throws Exception {
                    String[] words = line.split(",");
                    // System.out.println(line);
                    Article article = new Article(words[0], words[1], words[2], words[3]);
                    return new Tuple2<Article, String>(article, line);
                }
            });
    JavaPairRDD<Article, Iterable<String>> articlePairs = articlePair.groupByKey();
    Map<Article, Iterable<String>> dupArticles = articlePairs.collectAsMap();
    System.out.println("size {}" + dupArticles.size());
    Set<Article> uniqueArticle = dupArticles.keySet();
    for (Article article : uniqueArticle) {
        Iterable<String> temps = dupArticles.get(article);
        System.out.println("keys " + article);
        for (String string : temps) {
            System.out.println(string);
        }
        System.out.println("==");
    }
    ctx.close();
    ctx.stop();
}
Re: pyspark hbase range scan
Hi, Maybe this might be helpful: https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala Cheers Gen

On Thu, Apr 2, 2015 at 1:50 AM, Eric Kimbrel eric.kimb...@soteradefense.com wrote: I am attempting to read an hbase table in pyspark with a range scan.

conf = {
    "hbase.zookeeper.quorum": host,
    "hbase.mapreduce.inputtable": table,
    "hbase.mapreduce.scan": scan
}
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)

If i jump over to scala or java and generate a base64 encoded protobuf scan object and convert it to a string, i can use that value for hbase.mapreduce.scan and everything works: the rdd will correctly perform the range scan and I am happy. The problem is that I can not find any reasonable way to generate that range scan string in python. The scala code required is:

import org.apache.hadoop.hbase.util.Base64
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.client.{Delete, HBaseAdmin, HTable, Put, Result => HBaseResult, Scan}

val scan = new Scan()
scan.setStartRow("test_domain\0email".getBytes)
scan.setStopRow("test_domain\0email~".getBytes)
def scanToString(scan: Scan): String = {
  Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray())
}
scanToString(scan)

Is there another way to perform an hbase range scan from pyspark, or is that functionality something that might be supported in the future?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-hbase-range-scan-tp22348.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Streaming anomaly detection using ARIMA
This inside out parallelization has been a way people have used R with MapReduce for a long time. Run N copies of an R script on the cluster, on different subsets of the data, babysat by Mappers. You just need R installed on the cluster. Hadoop Streaming makes this easy, and things like RDD.pipe in Spark make it easier. So it may be just that simple, and so there's not much to say about it. I haven't tried this with Spark Streaming but imagine it would also work. Have you tried this? Within a window you would probably take the first x% as training and the rest as test. I don't think there's a question of looking across windows.

On Thu, Apr 2, 2015 at 12:31 AM, Corey Nolet cjno...@gmail.com wrote: Surprised I haven't gotten any responses about this. Has anyone tried using rJava or FastR w/ Spark? I've seen the SparkR project, but that goes the other way: what I'd like to do is use R for model calculation and Spark to distribute the load across the cluster. Also, has anyone used Scalation for ARIMA models?

On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet cjno...@gmail.com wrote: Taking out the complexity of the ARIMA models to simplify things: I can't seem to find a good way to represent even standard moving averages in spark streaming. Perhaps it's my ignorance with the micro-batched style of the DStreams API.

On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet cjno...@gmail.com wrote: I want to use ARIMA for a predictive model so that I can take time series data (metrics) and perform a light anomaly detection. The time series data is going to be bucketed to different time units (several minutes within several hours, several hours within several days, several days within several years). I want to do the algorithm in Spark Streaming. I'm used to tuple-at-a-time streaming and I'm having a tad bit of trouble gaining insight into how exactly the windows are managed inside of DStreams. Let's say I have a simple dataset that is marked by a key/value tuple where the key is the name of the component whose metrics I want to run the algorithm against and the value is a metric (a value representing a sum for the time bucket). I want to create histograms of the time series data for each key in the windows in which they reside, so I can use that histogram vector to generate my ARIMA prediction (actually, it seems like this doesn't just apply to ARIMA but could apply to any sliding average). I *think* my prediction code may look something like this:

val predictionAverages = dstream
  .groupByKeyAndWindow(60*60*24, 60*60*24)
  .mapValues(applyARIMAFunction)

That is, keep 24 hours worth of metrics in each window and use that for the ARIMA prediction. The part I'm struggling with is how to join together the actual values so that I can do my comparison against the prediction model. Let's say dstream contains the actual values. For any time window, I should be able to take a previous set of windows and use the model to compare against the current values.

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
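To make the RDD.pipe route concrete, here is a minimal sketch (the script path, record layout, and output format are assumptions; the R script must exist on every worker and speak line-oriented stdin/stdout):

// each partition is streamed through an external R process, one record per line
val metrics = sc.textFile("hdfs:///metrics/2015-04-01")
// arima.R reads series from stdin and writes one forecast per series to stdout
val forecasts = metrics.pipe("Rscript /opt/jobs/arima.R")
forecasts.saveAsTextFile("hdfs:///forecasts/2015-04-01")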
JAVA_HOME problem
spark 1.3.0

spark@pc-zjqdyyn1:~ tail /etc/profile
export JAVA_HOME=/usr/jdk64/jdk1.7.0_45
export PATH=$PATH:$JAVA_HOME/bin
#
# End of /etc/profile
#

But the ERROR LOG shows:

Container: container_1427449644855_0092_02_01 on pc-zjqdyy04_45454
LogType: stderr
LogLength: 61
Log Contents:
/bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory

LogType: stdout
LogLength: 0
Log Contents:
Re: Spark throws rsync: change_dir errors on startup
Hi, Verbose output showed no additional information about the origin of the error:

rsync from right
sending incremental file list
sent 20 bytes received 12 bytes 64.00 bytes/sec
total size is 0 speedup is 0.00
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark130/sbin/../logs/spark-huser-org.apache.spark.deploy.master.Master-1-cl-pc6.out
left: rsync from right
left: rsync: change_dir /usr/local/spark130//right failed: No such file or directory (2)
left: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1183) [sender=3.1.0]
left: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark130/sbin/../logs/spark-huser-org.apache.spark.deploy.worker.Worker-1-cl-pc5.out
right: rsync from right
right: sending incremental file list
right: rsync: change_dir /usr/local/spark130//right failed: No such file or directory (2)
right:
right: sent 20 bytes received 12 bytes 64.00 bytes/sec
right: total size is 0 speedup is 0.00
right: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1183) [sender=3.1.0]
right: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark130/sbin/../logs/spark-huser-org.apache.spark.deploy.worker.Worker-1-cl-pc6.out

I also edited the script to remove the additional slash, but this did not help either. The workers are basically started by the script; it is just this error message that is thrown. Now, funny thing: I was so brave as to create the folder //right that Spark is desperately looking for. Guess what, this caused a complete wipe of my local Spark installation. /usr/local/spark130 was cleaned completely except for the logs folder. Any suggestions what is happening here?

From: Akhil Das ak...@sigmoidanalytics.com Date: Thursday, 2 April 2015 07:51 To: Tobias Horsmann tobias.horsm...@uni-due.de Cc: user@spark.apache.org Subject: Re: Spark throws rsync: change_dir errors on startup

Error 23 is defined as a partial transfer and might be caused by filesystem incompatibilities, such as different character sets or access control lists. In this case it could be caused by the double slashes (// at the end of sbin). You could try editing your sbin/spark-daemon.sh file, look for rsync inside the file, and add -v along with that command to see what exactly is going wrong. Thanks Best Regards

On Wed, Apr 1, 2015 at 7:25 PM, Horsmann, Tobias tobias.horsm...@uni-due.de wrote: Hi, I am trying to set up a minimal 2-node spark cluster for testing purposes. When I start the cluster with start-all.sh I get an rsync error message:

rsync: change_dir /usr/local/spark130/sbin//right failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1183) [sender=3.1.0]

(For clarification, my 2 nodes are called 'right' and 'left', referring to the physical machines standing in front of me.) It seems that a file named after my master node 'right' is expected to exist, and the synchronisation with it fails as it does not exist. I don't understand what Spark is trying to do here. Why does it expect this file to exist, and what content should it have? I assume I did something wrong in my configuration setup. Can someone interpret this error message and has an idea where this error is coming from? Regards, Tobias
Re: StackOverflow Problem with 1.3 mllib ALS
Fair enough but I'd say you hit that diminishing return after 20 iterations or so... :) On Thu, Apr 2, 2015 at 9:39 AM, Justin Yip yipjus...@gmail.com wrote: Thanks Xiangrui, I used 80 iterations to demonstrates the marginal diminishing return in prediction quality :) Justin On Apr 2, 2015 00:16, Xiangrui Meng men...@gmail.com wrote: I think before 1.3 you also get stackoverflow problem in ~35 iterations. In 1.3.x, please use setCheckpointInterval to solve this problem, which is available in the current master and 1.3.1 (to be released soon). Btw, do you find 80 iterations are needed for convergence? -Xiangrui On Wed, Apr 1, 2015 at 11:54 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have been using Mllib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and I encountered stackoverflow problem. After some digging, I realized that when the iteration ~35, I will get overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there any change to the ALS algorithm? And are there any ways to achieve more iterations? Thanks. Justin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Starting httpd: http: Syntax error on line 154
I’m unable to access ganglia. It looks like it is due to the web server not starting, as I receive this error when I launch spark:

Starting httpd: http: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so

This occurs when I’m using the vanilla script. I’ve also tried modifying my spark-ec2 script in various ways in an effort to correct this problem, including using different instance types and modifying the instance virtualization types. Thanks for any help!
StackOverflow Problem with 1.3 mllib ALS
Hello, I have been using Mllib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and I encountered stackoverflow problem. After some digging, I realized that when the iteration ~35, I will get overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there any change to the ALS algorithm? And are there any ways to achieve more iterations? Thanks. Justin
Setup Spark jobserver for Spark SQL
Hi, I am trying to use Spark Jobserver (https://github.com/spark-jobserver/spark-jobserver) for running Spark SQL jobs. I was able to start the server, but when I run my application (my Scala class which extends SparkSqlJob), I am getting the following as response:

{
  "status": "ERROR",
  "result": "Invalid job type for this context"
}

Can anyone suggest what is going wrong, or provide a detailed procedure for setting up jobserver for SparkSQL?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setup-Spark-jobserver-for-Spark-SQL-tp22352.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame
I reproduced the bug on master and submitted a patch for it: https://github.com/apache/spark/pull/5329. It may get into Spark 1.3.1. Thanks for reporting the bug! -Xiangrui

On Wed, Apr 1, 2015 at 12:57 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hmm, I got the same error with the master. Here is another test example that fails. Here, I explicitly create a Row RDD, which corresponds to the use case I am in:

object TestDataFrame {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestDataFrame").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val data = Seq(LabeledPoint(1, Vectors.zeros(10)))
    val dataDF = sc.parallelize(data).toDF
    dataDF.printSchema()
    dataDF.save("test1.parquet") // OK
    val dataRow = data.map { case LabeledPoint(l: Double, f: mllib.linalg.Vector) => Row(l, f) }
    val dataRowRDD = sc.parallelize(dataRow)
    val dataDF2 = sqlContext.createDataFrame(dataRowRDD, dataDF.schema)
    dataDF2.printSchema()
    dataDF2.saveAsParquetFile("test3.parquet") // FAIL !!!
  }
}

On Tue, Mar 31, 2015 at 11:18 PM, Xiangrui Meng men...@gmail.com wrote: I cannot reproduce this error on master, but I'm not aware of any recent bug fixes that are related. Could you build and try the current master? -Xiangrui

On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, a DataFrame with a user defined type (here mllib.Vector) created with sqlContext.createDataFrame can't be saved to a parquet file, and raises a ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row error. Here is an example of code to reproduce this error:

object TestDataFrame {
  def main(args: Array[String]): Unit = {
    //System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
    val conf = new SparkConf().setAppName("RankingEval").setMaster("local[8]")
      .set("spark.executor.memory", "6g")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
    val dataDF = data.toDF
    dataDF.save("test1.parquet")
    val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)
    dataDF2.save("test2.parquet")
  }
}

Is this related to https://issues.apache.org/jira/browse/SPARK-5532, and how can it be solved? Cheers, Jao

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: HiveContext setConf seems not stable
Hi, Jira created: https://issues.apache.org/jira/browse/SPARK-6675 Thank you.

On Wed, Apr 1, 2015 at 7:50 PM, Michael Armbrust mich...@databricks.com wrote: Can you open a JIRA please?

On Wed, Apr 1, 2015 at 9:38 AM, Hao Ren inv...@gmail.com wrote: Hi, I find HiveContext.setConf does not work correctly. Here are some code snippets showing the problem:

snippet 1:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
  val conf = new SparkConf()
    .setAppName("context-test")
    .setMaster("local[8]")
  val sc = new SparkContext(conf)
  val hc = new HiveContext(sc)
  hc.setConf("spark.sql.shuffle.partitions", "10")
  hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
  hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
  hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
}

Results:

(hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)
(spark.sql.shuffle.partitions,10)

snippet 2:

...
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
hc.setConf("spark.sql.shuffle.partitions", "10")
hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
...

Results:

(hive.metastore.warehouse.dir,/user/hive/warehouse)
(spark.sql.shuffle.partitions,10)

You can see that I just permuted the two setConf calls, and that leads to two different Hive configurations. It seems that HiveContext cannot set a new value on the hive.metastore.warehouse.dir key in one or the first setConf call. You need another setConf call before changing hive.metastore.warehouse.dir. For example, set hive.metastore.warehouse.dir twice and you get the result of snippet 1:

snippet 3:

...
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
...

Results:

(hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)

You can reproduce this if you move to the latest branch-1.3 (1.3.1-snapshot, htag = 7d029cb1eb6f1df1bce1a3f5784fb7ce2f981a33). I have also tested the released 1.3.0 (htag = 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc). It has the same problem. Please tell me if I am missing something. Any help is highly appreciated.

-- Hao Ren {Data, Software} Engineer @ ClaraVista Paris, France
Re: From DataFrame to LabeledPoint
Peter's suggestion sounds good, but watch out for the match case, since I believe you'll have to match on: case (Row(feature1, feature2, ...), Row(label)) =>

On Thu, Apr 2, 2015 at 7:57 AM, Peter Rudenko petro.rude...@gmail.com wrote: Hi, try next code:

val labeledPoints: RDD[LabeledPoint] = features.zip(labels).map {
  case Row(feature1, feature2, ..., label) =>
    LabeledPoint(label, Vectors.dense(feature1, feature2, ...))
}

Thanks, Peter Rudenko

On 2015-04-02 17:17, drarse wrote: Hello! I have a question from days ago. I am working with DataFrame and with Spark SQL. I imported a jsonFile:

val df = sqlContext.jsonFile("file.json")

In this json I have the label and the features. I selected them:

val features = df.select("feature1", "feature2", "feature3", ...)
val labels = df.select("classification")

But now I don't know how to create a LabeledPoint for RandomForest. I tried some solutions without success. Can you help me? Thanks for all!

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/From-DataFrame-to-LabeledPoint-tp22354.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
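Putting the two suggestions together, a minimal end-to-end sketch (column names and the all-Double typing are assumptions; jsonFile may infer bigint/Long for integral fields, so adjust the pattern's types to the printed schema). Selecting the label and features in one pass avoids the zip entirely:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val df = sqlContext.jsonFile("file.json")
val labeledPoints: RDD[LabeledPoint] =
  df.select("classification", "feature1", "feature2", "feature3").map {
    case Row(label: Double, f1: Double, f2: Double, f3: Double) =>
      LabeledPoint(label, Vectors.dense(f1, f2, f3))
  }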
Re: Generating a schema in Spark 1.3 failed while using DataTypes.
Do you have a full stack trace?

On Thu, Apr 2, 2015 at 11:45 AM, ogoh oke...@gmail.com wrote: Hello, My ETL uses sparksql to generate parquet files which are served through Thriftserver using hive ql. It especially defines a schema programmatically, since the schema can only be known at runtime. With spark 1.2.1, it worked fine (I followed https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema). I am trying to migrate to spark 1.3.0, but the APIs are confusing. I am not sure if the example of https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema is still valid on Spark 1.3.0. For example, DataType.StringType is not there any more; instead, I found DataTypes.StringType etc. So, I migrated as below and it builds fine. But at runtime, it throws an exception. I appreciate any help. Thanks, Okehee

== Exception thrown:

java.lang.reflect.InvocationTargetException
java.lang.NoSuchMethodError: scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String;

== my code's snippet:

import org.apache.spark.sql.types.DataTypes;
DataTypes.createStructField("property", DataTypes.IntegerType, true)

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Generating-a-schema-in-Spark-1-3-failed-while-using-DataTypes-tp22362.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: persist(MEMORY_ONLY) takes lot of time
+1. Caching is way too slow.

On Wed, Apr 1, 2015 at 12:33 PM, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi Experts, I have a parquet dataset of 550 MB (9 blocks) in HDFS. I want to run SQL queries repetitively. Few questions:

1. When I do the below (persist to memory after reading from disk), it takes a lot of time to persist to memory. Any suggestions on how to tune this?

val inputP = sqlContext.parquetFile("some HDFS path")
inputP.registerTempTable("sample_table")
inputP.persist(MEMORY_ONLY)
val result = sqlContext.sql("some sql query")
result.count

Note: Once the data is persisted to memory, it takes a fraction of a second to return query results from the second query onwards. So my concern is how to reduce the time when the data is first loaded to cache.

2. I have observed that if I omit the line inputP.persist(MEMORY_ONLY), the first query execution is comparatively quick (say it takes 1 min), as the load-to-memory time is saved, but to my surprise the second time I run the same query it takes 30 sec, as inputP is not reconstructed from disk (checked from UI). So my question is: does Spark use some kind of internal caching for inputP in this scenario?

Thanks in advance. Regards, Sam

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/persist-MEMORY-ONLY-takes-lot-of-time-tp22343.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

-- Christian Perez Silicon Valley Data Science Data Analyst christ...@svds.com @cp_phd

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
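Not from the thread, but a variant worth trying on the first question: Spark SQL's columnar in-memory cache via cacheTable, which compresses columns and is usually a better match for repeated SQL than RDD-level persist on the row objects. A minimal sketch (paths and query are placeholders):

val inputP = sqlContext.parquetFile("some HDFS path")
inputP.registerTempTable("sample_table")
sqlContext.cacheTable("sample_table")   // lazy: materialized by the first query
val result = sqlContext.sql("some sql query")
result.count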
Re: Data locality across jobs
This isn't currently a capability that Spark has, though it has definitely been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The primary obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that each file corresponds to a single split, so the records corresponding to a particular partition at the end of the first job can end up split across multiple partitions in the second job. -Sandy On Wed, Apr 1, 2015 at 9:09 PM, kjsingh kanwaljit.si...@guavus.com wrote: Hi, We are running an hourly job using Spark 1.2 on Yarn. It saves an RDD of Tuple2. At the end of day, a daily job is launched, which works on the outputs of the hourly jobs. For data locality and speed, we wish that when the daily job launches, it finds all instances of a given key at a single executor rather than fetching it from others during shuffle. Is it possible to maintain key partitioning across jobs? We can control partitioning in one job. But how do we send keys to the executors of same node manager across jobs? And while saving data to HDFS, are the blocks allocated to the same data node machine as the executor for a partition? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Data-locality-across-jobs-tp22351.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Reading a large file (binary) into RDD
Hm, that will indeed be trickier, because this method assumes records are the same byte size. Is the file an arbitrary sequence of mixed types, or is there structure, e.g. short, long, short, long, etc.? If you could post a gist with an example of the kind of file and how it should look once read in, that would be useful!

- jeremyfreeman.net @thefreemanlab

On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here? My current method happens to have a large overhead (much more than actual computation time). Also, I am short of memory at the driver when it has to read the entire file.

On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: If it's a flat binary file and each record is the same length (in bytes), you can use Spark's binaryRecords method (defined on the SparkContext), which loads records from one or more large flat binary files into an RDD. Here's an example in python to show how it works:

# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]

Does that help?

- jeremyfreeman.net @thefreemanlab

On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: What are some efficient ways to read a large file into RDDs? For example, have several executors read a specific/unique portion of the file and construct RDDs. Is this possible to do in Spark? Currently, I am doing a line-by-line read of the file at the driver and constructing the RDD.
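For the structured mixed-type case, binaryRecords still applies as long as every record has the same total byte size. A hedged Scala sketch, assuming a layout of one short followed by one long per record (layout, endianness, and path are assumptions for illustration):

import java.nio.{ByteBuffer, ByteOrder}

val recordSize = 2 + 8   // bytes per record: one short + one long
val records = sc.binaryRecords("hdfs:///data/mixed.bin", recordSize)
val parsed = records.map { bytes =>
  // each element is one record's raw bytes; decode the fields in order
  val bb = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN)
  (bb.getShort, bb.getLong)
}
parsed.first()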
RE: Date and decimal datatype not working
Thanks all. Finally I am able to run my code successfully. It is running in Spark 1.2.1. I will try it on Spark 1.3 too. The major cause of all the errors I faced was that the delimiter was not correctly declared: String.split takes a regular expression, and an unescaped | is regex alternation, so it must be escaped. I was using:

val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p => ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))

Now I am using the following, and that solved most of the issues:

val Delimeter = "\\|"
val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split(Delimeter)).map(p => ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))

Thanks again. My first code ran successfully, giving me some confidence; now I will explore more. Regards, Ananda

From: BASAK, ANANDA Sent: Thursday, March 26, 2015 4:55 PM To: Dean Wampler Cc: Yin Huai; user@spark.apache.org Subject: RE: Date and decimal datatype not working

Thanks all. I am installing Spark 1.3 now. Thought that I should better sync with the daily evolution of this new technology. So once I install that, I will try to use the Spark-CSV library. Regards, Ananda

From: Dean Wampler [mailto:deanwamp...@gmail.com] Sent: Wednesday, March 25, 2015 1:17 PM To: BASAK, ANANDA Cc: Yin Huai; user@spark.apache.org Subject: Re: Date and decimal datatype not working

Recall that the input isn't actually read until you do something that forces evaluation, like call saveAsTextFile. You didn't show the whole stack trace here, but it probably occurred while parsing an input line where one of your long fields is actually an empty string. Because this is such a common problem, I usually define a parse method that converts input text to the desired schema. It catches parse exceptions like this and reports the bad line at least. If you can return a default long in this case, say 0, that makes it easier to return something. dean

Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (http://shop.oreilly.com/product/0636920033073.do) (O'Reilly) Typesafe (http://typesafe.com) @deanwampler (http://twitter.com/deanwampler) http://polyglotprogramming.com

On Wed, Mar 25, 2015 at 11:48 AM, BASAK, ANANDA ab9...@att.com wrote: Thanks. This library is only available with Spark 1.3. I am using version 1.2.1. Before I upgrade to 1.3, I want to try what can be done in 1.2.1. So I am using the following:

val MyDataset = sqlContext.sql("my select query")
MyDataset.map(t => t(0)+"|"+t(1)+"|"+t(2)+"|"+t(3)+"|"+t(4)+"|"+t(5)).saveAsTextFile("/my_destination_path")

But it is giving the following error:

15/03/24 17:05:51 ERROR Executor: Exception in task 1.0 in stage 13.0 (TID 106)
java.lang.NumberFormatException: For input string: ""
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Long.parseLong(Long.java:453)
        at java.lang.Long.parseLong(Long.java:483)
        at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:230)

Is there something wrong with the TSTAMP field, which is Long datatype? Thanks Regards --- Ananda Basak

From: Yin Huai [mailto:yh...@databricks.com] Sent: Monday, March 23, 2015 8:55 PM To: BASAK, ANANDA Cc: user@spark.apache.org Subject: Re: Date and decimal datatype not working

To store to a csv file, you can use the Spark-CSV library (https://github.com/databricks/spark-csv).

On Mon, Mar 23, 2015 at 5:35 PM, BASAK, ANANDA ab9...@att.com wrote: Thanks. This worked well as per your suggestions.
I had to run the following:

val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p => ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3), BigDecimal(p(4)), BigDecimal(p(5)), BigDecimal(p(6))))

Now I am stuck at another step. I have run a SQL query where I am selecting all the fields with some where clause, with TSTAMP filtered by a date range and an order by TSTAMP clause. That is running fine. Then I am trying to store the output in a CSV file. I am using the saveAsTextFile("filename") function. But it is giving an error. Can you please help me to write a proper syntax to store output in a CSV file?

Thanks Regards --- Ananda Basak

From: BASAK, ANANDA Sent: Tuesday, March 17, 2015 3:08 PM To: Yin Huai Cc: user@spark.apache.org Subject: RE: Date and decimal datatype not working

Ok, thanks for the suggestions. Let me try and will confirm all. Regards, Ananda

From: Yin Huai [mailto:yh...@databricks.com] Sent: Tuesday, March 17, 2015 3:04 PM To: BASAK, ANANDA Cc: user@spark.apache.org Subject: Re: Date and decimal datatype not working

p(0) is a String. So, you need to explicitly convert it to a Long, e.g. p(0).trim.toLong. You also need to do it for p(2). For those BigDecimal values, you need to create BigDecimal objects from your String values.

On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA
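As an aside on the CSV-writing step, not from the thread: since each query result is a Row, and Row exposes a Seq-like mkString in these versions, the manual +"|"+ chaining (and its per-column indexing) can be avoided. A minimal sketch (query and path are placeholders):

val MyDataset = sqlContext.sql("my select query")
// join every column with the delimiter, whatever the column count
MyDataset.map(_.mkString("|")).saveAsTextFile("/my_destination_path")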
Need a spark mllib tutorial
Hi, I am new to the spark MLLib and I was browsing through the internet for good tutorials advanced to the spark documentation example. But, I do not find any. Need help. Regards Phani Kumar
Re: Need a spark mllib tutorial
Here's one: https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html Reza On Thu, Apr 2, 2015 at 12:51 PM, Phani Yadavilli -X (pyadavil) pyada...@cisco.com wrote: Hi, I am new to the spark MLLib and I was browsing through the internet for good tutorials advanced to the spark documentation example. But, I do not find any. Need help. Regards Phani Kumar
Re: Submitting to a cluster behind a VPN, configuring different IP address
I was able to hack this on my similar setup by running (on the driver):

$ sudo hostname ip

where ip is the same value set in the spark.driver.host property. This isn't a solution I would use universally, and I hope someone can fix this bug in the distribution. Regards, Mike

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-to-a-cluster-behind-a-VPN-configuring-different-IP-address-tp9360p22363.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Mllib kmeans #iteration
Check out the Spark docs for that parameter, maxIterations: http://spark.apache.org/docs/latest/mllib-clustering.html#k-means

On Thu, Apr 2, 2015 at 4:42 AM, podioss grega...@hotmail.com wrote: Hello, I am running the Kmeans algorithm in cluster mode from MLlib, and I was wondering if I could run the algorithm with a fixed number of iterations in some way. Thanks

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-kmeans-iteration-tp22353.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
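A minimal sketch of using it (input path, k, and the bound are illustrative assumptions); note that k-means may converge and stop earlier, but it never runs past maxIterations:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs:///kmeans/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = new KMeans()
  .setK(8)
  .setMaxIterations(20)   // the fixed iteration cap the question asks about
  .run(points)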
Re: input size too large | Performance issues with Spark
To Akhil's point, see Tuning Data structures. Avoid standard collection hashmap. With fewer machines, try running 4 or 5 cores per executor and only 3-4 executors (1 per node): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/. Ought to reduce shuffle performance hit (someone else confirm?) #7 see default.shuffle.partitions (default: 200) On Sun, Mar 29, 2015 at 7:57 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Go through this once, if you haven't read it already. https://spark.apache.org/docs/latest/tuning.html Thanks Best Regards On Sat, Mar 28, 2015 at 7:33 PM, nsareen nsar...@gmail.com wrote: Hi All, I'm facing performance issues with spark implementation, and was briefly investigating on WebUI logs, i noticed that my RDD size is 55GB the Shuffle Write is 10 GB Input Size is 200GB. Application is a web application which does predictive analytics, so we keep most of our data in memory. This observation was only for 30mins usage of the application on a single user. We anticipate atleast 10-15 users of the application sending requests in parallel, which makes me a bit nervous. One constraint we have is that we do not have too many nodes in a cluster, we may end up with 3-4 machines at best, but they can be scaled up vertically each having 24 cores / 512 GB ram etc. which can allow us to make a virtual 10-15 node cluster. Even then the input size shuffle write is too high for my liking. Any suggestions in this regard will be greatly appreciated as there aren't much resource on the net for handling performance issues such as these. Some pointers on my application's data structures design 1) RDD is a JavaPairRDD with Key / Value as CustomPOJO containing 3-4 Hashmaps Value containing 1 Hashmap 2) Data is loaded via JDBCRDD during application startup, which also tends to take a lot of time, since we massage the data once it is fetched from DB and then save it as JavaPairRDD. 3) Most of the data is structured, but we are still using JavaPairRDD, have not explored the option of Spark SQL though. 4) We have only one SparkContext which caters to all the requests coming into the application from various users. 5) During a single user session user can send 3-4 parallel stages consisting of Map / Group By / Join / Reduce etc. 6) We have to change the RDD structure using different types of group by operations since the user can do drill down drill up of the data ( aggregation at a higher / lower level). This is where we make use of Groupby's but there is a cost associated with this. 7) We have observed, that the initial RDD's we create have 40 odd partitions, but post some stage executions like groupby's the partitions increase to 200 or so, this was odd, and we havn't figured out why this happens. In summary we wan to use Spark to provide us the capability to process our in-memory data structure very fast as well as scale to a larger volume when required in the future. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/input-size-too-large-Performance-issues-with-Spark-tp22270.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Christian Perez Silicon Valley Data Science Data Analyst christ...@svds.com @cp_phd - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
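A hedged sketch of what that executor shape might look like in configuration (every value here is an assumption to tune against the 24-core / 512 GB nodes described, not a prescription):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("predictive-analytics")
  .set("spark.executor.cores", "5")            // a few fat executors per node
  .set("spark.executor.memory", "24g")
  .set("spark.default.parallelism", "48")      // partitions for RDD shuffles
  .set("spark.sql.shuffle.partitions", "200")  // partitions for Spark SQL shuffles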
Re: Submitting to a cluster behind a VPN, configuring different IP address
yup a related JIRA is here https://issues.apache.org/jira/browse/SPARK-5113 which you might want to leave a comment in. This can be quite tricky we found ! but there are a host of env variable hacks you can use when launching spark masters/slaves. On Thu, Apr 2, 2015 at 5:18 PM, Michael Quinlan mq0...@gmail.com wrote: I was able to hack this on my similar setup issue by running (on the driver) $ sudo hostname ip Where ip is the same value set in the spark.driver.host property. This isn't a solution I would use universally and hope the someone can fix this bug in the distribution. Regards, Mike -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-to-a-cluster-behind-a-VPN-configuring-different-IP-address-tp9360p22363.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- jay vyas
RE: Spark SQL. Memory consumption
It is hard to say what could be the reason without more detailed information. If you provide some more information, maybe people here can help you better.

1) What is your worker's memory setting? It looks like your nodes have 128G physical memory each, but what do you specify for the worker's heap size? If you can paste your spark-env.sh and spark-defaults.conf content here, it will be helpful.
2) You are doing a join with 2 tables. 8G of parquet files is small, compared to the heap you gave. But is it for one table? 2 tables? Is the data compressed?
3) Your join key is different from your grouping keys, so my assumption is that this query should lead to 4 stages (I could be wrong, as I am kind of new to Spark SQL too). Is that right? If so, on what stage did the OOM happen? With this information, it can help us to better judge which part caused the OOM.
4) When you set spark.shuffle.partitions to 1024, did stages 3 and 4 really create 1024 tasks?
5) When the OOM happens, at least you can paste the stack trace of the OOM, as it will help people here to guess which part of Spark led to the OOM, and so give you better suggestions.

Thanks, Yong

Date: Thu, 2 Apr 2015 17:46:48 +0200 Subject: Spark SQL. Memory consumption From: masfwo...@gmail.com To: user@spark.apache.org

Hi. I'm using Spark SQL 1.2. I have this query:

CREATE TABLE test_MA STORED AS PARQUET AS
SELECT field1
,field2
,field3
,field4
,field5
,COUNT(1) AS field6
,MAX(field7)
,MIN(field8)
,SUM(field9 / 100)
,COUNT(field10)
,SUM(IF(field11 -500, 1, 0))
,MAX(field12)
,SUM(IF(field13 = 1, 1, 0))
,SUM(IF(field13 in (3,4,5,6,10,104,105,107), 1, 0))
,SUM(IF(field13 = 2012 , 1, 0))
,SUM(IF(field13 in (0,100,101,102,103,106), 1, 0))
FROM table1 CL
JOIN table2 netw ON CL.field15 = netw.id
WHERE AND field3 IS NOT NULL
AND field4 IS NOT NULL
AND field5 IS NOT NULL
GROUP BY field1, field2, field3, field4, netw.field5

spark-submit --master spark://master:7077 --driver-memory 20g --executor-memory 60g --class GMain project_2.10-1.0.jar --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*' 2> ./error

Input data is 8GB in parquet format. Many times it crashes with GC overhead. I've fixed spark.shuffle.partitions to 1024, but my worker nodes (with 128GB RAM/node) collapse. Is this query too difficult for Spark SQL? Would it be better to do it in plain Spark? Am I doing something wrong?

Thanks. -- Regards. Miguel Ángel
RE: [SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k, SUM('p.q));
Michael, thanks for the response, and looking forward to trying 1.3.1.

From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, April 03, 2015 6:52 AM To: Haopu Wang Cc: user Subject: Re: [SparkSQL 1.3.0] Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));

Thanks for reporting. The root cause is SPARK-5632 (https://issues.apache.org/jira/browse/SPARK-5632), which is actually pretty hard to fix. Fortunately, for this particular case there is an easy workaround: https://github.com/apache/spark/pull/5337 We can try to include this in 1.3.1.

On Thu, Apr 2, 2015 at 3:29 AM, Haopu Wang hw...@qilinsoft.com wrote: Hi, I want to rename an aggregation field using the DataFrame API. The aggregation is done on a nested field. But I got the below exception. Do you see the same issue, and is there any workaround? Thank you very much!

==

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));
        at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
        at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
        at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
        at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
        at org.apache.spark.sql.DataFrame$$anonfun$3.apply(DataFrame.scala:244)
        at org.apache.spark.sql.DataFrame$$anonfun$3.apply(DataFrame.scala:243)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at org.apache.spark.sql.DataFrame.toDF(DataFrame.scala:243)

==

And this code can be used to reproduce the issue:

case class ChildClass(q: Long)
case class ParentClass(k: String, p: ChildClass)

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("DFTest").setMaster("local[*]")
  val ctx = new SparkContext(conf)
  val sqlCtx = new HiveContext(ctx)
  import sqlCtx.implicits._
  val source = ctx.makeRDD(Seq(ParentClass("c1", ChildClass(100)))).toDF
  import org.apache.spark.sql.functions._
  val target = source.groupBy('k).agg('k, sum("p.q"))
  // This line prints the correct contents
  // k SUM('p.q)
  // c1 100
  target.show
  // But this line triggers the exception
  target.toDF("key", "total")
}

==
Re: Cannot run the example in the Spark 1.3.0 following the document
Hi there, you may need to add:

import sqlContext.implicits._

Best, Sun fightf...@163.com

From: java8964 Date: 2015-04-03 10:15 To: user@spark.apache.org Subject: Cannot run the example in the Spark 1.3.0 following the document

I tried to check out Spark SQL 1.3.0. I installed it and followed the online document here: http://spark.apache.org/docs/latest/sql-programming-guide.html In the example, it shows something like this:

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

But what I got on my Spark 1.3.0 is the following error:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43)

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c845f64

scala> val df = sqlContext.jsonFile("/user/yzhang/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

scala> df.select("name", df("age") + 1).show()
<console>:30: error: overloaded method value select with alternatives:
  (col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (String, org.apache.spark.sql.Column)
              df.select("name", df("age") + 1).show()
                 ^

Is this a bug in Spark 1.3.0, or does my build have some problem? Thanks
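For reference, the failing call mixes a String argument with a Column argument, so neither select overload applies; with the implicits in scope, the $-interpolator (or df("name")) makes every argument a Column, and the Column* overload matches. A minimal sketch:

import sqlContext.implicits._

// every argument is a Column, so select(cols: Column*) applies
df.select($"name", df("age") + 1).show()

// equivalent, without the implicits import:
df.select(df("name"), df("age") + 1).show()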
Re: Generating a schema in Spark 1.3 failed while using DataTypes.
This looks to me like you have incompatible versions of scala on your classpath? On Thu, Apr 2, 2015 at 4:28 PM, Okehee Goh oke...@gmail.com wrote: yes, below is the stacktrace. Thanks, Okehee java.lang.NoSuchMethodError: scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String; at scala.reflect.internal.StdNames$CommonNames.init(StdNames.scala:97) at scala.reflect.internal.StdNames$Keywords.init(StdNames.scala:203) at scala.reflect.internal.StdNames$TermNames.init(StdNames.scala:288) at scala.reflect.internal.StdNames$nme$.init(StdNames.scala:1045) at scala.reflect.internal.SymbolTable.nme$lzycompute(SymbolTable.scala:16) at scala.reflect.internal.SymbolTable.nme(SymbolTable.scala:16) at scala.reflect.internal.StdNames$class.$init$(StdNames.scala:1041) at scala.reflect.internal.SymbolTable.init(SymbolTable.scala:16) at scala.reflect.runtime.JavaUniverse.init(JavaUniverse.scala:16) at scala.reflect.runtime.package$.universe$lzycompute(package.scala:17) at scala.reflect.runtime.package$.universe(package.scala:17) at org.apache.spark.sql.types.NativeType.init(dataTypes.scala:337) at org.apache.spark.sql.types.StringType.init(dataTypes.scala:351) at org.apache.spark.sql.types.StringType$.init(dataTypes.scala:367) at org.apache.spark.sql.types.StringType$.clinit(dataTypes.scala) at org.apache.spark.sql.types.DataTypes.clinit(DataTypes.java:30) at com.quixey.dataengine.dataprocess.parser.ToTableRecord.generateTableSchemaForSchemaRDD(ToTableRecord.java:282) at com.quixey.dataengine.dataprocess.parser.ToUDMTest.generateTableSchemaTest(ToUDMTest.java:132) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:85) at org.testng.internal.Invoker.invokeMethod(Invoker.java:696) at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:882) at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1189) at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:124) at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108) at org.testng.TestRunner.privateRun(TestRunner.java:767) at org.testng.TestRunner.run(TestRunner.java:617) at org.testng.SuiteRunner.runTest(SuiteRunner.java:348) at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:343) at org.testng.SuiteRunner.privateRun(SuiteRunner.java:305) at org.testng.SuiteRunner.run(SuiteRunner.java:254) at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52) at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86) at org.testng.TestNG.runSuitesSequentially(TestNG.java:1224) at org.testng.TestNG.runSuitesLocally(TestNG.java:1149) at org.testng.TestNG.run(TestNG.java:1057) at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.stop(TestNGTestClassProcessor.java:115) at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:57) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at 
org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at org.gradle.messaging.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32) at org.gradle.messaging.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93) at com.sun.proxy.$Proxy2.stop(Unknown Source) at org.gradle.api.internal.tasks.testing.worker.TestWorker.stop(TestWorker.java:115) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at org.gradle.messaging.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:355) at
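For anyone debugging a mismatch like this, a small diagnostic sketch (mine, not from the thread) to confirm which Scala runtime and scala-reflect jar actually got loaded:

// Should print the 2.10.x version that Spark was built against.
println(scala.util.Properties.versionString)

// Locate the jar that scala-reflect was loaded from
// (getCodeSource can be null for boot-classpath jars).
val src = classOf[scala.reflect.internal.SymbolTable]
  .getProtectionDomain.getCodeSource
println(Option(src).map(_.getLocation).getOrElse("boot classpath"))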
Cannot run the example in the Spark 1.3.0 following the document
I wanted to check out Spark SQL 1.3.0. I installed it and followed the online document here: http://spark.apache.org/docs/latest/sql-programming-guide.html In the example, it shows something like this:

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

But what I got on my Spark 1.3.0 is the following error:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43)

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c845f64

scala> val df = sqlContext.jsonFile("/user/yzhang/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

scala> df.select("name", df("age") + 1).show()
<console>:30: error: overloaded method value select with alternatives:
  (col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (String, org.apache.spark.sql.Column)
              df.select("name", df("age") + 1).show()
                 ^

Is this a bug in Spark 1.3.0, or is my build having some problem? Thanks
RE: ArrayBuffer within a DataFrame
Hint: DF.rdd.map{} Mohammed From: Denny Lee [mailto:denny.g@gmail.com] Sent: Thursday, April 2, 2015 7:10 PM To: user@spark.apache.org Subject: ArrayBuffer within a DataFrame Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as: 2015-04, A 2015-04, B 2015-04, C 2015-04, D What's the best way to do this? Thanks in advance!
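Expanding on the DF.rdd.map{} hint, one possible sketch (the column positions and element type here are assumptions, since the original schema isn't shown):

// Each row looks like [month: String, values: Seq[String]];
// emit one (month, value) pair per element of the buffer.
val exploded = df.rdd.flatMap { row =>
  val month = row.getString(0)
  row.getAs[Seq[String]](1).map(v => (month, v))
}
exploded.collect().foreach { case (m, v) => println(s"$m, $v") }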
Re: Spark Streaming Worker runs out of inodes
Yes, with spark.cleaner.ttl set there is no cleanup. We pass --properties-file spark-dev.conf to spark-submit, where spark-dev.conf contains:

spark.master spark://10.250.241.66:7077
spark.logConf true
spark.cleaner.ttl 1800
spark.executor.memory 10709m
spark.cores.max 4
spark.shuffle.consolidateFiles true

On Thu, Apr 2, 2015 at 7:12 PM, Tathagata Das t...@databricks.com wrote: Are you saying that even with spark.cleaner.ttl set your files are not getting cleaned up? TD On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote: Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and the worker nodes eventually run out of inodes. We see tons of old shuffle_*.data and *.index files that are never deleted. How do we get Spark to remove these files? We have a simple standalone app with one RabbitMQ receiver and a two-node cluster (2 x r3.large AWS instances). The batch interval is 10 minutes, after which we process data and write results to a DB. No windowing or state mgmt is used. I've pored over the documentation and tried setting the following properties, but they have not helped. As a workaround we're using a cron script that periodically cleans up old files, but this has a bad smell to it.

SPARK_WORKER_OPTS in spark-env.sh on every worker node:
spark.worker.cleanup.enabled true
spark.worker.cleanup.interval
spark.worker.cleanup.appDataTtl

Also tried on the driver side:
spark.cleaner.ttl
spark.shuffle.consolidateFiles true

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Worker-runs-out-of-inodes-tp22355.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Generating a schema in Spark 1.3 failed while using DataTypes.
Michael, You are right. The build brought in org.scala-lang:scala-library:2.10.1 from another package (as below). It works fine after excluding the old scala version. Thanks a lot, Okehee

==
dependency:
|    +--- org.apache.kafka:kafka_2.10:0.8.1.1
|    |    +--- com.yammer.metrics:metrics-core:2.2.0
|    |    |    \--- org.slf4j:slf4j-api:1.7.2 -> 1.7.7
|    |    +--- org.xerial.snappy:snappy-java:1.0.5
|    |    +--- org.apache.zookeeper:zookeeper:3.3.4 -> 3.4.5
|    |    |    +--- org.slf4j:slf4j-api:1.6.1 -> 1.7.7
|    |    |    +--- log4j:log4j:1.2.15
|    |    |    +--- jline:jline:0.9.94
|    |    |    \--- org.jboss.netty:netty:3.2.2.Final
|    |    +--- net.sf.jopt-simple:jopt-simple:3.2 -> 4.6
|    |    +--- org.scala-lang:scala-library:2.10.1

On Thu, Apr 2, 2015 at 4:45 PM, Michael Armbrust mich...@databricks.com wrote: This looks to me like you have incompatible versions of scala on your classpath? On Thu, Apr 2, 2015 at 4:28 PM, Okehee Goh oke...@gmail.com wrote: yes, below is the stacktrace. Thanks, Okehee java.lang.NoSuchMethodError: scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String; at scala.reflect.internal.StdNames$CommonNames.<init>(StdNames.scala:97) at scala.reflect.internal.StdNames$Keywords.<init>(StdNames.scala:203) at scala.reflect.internal.StdNames$TermNames.<init>(StdNames.scala:288) at scala.reflect.internal.StdNames$nme$.<init>(StdNames.scala:1045) at scala.reflect.internal.SymbolTable.nme$lzycompute(SymbolTable.scala:16) at scala.reflect.internal.SymbolTable.nme(SymbolTable.scala:16) at scala.reflect.internal.StdNames$class.$init$(StdNames.scala:1041) at scala.reflect.internal.SymbolTable.<init>(SymbolTable.scala:16) at scala.reflect.runtime.JavaUniverse.<init>(JavaUniverse.scala:16) at scala.reflect.runtime.package$.universe$lzycompute(package.scala:17) at scala.reflect.runtime.package$.universe(package.scala:17) at org.apache.spark.sql.types.NativeType.<init>(dataTypes.scala:337) at org.apache.spark.sql.types.StringType.<init>(dataTypes.scala:351) at org.apache.spark.sql.types.StringType$.<init>(dataTypes.scala:367) at org.apache.spark.sql.types.StringType$.<clinit>(dataTypes.scala) at org.apache.spark.sql.types.DataTypes.<clinit>(DataTypes.java:30) at com.quixey.dataengine.dataprocess.parser.ToTableRecord.generateTableSchemaForSchemaRDD(ToTableRecord.java:282) at com.quixey.dataengine.dataprocess.parser.ToUDMTest.generateTableSchemaTest(ToUDMTest.java:132) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:85) at org.testng.internal.Invoker.invokeMethod(Invoker.java:696) at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:882) at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1189) at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:124) at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108) at org.testng.TestRunner.privateRun(TestRunner.java:767) at org.testng.TestRunner.run(TestRunner.java:617) at org.testng.SuiteRunner.runTest(SuiteRunner.java:348) at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:343) at org.testng.SuiteRunner.privateRun(SuiteRunner.java:305) at org.testng.SuiteRunner.run(SuiteRunner.java:254) at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52) at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86) at
org.testng.TestNG.runSuitesSequentially(TestNG.java:1224) at org.testng.TestNG.runSuitesLocally(TestNG.java:1149) at org.testng.TestNG.run(TestNG.java:1057) at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.stop(TestNGTestClassProcessor.java:115) at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:57) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at org.gradle.messaging.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32) at
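For reference, a sketch of the same exclusion in sbt (the tree above is from Gradle; these sbt coordinates are my assumption, not Okehee's actual build file):

// build.sbt: stop kafka from dragging in its own, older scala-library
libraryDependencies +=
  "org.apache.kafka" %% "kafka" % "0.8.1.1" exclude("org.scala-lang", "scala-library")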
maven compile error
Hi all, I just checked out spark-1.2 from GitHub and wanted to build it with Maven; however, I encountered an error during compilation: [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-catalyst_2.10: wrap: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. -> [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-catalyst_2.10: wrap: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59) at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161) at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320) at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156) at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537) at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196) at org.apache.maven.cli.MavenCli.main(MavenCli.java:141) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290) at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409) at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352) Caused by: org.apache.maven.plugin.MojoExecutionException: wrap: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:490) at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209) ... 19 more Caused by: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found.
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.getPackage(Mirrors.scala:172) at scala.reflect.internal.Mirrors$RootsBase.getRequiredPackage(Mirrors.scala:175) at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage$lzycompute(Definitions.scala:183) at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage(Definitions.scala:183) at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass$lzycompute(Definitions.scala:184) at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass(Definitions.scala:184) at scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr$lzycompute(Definitions.scala:1024) at scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr(Definitions.scala:1023) at scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1153) at scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152) at scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196) at scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196) at scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261) at scala.tools.nsc.Global$Run.init(Global.scala:1290) at xsbt.CachedCompiler0$$anon$2.init(CompilerInterface.scala:113) at xsbt.CachedCompiler0.run(CompilerInterface.scala:113) at xsbt.CachedCompiler0.run(CompilerInterface.scala:99) at xsbt.CompilerInterface.run(CompilerInterface.scala:27) at
[SQL] Simple DataFrame questions
Hi folks, having some seemingly noob issues with the dataframe API. I have a DF which came from the csv package.

1. What would be an easy way to cast a column to a given type -- my DF columns are all typed as strings coming from a csv. I see a schema getter but not a setter on DF.

2. I am trying to use the syntax used in various blog posts but can't figure out how to reference a column by name:

scala> df.filter("customer_id" != "")
<console>:23: error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
 cannot be applied to (Boolean)
              df.filter("customer_id" != "")

3. what would be the recommended way to drop a row containing a null value -- is it possible to do this:

scala> df.filter("customer_id IS NOT NULL")
RE: Cannot run the example in the Spark 1.3.0 following the document
The import command was already run. Forgot to mention: the rest of the examples related to df all work, just this one caused a problem. Thanks Yong Date: Fri, 3 Apr 2015 10:36:45 +0800 From: fightf...@163.com To: java8...@hotmail.com; user@spark.apache.org Subject: Re: Cannot run the example in the Spark 1.3.0 following the document Hi, there you may need to add: import sqlContext.implicits._ Best, Sun fightf...@163.com From: java8964 Date: 2015-04-03 10:15 To: user@spark.apache.org Subject: Cannot run the example in the Spark 1.3.0 following the document I wanted to check out Spark SQL 1.3.0. I installed it and followed the online document here: http://spark.apache.org/docs/latest/sql-programming-guide.html In the example, it shows something like this:

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

But what I got on my Spark 1.3.0 is the following error:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43)

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c845f64

scala> val df = sqlContext.jsonFile("/user/yzhang/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

scala> df.select("name", df("age") + 1).show()
<console>:30: error: overloaded method value select with alternatives:
  (col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (String, org.apache.spark.sql.Column)
              df.select("name", df("age") + 1).show()
                 ^

Is this a bug in Spark 1.3.0, or is my build having some problem? Thanks
Re: Generating a schema in Spark 1.3 failed while using DataTypes.
yes, below is the stacktrace. Thanks, Okehee java.lang.NoSuchMethodError: scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String; at scala.reflect.internal.StdNames$CommonNames.init(StdNames.scala:97) at scala.reflect.internal.StdNames$Keywords.init(StdNames.scala:203) at scala.reflect.internal.StdNames$TermNames.init(StdNames.scala:288) at scala.reflect.internal.StdNames$nme$.init(StdNames.scala:1045) at scala.reflect.internal.SymbolTable.nme$lzycompute(SymbolTable.scala:16) at scala.reflect.internal.SymbolTable.nme(SymbolTable.scala:16) at scala.reflect.internal.StdNames$class.$init$(StdNames.scala:1041) at scala.reflect.internal.SymbolTable.init(SymbolTable.scala:16) at scala.reflect.runtime.JavaUniverse.init(JavaUniverse.scala:16) at scala.reflect.runtime.package$.universe$lzycompute(package.scala:17) at scala.reflect.runtime.package$.universe(package.scala:17) at org.apache.spark.sql.types.NativeType.init(dataTypes.scala:337) at org.apache.spark.sql.types.StringType.init(dataTypes.scala:351) at org.apache.spark.sql.types.StringType$.init(dataTypes.scala:367) at org.apache.spark.sql.types.StringType$.clinit(dataTypes.scala) at org.apache.spark.sql.types.DataTypes.clinit(DataTypes.java:30) at com.quixey.dataengine.dataprocess.parser.ToTableRecord.generateTableSchemaForSchemaRDD(ToTableRecord.java:282) at com.quixey.dataengine.dataprocess.parser.ToUDMTest.generateTableSchemaTest(ToUDMTest.java:132) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:85) at org.testng.internal.Invoker.invokeMethod(Invoker.java:696) at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:882) at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1189) at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:124) at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108) at org.testng.TestRunner.privateRun(TestRunner.java:767) at org.testng.TestRunner.run(TestRunner.java:617) at org.testng.SuiteRunner.runTest(SuiteRunner.java:348) at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:343) at org.testng.SuiteRunner.privateRun(SuiteRunner.java:305) at org.testng.SuiteRunner.run(SuiteRunner.java:254) at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52) at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86) at org.testng.TestNG.runSuitesSequentially(TestNG.java:1224) at org.testng.TestNG.runSuitesLocally(TestNG.java:1149) at org.testng.TestNG.run(TestNG.java:1057) at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.stop(TestNGTestClassProcessor.java:115) at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:57) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at 
org.gradle.messaging.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32) at org.gradle.messaging.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93) at com.sun.proxy.$Proxy2.stop(Unknown Source) at org.gradle.api.internal.tasks.testing.worker.TestWorker.stop(TestWorker.java:115) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at org.gradle.messaging.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:355) at org.gradle.internal.concurrent.DefaultExecutorFactory$StoppableExecutorImpl$1.run(DefaultExecutorFactory.java:64)
Re: Spark SQL does not read from cached table if table is renamed
I'll add we just back ported this so it'll be included in 1.2.2 also. On Wed, Apr 1, 2015 at 4:14 PM, Michael Armbrust mich...@databricks.com wrote: This is fixed in Spark 1.3. https://issues.apache.org/jira/browse/SPARK-5195 On Wed, Apr 1, 2015 at 4:05 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Hi all, Noticed a bug in my current version of Spark 1.2.1. After a table is cached with “cache table table” command, query will not read from memory if SQL query renames the table. This query reads from in memory table i.e. select hivesampletable.country from default.hivesampletable group by hivesampletable.country This query with renamed table reads from hive i.e. select table.country from default.hivesampletable table group by table.country Is this a known bug? Most BI tools rename tables to avoid table name collision. Thanks, Judy
Re: Spark-events does not exist error, while it does with all the req. rights
FYI I wrote a small test to try to reproduce this, and filed SPARK-6688 to track the fix. On Tue, Mar 31, 2015 at 1:15 PM, Marcelo Vanzin van...@cloudera.com wrote: Hmmm... could you try to set the log dir to file:/home/hduser/spark/spark-events? I checked the code and it might be the case that the behaviour changed between 1.2 and 1.3... On Mon, Mar 30, 2015 at 6:44 PM, Tom Hubregtsen thubregt...@gmail.com wrote: The stack trace for the first scenario and your suggested improvement is similar, with as only difference the first line (Sorry for not including this): Log directory /home/hduser/spark/spark-events does not exist. To verify your premises, I cd'ed into the directory by copy pasting the path listed in the error message (i, ii), created a text file, closed it an viewed it, and deleted it (iii). My findings were reconfirmed by my colleague. Any other ideas? Thanks, Tom On 30 March 2015 at 19:19, Marcelo Vanzin van...@cloudera.com wrote: So, the error below is still showing the invalid configuration. You mentioned in the other e-mails that you also changed the configuration, and that the directory really, really exists. Given the exception below, the only ways you'd get the error with a valid configuration would be if (i) the directory didn't exist, (ii) it existed but the user could not navigate to it or (iii) it existed but was not actually a directory. So please double-check all that. On Mon, Mar 30, 2015 at 5:11 PM, Tom Hubregtsen thubregt...@gmail.com wrote: Stack trace: 15/03/30 17:37:30 INFO storage.BlockManagerMaster: Registered BlockManager Exception in thread main java.lang.IllegalArgumentException: Log directory ~/spark/spark-events does not exist. -- Marcelo -- Marcelo -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
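For anyone reproducing this, a minimal sketch of the relevant settings (standard spark.eventLog.* keys; the path is the one from this thread, with an explicit file: scheme since Spark does not expand shell-style ~ paths):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("EventLogTest")
  .set("spark.eventLog.enabled", "true")
  // Explicit scheme: avoids ~ expansion problems and any ambiguity
  // about which filesystem the directory is resolved against.
  .set("spark.eventLog.dir", "file:/home/hduser/spark/spark-events")
val sc = new SparkContext(conf)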
RE: Reading a large file (binary) into RDD
I think implementing your own InputFormat and using SparkContext.hadoopFile() is the best option for your case. Yong From: kvi...@vt.edu Date: Thu, 2 Apr 2015 17:31:30 -0400 Subject: Re: Reading a large file (binary) into RDD To: freeman.jer...@gmail.com CC: user@spark.apache.org The file has a specific structure. I outline it below. The input file is basically a representation of a graph.

INT
INT (A)
LONG (B)
A INTs (Degrees)
A SHORTINTs (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (note that the INTs/SHORTINTs associated with this are edge attributes)

After reading in the file, I need to create two RDDs (one with vertices and the other with edges). On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Hm, that will indeed be trickier because this method assumes records are the same byte size. Is the file an arbitrary sequence of mixed types, or is there structure, e.g. short, long, short, long, etc.? If you could post a gist with an example of the kind of file and how it should look once read in that would be useful! - jeremyfreeman.net @thefreemanlab On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here? My current method happens to have a large overhead (much more than the actual computation time). Also, I am short of memory at the driver when it has to read the entire file. On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: If it's a flat binary file and each record is the same length (in bytes), you can use Spark's binaryRecords method (defined on the SparkContext), which loads records from one or more large flat binary files into an RDD. Here's an example in python to show how it works:

# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]

Does that help? - jeremyfreeman.net @thefreemanlab On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: What are some efficient ways to read a large file into RDDs? For example, have several executors read a specific/unique portion of the file and construct RDDs. Is this possible to do in Spark? Currently, I am doing a line-by-line read of the file at the driver and constructing the RDD.
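binaryRecords is also exposed on the Scala SparkContext; a sketch for the fixed-size portions of a file like this (the single-INT record layout below is a simplified assumption, not the exact graph format above - the variable-size sections are what push you toward a custom InputFormat, as Yong suggests):

import java.nio.{ByteBuffer, ByteOrder}

// Suppose each degree record is a single 4-byte big-endian INT.
val recordLength = 4
val records = sc.binaryRecords("hdfs:///data/degrees.bin", recordLength)

// Decode each fixed-size byte record into an Int.
val degrees = records.map { bytes =>
  ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getInt
}
degrees.take(5).foreach(println)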
Re: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available
Hi Young, Sorry for the duplicate post; I wanted to reply to all. I just downloaded the bits prebuilt from the apache spark download site. Started the spark shell and got the same error. I then started the shell as follows:

./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars $(echo ~/Downloads/apache-hive-0.13.1-bin/lib/*.jar | tr ' ' ',')

This worked, or at least got rid of this:

scala> case class MetricTable(path: String, pathElements: String, name: String, value: String)
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveMetastoreCatalog.class. That entry seems to have slain the compiler. Shall I replay your session? I can re-run each line except the last one. [y/n]

Still getting the ClassNotFoundException, json_tuple, from this statement, same as in 1.2.1:

sql("""SELECT path, name, value, v1.peValue, v1.peName
       FROM metric_table
       lateral view json_tuple(pathElements, 'name', 'value') v1
         as peName, peValue""").collect.foreach(println(_))

15/04/02 20:50:14 INFO ParseDriver: Parsing command: SELECT path, name, value, v1.peValue, v1.peName FROM metric_table lateral view json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
15/04/02 20:50:14 INFO ParseDriver: Parse Completed
java.lang.ClassNotFoundException: json_tuple at scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83) at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

Any ideas on the json_tuple exception? Modified the syntax to take into account some minor changes in 1.3. The one posted this morning was from my 1.2.1 test.

import sqlContext.implicits._

case class MetricTable(path: String, pathElements: String, name: String, value: String)

val mt = new MetricTable("/DC1/HOST1/",
  """[{"node": "DataCenter","value": "DC1"},{"node": "host","value": "HOST1"}]""",
  "Memory Usage (%)",
  "29.590943279257175")
val rdd1 = sc.makeRDD(List(mt))
val df = rdd1.toDF
df.printSchema
df.show
df.registerTempTable("metric_table")
sql("""SELECT path, name, value, v1.peValue, v1.peName
       FROM metric_table
       lateral view json_tuple(pathElements, 'name', 'value') v1
         as peName, peValue""").collect.foreach(println(_))

On Thu, Apr 2, 2015 at 8:21 PM, java8964 java8...@hotmail.com wrote: Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but I cannot reproduce it on Spark 1.2.1. If we check the code change below: Spark 1.3 branch https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala vs Spark 1.2 branch https://github.com/apache/spark/blob/branch-1.2/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala You can see that on line 24: import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache} is introduced on the 1.3 branch. The error basically means the com.google.common.cache package cannot be found in the runtime classpath. Either you and I made the same mistake when we built Spark 1.3.0, or there is something wrong with the Spark 1.3 pom.xml file.

Here is how I built 1.3.0:
1) Download the spark 1.3.0 source
2) make-distribution --targz -Dhadoop.version=1.1.1 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests

Is this only because I built against Hadoop 1.x? Yong -- Date: Thu, 2 Apr 2015 13:56:33 -0400 Subject: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available From: tsind...@gmail.com To: user@spark.apache.org I was trying a simple test from the spark-shell to see if 1.3.0 would address a problem I was having with locating the json_tuple class and got the following error:

scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala> val sqlContext = new HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@79c849c7

scala> import sqlContext._
import sqlContext._

scala> case class MetricTable(path: String, pathElements: String, name: String, value: String)
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveMetastoreCatalog.class. That entry seems to have slain
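A quick sketch (mine, not from the thread) to check from the spark-shell whether the guava cache classes that HiveMetastoreCatalog needs are actually visible at runtime:

// Throws ClassNotFoundException if the guava cache package is
// missing from the runtime classpath.
val c = Class.forName("com.google.common.cache.CacheBuilder")
// Shows which jar it was loaded from (may print null for boot-classpath jars).
println(c.getProtectionDomain.getCodeSource)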
Re: [SQL] Simple DataFrame questions
For cast, you can use the selectExpr method. For example:

df.selectExpr("cast(col1 as int) as col1", "cast(col2 as bigint) as col2")

Or: df.select(df("colA").cast("int"), ...)

On Thu, Apr 2, 2015 at 8:33 PM, Michael Armbrust mich...@databricks.com wrote:

val df = Seq(("test", 1)).toDF("col1", "col2")

You can use SQL style expressions as a string:

df.filter("col1 IS NOT NULL").collect()
res1: Array[org.apache.spark.sql.Row] = Array([test,1])

Or you can also reference columns using df("colName") or $"colName" or col("colName"):

df.filter(df("col1") === "test").collect()
res2: Array[org.apache.spark.sql.Row] = Array([test,1])

On Thu, Apr 2, 2015 at 7:45 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, having some seemingly noob issues with the dataframe API. I have a DF which came from the csv package. 1. What would be an easy way to cast a column to a given type -- my DF columns are all typed as strings coming from a csv. I see a schema getter but not a setter on DF. 2. I am trying to use the syntax used in various blog posts but can't figure out how to reference a column by name:

scala> df.filter("customer_id" != "")
<console>:23: error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
 cannot be applied to (Boolean)
              df.filter("customer_id" != "")

3. what would be the recommended way to drop a row containing a null value -- is it possible to do this:

scala> df.filter("customer_id IS NOT NULL")
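Putting the two answers together, a sketch covering all three questions against the columns from the original post (the amount column is my assumption, for illustrating a second cast):

// 1. Cast string columns coming out of the csv package
val typed = df.selectExpr(
  "cast(customer_id as int) as customer_id",
  "cast(amount as double) as amount")

// 2. Reference a column by name inside filter (Column, not a bare String)
val nonEmpty = df.filter(df("customer_id") !== "")

// 3. Drop rows with NULLs via a SQL-style predicate string
val nonNull = df.filter("customer_id IS NOT NULL")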
Re: Spark Streaming Worker runs out of inodes
Are you saying that even with spark.cleaner.ttl set your files are not getting cleaned up? TD On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote: Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and the worker nodes eventually run out of inodes. We see tons of old shuffle_*.data and *.index files that are never deleted. How do we get Spark to remove these files? We have a simple standalone app with one RabbitMQ receiver and a two-node cluster (2 x r3.large AWS instances). The batch interval is 10 minutes, after which we process data and write results to a DB. No windowing or state mgmt is used. I've pored over the documentation and tried setting the following properties, but they have not helped. As a workaround we're using a cron script that periodically cleans up old files, but this has a bad smell to it.

SPARK_WORKER_OPTS in spark-env.sh on every worker node:
spark.worker.cleanup.enabled true
spark.worker.cleanup.interval
spark.worker.cleanup.appDataTtl

Also tried on the driver side:
spark.cleaner.ttl
spark.shuffle.consolidateFiles true

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Worker-runs-out-of-inodes-tp22355.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available
Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but I cannot reproduce it on Spark 1.2.1. If we check the code change below: Spark 1.3 branch https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala vs Spark 1.2 branch https://github.com/apache/spark/blob/branch-1.2/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala You can see that on line 24: import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache} is introduced on the 1.3 branch. The error basically means the com.google.common.cache package cannot be found in the runtime classpath. Either you and I made the same mistake when we built Spark 1.3.0, or there is something wrong with the Spark 1.3 pom.xml file.

Here is how I built 1.3.0:
1) Download the spark 1.3.0 source
2) make-distribution --targz -Dhadoop.version=1.1.1 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests

Is this only because I built against Hadoop 1.x? Yong

Date: Thu, 2 Apr 2015 13:56:33 -0400 Subject: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available From: tsind...@gmail.com To: user@spark.apache.org I was trying a simple test from the spark-shell to see if 1.3.0 would address a problem I was having with locating the json_tuple class and got the following error:

scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala> val sqlContext = new HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@79c849c7

scala> import sqlContext._
import sqlContext._

scala> case class MetricTable(path: String, pathElements: String, name: String, value: String)
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveMetastoreCatalog.class. That entry seems to have slain the compiler. Shall I replay your session? I can re-run each line except the last one. [y/n]
Abandoning crashed session.

I entered the shell as follows:

./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

hive-site.xml looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>***</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value></value>
  </property>
</configuration>

I have downloaded a clean version of 1.3.0 and tried it again, but got the same error. Is this a known issue? Or a configuration issue on my part? TIA for the assistance. -Todd
ArrayBuffer within a DataFrame
Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as: 2015-04, A 2015-04, B 2015-04, C 2015-04, D What's the best way to do this? Thanks in advance!
Re: How to learn Spark ?
The best way of learning spark is to use spark. You may follow the instructions on the apache spark website: http://spark.apache.org/docs/latest/ Download and deploy it in standalone mode, run some examples, try cluster deploy mode, then try to develop your own app and deploy it in your spark cluster. And it's better to learn scala well if you wanna dive into spark. Also there are some books about spark. Thanks & Best regards! San.Luo ----- Original Message ----- From: Star Guo st...@ceph.me To: user@spark.apache.org Subject: How to learn Spark ? Date: April 2, 2015, 16:19 Hi, all I am new here. Could you give me some suggestions for learning Spark ? Thanks. Best Regards, Star Guo
Re: Connection pooling in spark jobs
http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case that we will have to run concurrent jobs (for the same algorithm) on different data sets. And these jobs can run in parallel and each one of them would be fetching the data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best known ways on how to achieve this. The database in question is Oracle Thanks, Sateesh - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Error in SparkSQL/Scala IDE
It failed to find the class class org.apache.spark.sql.catalyst.ScalaReflection in the Spark SQL library. Make sure it's in the classpath and the version is correct, too. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 8:39 AM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: Hi Everyone, I am getting following error while registering table using Scala IDE. Please let me know how to resolve this error. I am using Spark 1.2.1 import sqlContext.createSchemaRDD val empFile = sc.textFile(/tmp/emp.csv, 4) .map ( _.split(,) ) .map( row= Employee(row(0),row(1), row(2), row(3), row( 4))) empFile.registerTempTable(Employees) Thanks Sathish Exception in thread main scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-library.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-reflect.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-actor.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-swing.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-compiler.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/sunrsasign.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/classes] not found. 
at scala.reflect.internal.MissingRequirementError$.signal( MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound( MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass( Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass( Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass( Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply( ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute( TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor( ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor( ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor( ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor( ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor( ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor( ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.svairavelu.examples.QueryCSV$.main(QueryCSV.scala:24) at com.svairavelu.examples.QueryCSV.main(QueryCSV.scala)
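If the Scala IDE project is sbt-based, a sketch of the dependency block that puts spark-sql (which contains ScalaReflection) on the classpath, matching the 1.2.1 version mentioned above:

// build.sbt
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.1",
  "org.apache.spark" %% "spark-sql"  % "1.2.1"
)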
Re: there are about 50% all-zero vector in the als result
yes! thank you very much :-) On Apr 2, 2015, at 7:13 PM, Sean Owen so...@cloudera.com wrote: Right, I asked because in your original message, you were looking at the initialization to a random vector. But that is the initial state, not the final state. On Thu, Apr 2, 2015 at 11:51 AM, lisendong lisend...@163.com wrote: NO, I'm referring to the result. You mean there might be that many zero features in the als result? I think it is not related to the initial state, but I do not know why the percentage of zero vectors is so high (around 50%). I looked into ALS.scala; the user and product factors seem to be initialized by a gaussian distribution, so they should not be all-zero vectors, right? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Error in SparkSQL/Scala IDE
Hi Everyone, I am getting following error while registering table using Scala IDE. Please let me know how to resolve this error. I am using Spark 1.2.1 import sqlContext.createSchemaRDD val empFile = sc.textFile(/tmp/emp.csv, 4) .map ( _.split(,) ) .map( row= Employee(row(0),row(1), row(2), row(3), row(4 ))) empFile.registerTempTable(Employees) Thanks Sathish Exception in thread main scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-library.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-reflect.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-actor.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-swing.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-compiler.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/sunrsasign.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/classes] not found. at scala.reflect.internal.MissingRequirementError$.signal( MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound( MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass( Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass( Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass( Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply( ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute( TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor( ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor( ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor( ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor( ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor( ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor( ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.svairavelu.examples.QueryCSV$.main(QueryCSV.scala:24) at com.svairavelu.examples.QueryCSV.main(QueryCSV.scala)
Re: Connection pooling in spark jobs
Right, I am aware on how to use connection pooling with oracle, but the specific question is how to use it in the context of spark job execution On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote: http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case that we will have to run concurrent jobs (for the same algorithm) on different data sets. And these jobs can run in parallel and each one of them would be fetching the data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best known ways on how to achieve this. The database in question is Oracle Thanks, Sateesh
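One common pattern, sketched here under assumed names (ConnectionPool is a placeholder object, not an existing library, and rdd stands in for whatever RDD the job produces): initialize one pool lazily per executor JVM and borrow a connection per partition, rather than trying to ship a pool object from the driver.

import java.sql.Connection

// Placeholder holder object: the lazy val is evaluated once per executor
// JVM, so each executor gets exactly one pool.
object ConnectionPool {
  lazy val dataSource: javax.sql.DataSource = buildPool()
  private def buildPool(): javax.sql.DataSource =
    ??? // construct your pooled Oracle DataSource (e.g. a UCP or HikariCP setup) here
}

rdd.foreachPartition { rows =>
  val conn: Connection = ConnectionPool.dataSource.getConnection // borrow
  try {
    rows.foreach { row =>
      // ... use conn for this row ...
    }
  } finally {
    conn.close() // returns the connection to the pool
  }
}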
[SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k, SUM('p.q));
Hi, I want to rename an aggregation field using the DataFrame API. The aggregation is done on a nested field, but I got the exception below. Do you see the same issue, and is there any workaround? Thank you very much!

==
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));
 at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
 at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
 at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
 at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
 at org.apache.spark.sql.DataFrame$$anonfun$3.apply(DataFrame.scala:244)
 at org.apache.spark.sql.DataFrame$$anonfun$3.apply(DataFrame.scala:243)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
 at org.apache.spark.sql.DataFrame.toDF(DataFrame.scala:243)
==

And this code can be used to reproduce the issue:

case class ChildClass(q: Long)
case class ParentClass(k: String, p: ChildClass)

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("DFTest").setMaster("local[*]")
  val ctx = new SparkContext(conf)
  val sqlCtx = new HiveContext(ctx)
  import sqlCtx.implicits._
  val source = ctx.makeRDD(Seq(ParentClass("c1", ChildClass(100)))).toDF
  import org.apache.spark.sql.functions._
  val target = source.groupBy('k).agg('k, sum("p.q"))
  // This line prints the correct contents
  // k  SUM('p.q)
  // c1 100
  target.show
  // But this line triggers the exception
  target.toDF("key", "total")
}
==
Re: Error reading smallint in hive table with parquet format
No, my company is using the Cloudera distribution, and 1.2.0 is the latest version of Spark available there. Thanks On Wed, Apr 1, 2015 at 8:08 PM, Michael Armbrust mich...@databricks.com wrote: Can you try with Spark 1.3? Much of this code path has been rewritten / improved in this version. On Wed, Apr 1, 2015 at 7:53 AM, Masf masfwo...@gmail.com wrote: Hi. In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement:

CREATE TABLE testTable STORED AS PARQUET AS SELECT field1 FROM table1

field1 is SMALLINT. If table1 is in text format everything is OK, but if table1 is in parquet format, Spark returns the following error:

15/04/01 16:48:24 ERROR TaskSetManager: Task 26 in stage 1.0 failed 1 times; aborting job Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 1.0 failed 1 times, most recent failure: Lost task 26.0 in stage 1.0 (TID 28, localhost): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Short at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaShortObjectInspector.get(JavaShortObjectInspector.java:41) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createPrimitive(ParquetHiveSerDe.java:251) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createObject(ParquetHiveSerDe.java:301) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createStruct(ParquetHiveSerDe.java:178) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.serialize(ParquetHiveSerDe.java:164) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:123) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:114) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:114) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Thanks! -- Regards. Miguel Ángel -- Saludos. Miguel Ángel
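One workaround that may help while staying on Spark 1.2 is to widen the column at write time so the Parquet SerDe never goes through the Short object inspector. This is only a sketch of that idea, assuming downstream consumers of testTable can tolerate field1 becoming INT; it has not been verified against this exact code path:

// run through a HiveContext, as in the original report
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql(
  "CREATE TABLE testTable STORED AS PARQUET AS " +
  "SELECT CAST(field1 AS INT) AS field1 FROM table1")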
Matrix Transpose
Hello! I have a CSV file with the following content: C1;C2;C3 11;22;33 12;23;34 13;24;35 What is the best approach in Spark (API, MLlib) for computing its transpose? C1 11 12 13 C2 22 23 24 C3 33 34 35 I look forward to your solutions and suggestions (some Scala code would be really helpful). Thanks. Florin P.S. In reality my matrix has more than 1000 columns and more than 1 million rows.
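One common approach at this scale is to key every cell by its column index and regroup. A minimal Scala sketch, assuming the ;-separated layout shown above (the header row transposes along with the data, which matches the desired output) and that a single column's values, one per row, fit in an executor's memory:

import org.apache.spark.SparkContext._ // pair-RDD operations (needed explicitly before Spark 1.3)

val rows = sc.textFile("matrix.csv").map(_.split(";"))
// tag each cell with (columnIndex, (rowIndex, value))
val cells = rows.zipWithIndex.flatMap { case (fields, rowIdx) =>
  fields.zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
}
// each group, ordered by original row index, is one row of the transpose
val transposed = cells.groupByKey()
  .map { case (colIdx, col) => (colIdx, col.toSeq.sortBy(_._1).map(_._2).mkString(" ")) }
  .sortByKey()
  .map(_._2)
transposed.saveAsTextFile("transposed")

For a million rows, each transposed row holds a million cells, so groupByKey is the limiting factor; if a full column does not fit on one executor, a block-partitioned scheme (transposing fixed-size sub-matrices and reassembling them) scales further.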
Re: there are about 50% all-zero vectors in the ALS result
No, I'm referring to the result. You mean there might be many zero features in the ALS result? I think it is not related to the initial state, but I do not know why the percentage of zero vectors is so high (around 50%). On Apr 2, 2015, at 6:08 PM, Sean Owen so...@cloudera.com wrote: You're referring to the initialization, not the result, right? It's possible that the resulting weight vectors are sparse, although this looks surprising to me. But it is not related to the initial state, right? On Thu, Apr 2, 2015 at 10:43 AM, lisendong lisend...@163.com wrote: I found that there are about 50% all-zero vectors in the ALS result (both in userFeatures and productFeatures). The result looks like this: PastedGraphic-1.tiff (attachment) I looked into ALS.scala; the user and product factors seem to be initialized from a Gaussian distribution, so they should not be all-zero vectors, right?
Re: there are about 50% all-zero vectors in the ALS result
Right, I asked because in your original message you were looking at the initialization to a random vector. But that is the initial state, not the final state. On Thu, Apr 2, 2015 at 11:51 AM, lisendong lisend...@163.com wrote: No, I'm referring to the result. You mean there might be many zero features in the ALS result? I think it is not related to the initial state, but I do not know why the percentage of zero vectors is so high (around 50%). I looked into ALS.scala; the user and product factors seem to be initialized from a Gaussian distribution, so they should not be all-zero vectors, right?
Mllib kmeans #iteration
Hello, I am running the KMeans algorithm from MLlib in cluster mode, and I was wondering if I could run the algorithm with a fixed number of iterations in some way. Thanks
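If the goal is simply to bound the work, the maxIterations argument of KMeans.train does this: Lloyd iterations stop at that bound, or earlier if the centers converge first. A minimal sketch with a hypothetical data path and cluster count:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// assumes a text file of space-separated numeric features, one point per line
val points = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val k = 5               // hypothetical number of clusters
val maxIterations = 20  // hard upper bound on the number of Lloyd iterations
val model = KMeans.train(points, k, maxIterations)
println(model.clusterCenters.mkString("\n"))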
Re: How to learn Spark ?
You can also refer to this blog: http://blog.prabeeshk.com/blog/archives/ On 2 April 2015 at 12:19, Star Guo st...@ceph.me wrote: Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
Connection pooling in spark jobs
Hi, We have a case where we will have to run concurrent jobs (for the same algorithm) on different data sets. These jobs can run in parallel, and each one of them would be fetching its data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best-known ways on how to achieve this? The database in question is Oracle. Thanks, Sateesh
Re: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished
Hey Christopher, I'm working with Teng on this issue. Thank you for the explanation. I tried both workarounds: just leaving hive.metastore.warehouse.dir empty does not do anything; the tmp data is still written to S3 and the job attempts to rename/copy+delete from S3 to S3. But anyway, since the desired effect of this setting was not working before, we will discard it, so it will be empty in the future. I tried your approach of copying the Hive jars to the Spark classpath. The query then did not execute at all, breaking with this error message: errorMessage:java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo(com.esotericsoftware.kryo.Kryo, java.io.InputStream, java.lang.Class)) I used a super simple query to reproduce this: SELECT CONCAT('some_string', some_string_col) FROM some_table; We suspect that this comes from too old a Hive version having been used to compile the Spark build in your GitHub project, or from other version mismatches. We suggest recompiling your Spark version against the AWS Hive version, which already implements the Hive adaptations you mentioned. Or what do you think? Cheers Fabian 2015-04-02 10:19 GMT+02:00 Teng Qiu teng...@gmail.com: -- Forwarded message -- From: Bozeman, Christopher bozem...@amazon.com Date: 2015-04-01 22:43 GMT+02:00 Subject: RE: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished To: chutium teng@gmail.com, user@spark.apache.org Teng, There is no need to alter hive.metastore.warehouse.dir. Leave it as is and just create external tables with location pointing to S3. What I suspect you are seeing is that spark-sql is writing to a temp directory within S3 and then issuing a rename to the final location, as would be done with HDFS. But in S3 there is no rename operation, so there is a performance hit as S3 performs a copy then delete. I tested 1TB from/to S3 external tables and it worked; there is just the additional delay for the rename (copy). EMR has modified Hive to avoid the expensive rename, and you can take advantage of this too with Spark SQL, by copying the EMR Hive jars into the Spark classpath, like: /bin/ls /home/hadoop/.versions/hive-*/lib/*.jar | xargs -n 1 -I %% cp %% ~/spark/classpath/emr Please note that since EMR Hive is 0.13 at this time, this does break some other features already supported by spark-sql with the built-in Hive library (for example, Avro support). So if you use this workaround for better-performing writes to S3, be sure to test your use case. Thanks Christopher -Original Message- From: chutium [mailto:teng@gmail.com] Sent: Wednesday, April 01, 2015 9:34 AM To: user@spark.apache.org Subject: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished Hi, we always get issues inserting or creating tables with the Amazon EMR Spark version: when inserting about a 1GB result set, the Spark SQL query never finishes; inserting a small result set (like 500MB) works fine. spark.sql.shuffle.partitions at the default 200, or set spark.sql.shuffle.partitions=1, does not help. The log stops at: 15/04/01 15:48:13 INFO s3n.S3NativeFileSystem: rename s3://hive-db/tmp/hive-hadoop/hive_2015-04-01_15-47-43_036_1196347178448825102-15/-ext-1 s3://hive-db/db_xxx/some_huge_table/ then only metrics.MetricsSaver logs.
we set <property><name>hive.metastore.warehouse.dir</name><value>s3://hive-db</value></property> but hive.exec.scratchdir is not set; I have no idea why the tmp files were created in s3://hive-db/tmp/hive-hadoop/. We just tried the newest Spark 1.3.0 on AMI 3.5.x and AMI 3.6 ( https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md ), and it still does not work. Has anyone hit the same issue? Any idea how to fix it? I believe Amazon EMR's Spark version uses com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem to access S3, not the original Hadoop s3n implementation, right? /home/hadoop/spark/classpath/emr/* and /home/hadoop/spark/classpath/emrfs/* are in the classpath, btw. Is there any plan to use the new Hadoop s3a implementation instead of s3n? Thanks for any help. Teng -- Fabian Wollert Business Intelligence POSTAL ADDRESS Zalando SE 11501 Berlin
Re: there are about 50% all-zero vectors in the ALS result
Oh, I found the reason. According to the ALS optimization formula: if all of a user's ratings are zero, that is, R(i, Ii) is a zero matrix, then the final feature vector for this user will be an all-zero vector… On Apr 2, 2015, at 6:08 PM, Sean Owen so...@cloudera.com wrote: You're referring to the initialization, not the result, right? It's possible that the resulting weight vectors are sparse, although this looks surprising to me. But it is not related to the initial state, right? On Thu, Apr 2, 2015 at 10:43 AM, lisendong lisend...@163.com wrote: I found that there are about 50% all-zero vectors in the ALS result (both in userFeatures and productFeatures). The result looks like this: PastedGraphic-1.tiff (attachment) I looked into ALS.scala; the user and product factors seem to be initialized from a Gaussian distribution, so they should not be all-zero vectors, right?
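For completeness, the algebra behind this observation, sketched from the standard explicit-feedback ALS normal equations (not from the exact code path in ALS.scala): each half-step solves a ridge regression for user $u$ against the item-factor matrix $Y$,

$$x_u = (Y^\top Y + \lambda I)^{-1} Y^\top r_u,$$

so a user whose rating vector $r_u$ is all zeros gets $x_u = 0$ exactly, regardless of the Gaussian initialization of the factors.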
Re: How to learn Spark ?
You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo st...@ceph.me wrote: Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
Re: How to learn Spark ?
I have a self-study workshop here: https://github.com/deanwampler/spark-workshop dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 8:33 AM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo st...@ceph.me wrote: Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
Re: Connection pooling in spark jobs
How long does each executor keep the connection open for? How many connections does each executor open? Are you certain that connection pooling is a performant and suitable solution? Are you running out of resources on the database server and cannot tolerate each executor having a single connection? If you need a solution that limits the number of open connections [resource starvation on the DB server] I think you'd have to fake it with a centralized counter of active connections, and logic within each executor that blocks when the counter is at a given threshold. If the counter is not at threshold, then an active connection can be created (after incrementing the shared counter). You could use something like ZooKeeper to store the counter value. This would have the overall effect of decreasing performance if your required number of connections outstrips the database's resources. On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri sateesh.kav...@gmail.com wrote: But this basically means that the pool is confined to the job (of a single app) in question, but is not sharable across multiple apps? The setup we have is a job server (the spark-jobserver) that creates jobs. Currently, we have each job opening and closing a connection to the database. What we would like to achieve is for each of the jobs to obtain a connection from a db pool. Any directions on how this can be achieved? -- Sateesh On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger c...@koeninger.org wrote: Connection pools aren't serializable, so you generally need to set them up inside of a closure. Doing that for every item is wasteful, so you typically want to use mapPartitions or foreachPartition: rdd.mapPartitions { part => setupPool; part.map { ... } } See Design Patterns for using foreachRDD in http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Right, I am aware of how to use connection pooling with Oracle, but the specific question is how to use it in the context of Spark job execution On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote: http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case where we will have to run concurrent jobs (for the same algorithm) on different data sets. These jobs can run in parallel, and each one of them would be fetching its data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best-known ways on how to achieve this? The database in question is Oracle. Thanks, Sateesh
Re: ArrayBuffer within a DataFrame
Thanks Michael - that was it! I was drawing a blank on this one for some reason - much appreciated! On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com wrote: A lateral view explode using HiveQL. I'm hoping to add explode shorthand directly to the df API in 1.4. On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com wrote: Quick question - the output of a DataFrame is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as: 2015-04, A 2015-04, B 2015-04, C 2015-04, D What's the best way to do this? Thanks in advance!
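For anyone landing on this thread later, a minimal sketch of the HiveQL Michael describes; the table and column names are hypothetical, and a HiveContext is assumed since LATERAL VIEW is Hive syntax:

// df has columns "month" (string) and "items" (array of strings)
df.registerTempTable("monthly_items")
val flattened = hiveContext.sql(
  "SELECT month, item FROM monthly_items LATERAL VIEW explode(items) t AS item")
flattened.show()
// 2015-04 A
// 2015-04 B
// ... one output row per array element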
RE: Cannot run the example in the Spark 1.3.0 following the document
Looks like a typo; try: df.select(df("name"), df("age") + 1) or df.select("name", "age") PRs to fix docs are always appreciated :) On Apr 2, 2015 7:44 PM, java8964 java8...@hotmail.com wrote: The import command was already run. Forgot to mention, the rest of the examples related to df all work; just this one caused a problem. Thanks Yong -- Date: Fri, 3 Apr 2015 10:36:45 +0800 From: fightf...@163.com To: java8...@hotmail.com; user@spark.apache.org Subject: Re: Cannot run the example in the Spark 1.3.0 following the document Hi, there you may need to add: import sqlContext.implicits._ Best, Sun -- fightf...@163.com From: java8964 java8...@hotmail.com Date: 2015-04-03 10:15 To: user@spark.apache.org Subject: Cannot run the example in the Spark 1.3.0 following the document I tried to check out Spark SQL 1.3.0. I installed it and followed the online document here: http://spark.apache.org/docs/latest/sql-programming-guide.html In the example, it shows something like this:

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

But what I got on my Spark 1.3.0 is the following error:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43)

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c845f64
scala> val df = sqlContext.jsonFile("/user/yzhang/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
scala> df.select("name", df("age") + 1).show()
<console>:30: error: overloaded method value select with alternatives:
  (col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (String, org.apache.spark.sql.Column)
       df.select("name", df("age") + 1).show()
          ^

Is this a bug in Spark 1.3.0, or is my build having some problem? Thanks
Re: Connection pooling in spark jobs
But this basically means that the pool is confined to the job (of a single app) in question, but is not sharable across multiple apps? The setup we have is a job server (the spark-jobserver) that creates jobs. Currently, we have each job opening and closing a connection to the database. What we would like to achieve is for each of the jobs to obtain a connection from a db pool. Any directions on how this can be achieved? -- Sateesh On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger c...@koeninger.org wrote: Connection pools aren't serializable, so you generally need to set them up inside of a closure. Doing that for every item is wasteful, so you typically want to use mapPartitions or foreachPartition: rdd.mapPartitions { part => setupPool; part.map { ... } } See Design Patterns for using foreachRDD in http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Right, I am aware of how to use connection pooling with Oracle, but the specific question is how to use it in the context of Spark job execution On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote: http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case where we will have to run concurrent jobs (for the same algorithm) on different data sets. These jobs can run in parallel, and each one of them would be fetching its data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best-known ways on how to achieve this? The database in question is Oracle. Thanks, Sateesh
Fwd:
Actually they may not be sequentially generated, and the list (RDD) could also come from a different component. For example, from this RDD: (105,918) (105,757) (502,516) (105,137) (516,816) (350,502) I would like to separate it into two RDDs: 1) (105,918) (502,516) 2) (105,757) (105,137) (516,816) (350,502) Right now I am using a mutable Set variable to track the elements already selected. After coalescing the RDD to a single partition I am doing something like:

val evalCombinations = collection.mutable.Set.empty[String]
val currentValidCombinations = allCombinations.filter(p => {
  if (!evalCombinations.contains(p._1) && !evalCombinations.contains(p._2)) {
    evalCombinations += p._1; evalCombinations += p._2; true
  } else false
})

This approach is limited by the memory of the executor it runs on. I'd appreciate any better, more scalable solution. Thanks On Wed, Mar 25, 2015 at 3:13 PM, Nathan Kronenfeld nkronenfeld@uncharted.software wrote: You're generating all possible pairs? In that case, why not just generate the sequential pairs you want from the start? On Wed, Mar 25, 2015 at 3:11 PM, Himanish Kushary himan...@gmail.com wrote: It will only give (A,B). I am generating the pairs from combinations of the strings A, B, C and D, so the pairs (ignoring order) would be (A,B),(A,C),(A,D),(B,C),(B,D),(C,D) On successful filtering using the original condition it will transform to (A,B) and (C,D) On Wed, Mar 25, 2015 at 3:00 PM, Nathan Kronenfeld nkronenfeld@uncharted.software wrote: What would it do with the following dataset? (A, B) (A, C) (B, D) On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary himan...@gmail.com wrote: Hi, I have an RDD of pairs of strings like below: (A,B) (B,C) (C,D) (A,D) (E,F) (B,F) I need to transform/filter this into an RDD of pairs that does not repeat a string once it has been used once. So something like: (A,B) (C,D) (E,F) (B,C) is out because B has already been used in (A,B), (A,D) is out because A (and D) has been used, etc. I was thinking of an option of using a shared variable to keep track of what has already been used, but that may only work for a single partition and would not scale for larger datasets. Is there any other efficient way to accomplish this? -- Thanks & Regards Himanish
Matei Zaharia: Reddit Ask Me Anything
Ask Me Anything about Apache Spark & big data: Reddit AMA with Matei Zaharia, Friday, April 3 at 9AM PT / 12PM ET. Details can be found here: http://strataconf.com/big-data-conference-uk-2015/public/content/reddit-ama
Re: Connection pooling in spark jobs
Each executor runs for about 5 seconds, during which time the db connection can potentially be open. Each executor will have 1 connection open. Connection pooling surely has its advantages: performance, and not hitting the db server with an open/close for every job. The database in question is not just used by the Spark jobs but is shared by other systems, so the Spark jobs have to be better at managing their resources. I am not really looking for a db connection counter (I will let the db handle that part), but rather to have a pool of connections on the Spark end so that the connections can be reused across jobs On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke charles.fed...@gmail.com wrote: How long does each executor keep the connection open for? How many connections does each executor open? Are you certain that connection pooling is a performant and suitable solution? Are you running out of resources on the database server and cannot tolerate each executor having a single connection? If you need a solution that limits the number of open connections [resource starvation on the DB server] I think you'd have to fake it with a centralized counter of active connections, and logic within each executor that blocks when the counter is at a given threshold. If the counter is not at threshold, then an active connection can be created (after incrementing the shared counter). You could use something like ZooKeeper to store the counter value. This would have the overall effect of decreasing performance if your required number of connections outstrips the database's resources. On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri sateesh.kav...@gmail.com wrote: But this basically means that the pool is confined to the job (of a single app) in question, but is not sharable across multiple apps? The setup we have is a job server (the spark-jobserver) that creates jobs. Currently, we have each job opening and closing a connection to the database. What we would like to achieve is for each of the jobs to obtain a connection from a db pool. Any directions on how this can be achieved? -- Sateesh On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger c...@koeninger.org wrote: Connection pools aren't serializable, so you generally need to set them up inside of a closure. Doing that for every item is wasteful, so you typically want to use mapPartitions or foreachPartition: rdd.mapPartitions { part => setupPool; part.map { ... } } See Design Patterns for using foreachRDD in http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Right, I am aware of how to use connection pooling with Oracle, but the specific question is how to use it in the context of Spark job execution On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote: http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case where we will have to run concurrent jobs (for the same algorithm) on different data sets. These jobs can run in parallel, and each one of them would be fetching its data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best-known ways on how to achieve this? The database in question is Oracle. Thanks, Sateesh
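One way to get reuse across jobs is to scope the pool to the executor JVM rather than to any single job: a Scala object is initialized lazily on each executor and lives as long as that executor does, so successive jobs submitted to the same long-running SparkContext (as with spark-jobserver) share its connections. A rough sketch, with the JDBC URL, credentials, and the RDD of SQL statements all placeholders; a real pooling library would add validation and eviction on top of this:

import java.sql.{Connection, DriverManager}
import java.util.concurrent.LinkedBlockingQueue

// One instance per executor JVM; reused by every task and every job scheduled there.
object ExecutorConnectionPool {
  private val url = "jdbc:oracle:thin:@//dbhost:1521/service" // placeholder
  private val user = "user"                                   // placeholder
  private val pass = "pass"                                   // placeholder
  private val idle = new LinkedBlockingQueue[Connection](4)   // cap on idle connections

  def borrow(): Connection =
    Option(idle.poll()).getOrElse(DriverManager.getConnection(url, user, pass))

  // hand the connection back, closing it if the idle queue is already full
  def release(c: Connection): Unit = if (!idle.offer(c)) c.close()
}

// rdd: RDD[String] of SQL statements, purely for illustration
rdd.foreachPartition { partition =>
  val conn = ExecutorConnectionPool.borrow()
  val stmt = conn.createStatement()
  try partition.foreach(sql => stmt.executeUpdate(sql))
  finally { stmt.close(); ExecutorConnectionPool.release(conn) }
}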
Re: Reading a large file (binary) into RDD
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here? My current method happens to have a large overhead (much more than the actual computation time). Also, I am short of memory at the driver when it has to read the entire file. On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: If it's a flat binary file and each record is the same length (in bytes), you can use Spark's binaryRecords method (defined on the SparkContext), which loads records from one or more large flat binary files into an RDD. Here's an example in Python to show how it works:

# write data from an array
from numpy import random
dat = random.randn(100, 5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]

Does that help? - jeremyfreeman.net @thefreemanlab On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: What are some efficient ways to read a large file into RDDs? For example, have several executors read a specific/unique portion of the file and construct RDDs. Is this possible to do in Spark? Currently, I am doing a line-by-line read of the file at the driver and constructing the RDD.
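binaryRecords can still apply with mixed field widths as long as every record has the same total byte length: the executors read the file splits in parallel (nothing goes through the driver), and each record arrives as a byte array to decode however the writer laid it out. A sketch with a hypothetical 10-byte layout of one 2-byte short followed by one 8-byte long; the real offsets and byte order must match whatever produced the file:

import java.nio.{ByteBuffer, ByteOrder}

val recordSize = 10 // hypothetical: 2-byte short + 8-byte long per record
val raw = sc.binaryRecords("hdfs:///path/to/data.bin", recordSize)
val parsed = raw.map { bytes =>
  val buf = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN) // match the writer's endianness
  (buf.getShort(), buf.getLong())
}
parsed.take(1) // decode a single record as a sanity check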
Re: workers no route to host
It appears you are using a Cloudera Spark build, 1.3.0-cdh5.4.0-SNAPSHOT, which expects to find the hadoop command: /data/PlatformDep/cdh5/dist/bin/compute-classpath.sh: line 164: hadoop: command not found If you don't want to use Hadoop, download one of the pre-built Spark releases from spark.apache.org. Even the Hadoop builds there will work okay, as they don't actually attempt to run Hadoop commands. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Tue, Mar 31, 2015 at 3:12 AM, ZhuGe t...@outlook.com wrote: Hi, I set up a standalone cluster of 5 machines (tmaster, tslave1,2,3,4) with spark-1.3.0-cdh5.4.0-snapshot. When I execute sbin/start-all.sh, the master is OK, but I can't see the web UI. Moreover, the worker logs show something like this: Spark assembly has been built with Hive, including Datanucleus jars on classpath /data/PlatformDep/cdh5/dist/bin/compute-classpath.sh: line 164: hadoop: command not found Spark Command: java -cp :/data/PlatformDep/cdh5/dist/sbin/../conf:/data/PlatformDep/cdh5/dist/lib/spark-assembly-1.3.0-cdh5.4.0-SNAPSHOT-hadoop2.6.0-cdh5.4.0-SNAPSHOT.jar:/data/PlatformDep/cdh5/dist/lib/datanucleus-rdbms-3.2.1.jar:/data/PlatformDep/cdh5/dist/lib/datanucleus-api-jdo-3.2.1.jar:/data/PlatformDep/cdh5/dist/lib/datanucleus-core-3.2.2.jar: -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://192.168.128.16:7071 --webui-port 8081 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/03/31 06:47:22 INFO Worker: Registered signal handlers for [TERM, HUP, INT] 15/03/31 06:47:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/31 06:47:23 INFO SecurityManager: Changing view acls to: dcadmin 15/03/31 06:47:23 INFO SecurityManager: Changing modify acls to: dcadmin 15/03/31 06:47:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(dcadmin); users with modify permissions: Set(dcadmin) 15/03/31 06:47:23 INFO Slf4jLogger: Slf4jLogger started 15/03/31 06:47:23 INFO Remoting: Starting remoting 15/03/31 06:47:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@tslave2:60815] 15/03/31 06:47:24 INFO Utils: Successfully started service 'sparkWorker' on port 60815. 15/03/31 06:47:24 INFO Worker: Starting Spark worker tslave2:60815 with 2 cores, 3.0 GB RAM 15/03/31 06:47:24 INFO Worker: Running Spark version 1.3.0 15/03/31 06:47:24 INFO Worker: Spark home: /data/PlatformDep/cdh5/dist 15/03/31 06:47:24 INFO Server: jetty-8.y.z-SNAPSHOT 15/03/31 06:47:24 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:8081 15/03/31 06:47:24 INFO Utils: Successfully started service 'WorkerUI' on port 8081. 15/03/31 06:47:24 INFO WorkerWebUI: Started WorkerWebUI at http://tslave2:8081 15/03/31 06:47:24 INFO Worker: Connecting to master akka.tcp://sparkMaster@192.168.128.16:7071/user/Master...
15/03/31 06:47:24 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@tslave2:60815] -> [akka.tcp://sparkMaster@192.168.128.16:7071]: Error [Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host ] 15/03/31 06:47:24 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@tslave2:60815] -> [akka.tcp://sparkMaster@192.168.128.16:7071]: Error [Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host ] 15/03/31 06:47:24 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@tslave2:60815] -> [akka.tcp://sparkMaster@192.168.128.16:7071]: Error [Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host ] 15/03/31 06:47:24 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@tslave2:60815] -> [akka.tcp://sparkMaster@192.168.128.16:7071]: Error [Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071] the worker machines
conversion from java collection type to scala JavaRDD<Object>
Hi All, Is there a way to make a JavaRDD<Object> from an existing Java collection of type List<Object>? I know this can be done using Scala, but I am looking for how to do this using Java. Regards Jeetendra
Re: Spark, snappy and HDFS
Thanks all. I was able to get the decompression working by adding the following to my spark-env.sh script:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar

On Thu, Apr 2, 2015 at 12:51 AM, Sean Owen so...@cloudera.com wrote: Yes, any Hadoop-related process that asks for Snappy compression or needs to read it will have to have the Snappy libs available on the library path. That's usually set up for you in a distro, or you can do it manually like this. This is not Spark-specific. The second question also isn't Spark-specific; you do not have a SequenceFile of byte[] / String, but of byte[] / byte[]. Review what you are writing, since it is not BytesWritable / Text. On Thu, Apr 2, 2015 at 3:40 AM, Nick Travers n.e.trav...@gmail.com wrote: I'm actually running this in a separate environment to our HDFS cluster. I think I've been able to sort out the issue by copying /opt/cloudera/parcels/CDH/lib to the machine I'm running this on (I'm just using a one-worker setup at present) and adding the following to spark-env.sh:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar

I can get past the previous error. The issue now seems to be with what is being returned.

import org.apache.hadoop.io._
val hdfsPath = "hdfs://nost.name/path/to/folder"
val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
file.count()

returns the following error: java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text On Wed, Apr 1, 2015 at 7:34 PM, Xianjin YE advance...@gmail.com wrote: Do you have the same hadoop config for all nodes in your cluster (you run it in a cluster, right?)? Check the node (usually the executor) which gives the java.lang.UnsatisfiedLinkError to see whether libsnappy.so is in the hadoop native lib path. On Thursday, April 2, 2015 at 10:22 AM, Nick Travers wrote: Thanks for the super quick response! I can read the file just fine in Hadoop; it's just when I point Spark at this file that it can't seem to read it, due to the missing Snappy jars / .so's. I'm playing around with adding some things to the spark-env.sh file, but still nothing. On Wed, Apr 1, 2015 at 7:19 PM, Xianjin YE advance...@gmail.com wrote: Can you read the snappy compressed file in HDFS? Looks like libsnappy.so is not in the hadoop native lib path. On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote: Has anyone else encountered the following error when trying to read a snappy compressed sequence file from HDFS? java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z The following works for me when the file is uncompressed:

import org.apache.hadoop.io._
val hdfsPath = "hdfs://nost.name/path/to/folder"
val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
file.count()

but fails when the encoding is Snappy.
I've seen some stuff floating around on the web about having to explicitly enable support for Snappy in Spark, but it doesn't seem to work for me: http://www.ericlin.me/enabling-snappy-support-for-sharkspark
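Following Sean's diagnosis that both the key and the value are byte arrays, the fix on the Spark side would be to request BytesWritable on both sides of the pair, copying the bytes out because Hadoop reuses Writable instances across records. A sketch against the path from the thread:

import org.apache.hadoop.io.BytesWritable

val hdfsPath = "hdfs://nost.name/path/to/folder"
val pairs = sc.sequenceFile[BytesWritable, BytesWritable](hdfsPath)
  .map { case (k, v) => (k.copyBytes(), v.copyBytes()) } // copy out of the reused Writables
pairs.count()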
Spark SQL. Memory consumption
Hi. I'm using Spark SQL 1.2. I have this query:

CREATE TABLE test_MA STORED AS PARQUET AS
SELECT
 field1, field2, field3, field4, field5,
 COUNT(1) AS field6,
 MAX(field7),
 MIN(field8),
 SUM(field9 / 100),
 COUNT(field10),
 SUM(IF(field11 -500, 1, 0)),
 MAX(field12),
 SUM(IF(field13 = 1, 1, 0)),
 SUM(IF(field13 in (3,4,5,6,10,104,105,107), 1, 0)),
 SUM(IF(field13 = 2012, 1, 0)),
 SUM(IF(field13 in (0,100,101,102,103,106), 1, 0))
FROM table1 CL
JOIN table2 netw ON CL.field15 = netw.id
WHERE AND field3 IS NOT NULL
 AND field4 IS NOT NULL
 AND field5 IS NOT NULL
GROUP BY field1, field2, field3, field4, netw.field5

spark-submit --master spark://master:7077 --driver-memory 20g --executor-memory 60g --class GMain project_2.10-1.0.jar --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*' 2> ./error

Input data is 8GB in parquet format. It crashes many times with GC overhead errors. I've fixed spark.shuffle.partitions to 1024, but my worker nodes (with 128GB RAM per node) collapse. Is this query too difficult for Spark SQL? Would it be better to do it in Spark? Am I doing something wrong? Thanks -- Regards. Miguel Ángel
Re: How to learn Spark ?
Thank you! I'll begin with it. Best Regards, Star Guo I have a self-study workshop here: https://github.com/deanwampler/spark-workshop dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 8:33 AM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo st...@ceph.me wrote: Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
Spark Streaming Error in block pushing thread
I am running a spark streaming stand-alone cluster, connected to rabbitmq endpoint(s). The application will run for 20-30 minutes before failing with the following error: WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 23 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 20 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 19 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 18 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 17 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 16 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:54,151 org.apache.spark.streaming.scheduler.ReceiverTracker.logWarning.71: Error reported by receiver for stream 0: Error in block pushing thread - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:166) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:127) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl$$anon$2.onPushBlock(ReceiverSupervisorImpl.scala:112) at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:182) at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:155) at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:87) Has anyone run into this before? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Error-in-block-pushing-thread-tp22356.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Streaming Error in block pushing thread
Thank you for the response, Dean. There are 2 worker nodes, with 8 cores total, attached to the stream. I have the following settings applied: spark.executor.memory 21475m spark.cores.max 16 spark.driver.memory 5235m On Thu, Apr 2, 2015 at 11:50 AM, Dean Wampler deanwamp...@gmail.com wrote: Are you allocating 1 core per input stream plus additional cores for the rest of the processing? Each input stream Reader requires a dedicated core. So, if you have two input streams, you'll need local[3] at least. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 11:45 AM, byoung bill.yo...@threatstack.com wrote: I am running a spark streaming stand-alone cluster, connected to rabbitmq endpoint(s). The application will run for 20-30 minutes before failing with the following error: WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 23 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 20 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 19 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 18 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 17 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 16 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:54,151 org.apache.spark.streaming.scheduler.ReceiverTracker.logWarning.71: Error reported by receiver for stream 0: Error in block pushing thread - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:166) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:127) at 
org.apache.spark.streaming.receiver.ReceiverSupervisorImpl$$anon$2.onPushBlock(ReceiverSupervisorImpl.scala:112) at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:182) at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:155) at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:87) Has anyone run into this before? -- Bill Young Threat Stack | Senior Infrastructure Engineer http://www.threatstack.com
Re: Spark Streaming Error in block pushing thread
Sorry for the obvious typo, I have 4 workers with 16 cores total* On Thu, Apr 2, 2015 at 11:56 AM, Bill Young bill.yo...@threatstack.com wrote: Thank you for the response, Dean. There are 2 worker nodes, with 8 cores total, attached to the stream. I have the following settings applied: spark.executor.memory 21475m spark.cores.max 16 spark.driver.memory 5235m On Thu, Apr 2, 2015 at 11:50 AM, Dean Wampler deanwamp...@gmail.com wrote: Are you allocating 1 core per input stream plus additional cores for the rest of the processing? Each input stream Reader requires a dedicated core. So, if you have two input streams, you'll need local[3] at least. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 11:45 AM, byoung bill.yo...@threatstack.com wrote: I am running a spark streaming stand-alone cluster, connected to rabbitmq endpoint(s). The application will run for 20-30 minutes before failing with the following error: WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 23 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 20 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 19 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 18 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 17 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 16 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:54,151 org.apache.spark.streaming.scheduler.ReceiverTracker.logWarning.71: Error reported by receiver for stream 0: Error in block pushing thread - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:166) at 
org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:127) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl$$anon$2.onPushBlock(ReceiverSupervisorImpl.scala:112) at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:182) at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:155) at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:87) Has anyone run into this before? -- Bill Young Threat Stack | Senior Infrastructure Engineer http://www.threatstack.com
Re: From DataFrame to LabeledPoint
Hi, try the following code:

val labeledPoints: RDD[LabeledPoint] = features.zip(labels).map {
  case (Row(feature1, feature2, ...), Row(label)) =>
    LabeledPoint(label, Vectors.dense(feature1, feature2, ...))
}

Thanks, Peter Rudenko On 2015-04-02 17:17, drarse wrote: Hello! I have had a question for days. I am working with DataFrames, and with Spark SQL I imported a JSON file: val df = sqlContext.jsonFile("file.json") In this JSON I have the label and the features. I selected them: val features = df.select("feature1", "feature2", "feature3", ...); val labels = df.select("cassification") But now I don't know how to create a LabeledPoint for RandomForest. I tried some solutions without success. Can you help me? Thanks for all!
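A more concrete variant of Peter's sketch that avoids the zip entirely by selecting the label and the features in one pass. The column names are the (hypothetical) ones from the original post, and the fields are assumed to have parsed as doubles; with jsonFile, integers come back as longs, in which case row.getLong(i).toDouble applies instead:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val labeled = df.select("cassification", "feature1", "feature2", "feature3")
  .map { row =>
    LabeledPoint(row.getDouble(0),
      Vectors.dense(row.getDouble(1), row.getDouble(2), row.getDouble(3)))
  }
labeled.first() // ready for RandomForest.trainClassifier / trainRegressor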
Re How to learn Spark ?
So cool!! Thanks. Best Regards, Star Guo = You can also refer to this blog: http://blog.prabeeshk.com/blog/archives/ On 2 April 2015 at 12:19, Star Guo st...@ceph.me wrote: Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
Re: conversion from java collection type to scala JavaRDD<Object>
Use JavaSparkContext.parallelize. http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 11:33 AM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All, Is there a way to make a JavaRDD<Object> from an existing Java collection of type List<Object>? I know this can be done using Scala, but I am looking for how to do this using Java. Regards Jeetendra
Re: Spark Streaming Error in block pushing thread
Are you allocating 1 core per input stream plus additional cores for the rest of the processing? Each input stream Reader requires a dedicated core. So, if you have two input streams, you'll need local[3] at least. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 11:45 AM, byoung bill.yo...@threatstack.com wrote: I am running a spark streaming stand-alone cluster, connected to rabbitmq endpoint(s). The application will run for 20-30 minutes before failing with the following error: WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 23 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 20 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 19 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 18 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 17 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 16 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:54,151 org.apache.spark.streaming.scheduler.ReceiverTracker.logWarning.71: Error reported by receiver for stream 0: Error in block pushing thread - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:166) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:127) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl$$anon$2.onPushBlock(ReceiverSupervisorImpl.scala:112) at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:182) at org.apache.spark.streaming.receiver.BlockGenerator.org $apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:155) at 
org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:87) Has anyone run into this before?
Re: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException
To clarify one thing, is count() the first action (http://spark.apache.org/docs/latest/programming-guide.html#actions) you're attempting? As defined in the programming guide, an action forces evaluation of the pipeline of RDDs; it's only then that reading the data actually occurs. So count() might not be the issue, but some upstream step that attempted to read the file. As a sanity check, if you just read the text file and don't convert the strings, then call count(), does that work? If so, it might be something about your JavaBean BERecord after all. Can you post its definition? Calling take(1) to grab the first element should also work, even if the RDD is empty. (It will return an empty result in that case, not throw an exception.) dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 10:16 AM, Ashley Rose ashley.r...@telarix.com wrote: [quoted message and stack trace elided; they appear in full in "RE: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException" below]
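A minimal version of that sanity check, sketched in Scala for brevity (the Java version with JavaSparkContext is analogous; the file name is taken from the quoted code):

// count the raw lines before converting them to beans; if this fails too,
// the problem is reading the file itself, not the convertToBER step
val lines = sc.textFile("file.txt")
println("raw line count: " + lines.count())
// take(1) on an empty RDD returns an empty array rather than throwing
lines.take(1).foreach(println)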
Spark Streaming Worker runs out of inodes
Apparently Spark Streaming 1.3.0 is not cleaning up its internal files, and the worker nodes eventually run out of inodes. We see tons of old shuffle_*.data and *.index files that are never deleted. How do we get Spark to remove these files? We have a simple standalone app with one RabbitMQ receiver and a two-node cluster (2 x r3.large AWS instances). The batch interval is 10 minutes, after which we process data and write results to a DB. No windowing or state management is used. I've pored over the documentation and tried setting the following properties, but they have not helped. As a workaround we're using a cron script that periodically cleans up old files, but this has a bad smell to it. In SPARK_WORKER_OPTS in spark-env.sh on every worker node: spark.worker.cleanup.enabled true, spark.worker.cleanup.interval, spark.worker.cleanup.appDataTtl. Also tried on the driver side: spark.cleaner.ttl, spark.shuffle.consolidateFiles true.
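For reference, the spark.worker.cleanup.* settings are read as JVM system properties, so in spark-env.sh they take a form like the line below (a sketch; the interval and TTL values shown are just the documented defaults, in seconds, as an assumption about what was tried):

SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"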
Re: Re: How to learn Spark ?
Thanks a lot. I'll follow your suggestion. Best Regards, Star Guo = The best way of learning Spark is to use Spark. You may follow the instructions on the Apache Spark website: http://spark.apache.org/docs/latest/ Download it, deploy it in standalone mode, run some examples, try cluster deploy mode, and then try to develop your own app and deploy it in your Spark cluster. It's also better to learn Scala well if you want to dive into Spark. There are also some books about Spark. Thanks & Best regards! San.Luo - Original message - From: Star Guo st...@ceph.me To: user@spark.apache.org Subject: How to learn Spark ? Date: 2015-04-02 16:19 Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
A stream of json objects using Java
I'm reading a stream of string lines that are in JSON format. I'm using Java with Spark. Is there a way to get this from a transformation, so that I end up with a stream of JSON objects? I would also welcome any feedback about this approach or alternative approaches. thanks jk
Spark streaming error in block pushing thread
I am running a standalone Spark streaming cluster, connected to multiple RabbitMQ endpoints. The application will run for 20-30 minutes before raising the following error:
WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 23 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 20 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 19 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 18 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 17 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 16 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:54,151 org.apache.spark.streaming.scheduler.ReceiverTracker.logWarning.71: Error reported by receiver for stream 0: Error in block pushing thread - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:166) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:127) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl$$anon$2.onPushBlock(ReceiverSupervisorImpl.scala:112) at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:182) at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:155) at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:87)
Has anyone run into this before? -- Bill Young Threat Stack | Infrastructure Engineer http://www.threatstack.com
RE: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException
That’s precisely what I was trying to check. It should have 42577 records in it, because that’s how many there were in the text file I read in.
// Load a text file and convert each line to a JavaBean.
JavaRDD<String> lines = sc.textFile("file.txt");
JavaRDD<BERecord> tbBER = lines.map(s -> convertToBER(s));
// Apply a schema to an RDD of JavaBeans and register it as a table.
schemaBERecords = sqlContext.createDataFrame(tbBER, BERecord.class);
schemaBERecords.registerTempTable("tbBER");
The BERecord class is a standard JavaBean that implements Serializable, so that shouldn’t be the issue. As you said, count() shouldn’t fail like this even if the table was empty. I was able to print out the schema of the DataFrame just fine with df.printSchema(), and I just wanted to see if the data was populated correctly.
From: Dean Wampler [mailto:deanwamp...@gmail.com] Sent: Wednesday, April 01, 2015 6:05 PM To: Ashley Rose Cc: user@spark.apache.org Subject: Re: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException
Is it possible tbBER is empty? If so, it shouldn't fail like this, of course. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com
On Wed, Apr 1, 2015 at 5:57 PM, ARose ashley.r...@telarix.com wrote: Note: I am running Spark on Windows 7 in standalone mode. In my app, I run the following:
DataFrame df = sqlContext.sql("SELECT * FROM tbBER");
System.out.println("Count: " + df.count());
tbBER is registered as a temp table in my SQLContext. When I try to print the number of rows in the DataFrame, the job fails and I get the following error message: java.io.EOFException at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747) at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1033) at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63) at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101) at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216) at org.apache.hadoop.io.UTF8.readString(UTF8.java:208) at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87) at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:237) at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66) at org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:43) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1137) at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) This only happens when I try to call df.count(). The rest runs fine. Is the count() function not supported in standalone mode? The stack trace makes it appear to be Hadoop functionality...
From DataFrame to LabeledPoint
Hello! I have had a question for some days now. I am working with DataFrames and Spark SQL. I imported a jsonFile:
val df = sqlContext.jsonFile("file.json")
In this JSON I have the label and the features. I selected them:
val features = df.select("feature1", "feature2", "feature3", ...)
val labels = df.select("classification")
But now I don't know how to create a LabeledPoint for RandomForest. I tried some solutions without success. Can you help me? Thanks for all!
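One way to build the LabeledPoints in Spark 1.3, as a minimal sketch assuming the label column is named "classification" and the selected columns are all doubles (column names taken from the question):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// select the label first so the column positions below are predictable;
// if jsonFile inferred a column as long, use row.getLong(i).toDouble instead
val data = df.select("classification", "feature1", "feature2", "feature3").map { row =>
  LabeledPoint(row.getDouble(0),
    Vectors.dense(row.getDouble(1), row.getDouble(2), row.getDouble(3)))
}
// data is an RDD[LabeledPoint], which RandomForest.trainClassifier accepts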
Re: A stream of json objects using Java
This just reduces to finding a library that can translate a String of JSON into a POJO, Map, or other representation of the JSON. There are loads of these, like Gson or Jackson. Sure, you can easily use these in a function that you apply to each JSON string in each line of the file. It's no different when this is run in Spark. On Thu, Apr 2, 2015 at 2:22 PM, James King jakwebin...@gmail.com wrote: [quoted message elided; it appears in full as "A stream of json objects using Java" above]
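As one concrete option, a Scala sketch using Jackson (Gson would look much the same, and the Java version is analogous); "lines" stands for the DStream of JSON strings from the question:

import com.fasterxml.jackson.databind.ObjectMapper

val parsed = lines.mapPartitions { iter =>
  // ObjectMapper is not serializable, so build one per partition
  // instead of capturing it in the task closure
  val mapper = new ObjectMapper()
  iter.map(line => mapper.readValue(line, classOf[java.util.Map[String, Object]]))
}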
Re: Setup Spark jobserver for Spark SQL
You shouldn't need to do anything special. Are you using a named context? I'm not sure those work with SparkSqlJob. By the way, there is a forum on Google groups for the Spark Job Server: https://groups.google.com/forum/#!forum/spark-jobserver On Thu, Apr 2, 2015 at 5:10 AM, Harika matha.har...@gmail.com wrote: Hi, I am trying to use the Spark Jobserver (https://github.com/spark-jobserver/spark-jobserver) for running Spark SQL jobs. I was able to start the server, but when I run my application (my Scala class, which extends SparkSqlJob), I get the following response: { "status": "ERROR", "result": "Invalid job type for this context" } Can anyone suggest what is going wrong, or provide a detailed procedure for setting up the jobserver for Spark SQL?
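For concreteness, a skeleton of a SQL job as the jobserver API looked around that time; the trait, package, and method signatures here are from memory, so treat them as assumptions to check against your jobserver version:

import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkJobValid, SparkJobValidation, SparkSqlJob}

object ProbeSqlJob extends SparkSqlJob {
  // accept any input for this sketch
  def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid
  // a SQL job receives a SQLContext, so the context it is submitted to
  // must have been created as a SQL-capable context
  def runJob(sql: SQLContext, config: Config): Any = sql.sql("SELECT 1").collect()
}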
Re: Connection pooling in spark jobs
Connection pools aren't serializable, so you generally need to set them up inside of a closure. Doing that for every item is wasteful, so you typically want to use mapPartitions or foreachPartition:
rdd.mapPartitions { part =>
  val pool = setupPool
  part.map { ... }
}
See "Design Patterns for using foreachRDD" in http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Right, I am aware of how to use connection pooling with Oracle, but the specific question is how to use it in the context of Spark job execution. On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote: http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw. On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case where we will have to run concurrent jobs (for the same algorithm) on different data sets. These jobs can run in parallel, and each one of them would be fetching the data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best known ways on how to achieve this? The database in question is Oracle. Thanks, Sateesh
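A slightly fuller sketch of that pattern, with Commons DBCP standing in as the pool (the JDBC URL, credentials, and the rdd variable are placeholders, not from the thread). Holding the pool in an object means each executor JVM builds it at most once, lazily:

import java.sql.Connection
import org.apache.commons.dbcp.BasicDataSource

object ConnectionPool {
  // created lazily, once per executor JVM
  lazy val dataSource: BasicDataSource = {
    val ds = new BasicDataSource()
    ds.setUrl("jdbc:oracle:thin:@//dbhost:1521/SERVICE") // placeholder URL
    ds.setUsername("user")                               // placeholder
    ds.setPassword("secret")                             // placeholder
    ds.setMaxActive(10)                                  // cap concurrent connections per executor
    ds
  }
}

rdd.foreachPartition { partition =>
  val conn: Connection = ConnectionPool.dataSource.getConnection()
  try {
    partition.foreach { row =>
      // read or write the row using conn
    }
  } finally {
    conn.close() // returns the connection to the pool, not the socket
  }
}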
A problem with Spark 1.3 artifacts
A very simple example which works well with Spark 1.2 but fails to compile with Spark 1.3:
build.sbt:
name := "untitled"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
Test.scala:
package org.apache.spark.metrics
import org.apache.spark.SparkEnv
class Test {
  SparkEnv.get.metricsSystem.report()
}
Produces:
Error:scalac: bad symbolic reference. A signature in MetricsSystem.class refers to term eclipse in package org which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling MetricsSystem.class.
Error:scalac: bad symbolic reference. A signature in MetricsSystem.class refers to term jetty in value org.eclipse which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling MetricsSystem.class.
This looks like something wrong with shading Jetty. MetricsSystem references MetricsServlet, which references some classes from Jetty in the original package instead of the shaded one. I'm not sure, but adding the following dependencies solves the problem:
libraryDependencies += "org.eclipse.jetty" % "jetty-server" % "8.1.14.v20131031"
libraryDependencies += "org.eclipse.jetty" % "jetty-servlet" % "8.1.14.v20131031"
Is this intended, or is it a bug? Thanks! Jacek
Re: How to learn Spark ?
Yes, I just searched for it! Best Regards, Star Guo == You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo st...@ceph.me wrote: Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
Spark + Kinesis
Hi all, I am trying to write an Amazon Kinesis consumer Scala app that processes data in the Kinesis stream. Is this the correct way to specify build.sbt?
---
import AssemblyKeys._

name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
  "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
  "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

assemblySettings
jarName in assembly := "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala=false)

In project/assembly.sbt I have only the following line: addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0") I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book. Thanks, Vadim