Spark S3 LZO input files

2014-07-03 Thread hassan
I'm trying to read input files from S3. The files are compressed using LZO, i.e. from spark-shell, sc.textFile("s3n://path/xx.lzo").first returns 'String = �LZO?'. Spark does not uncompress the data from the file. I am using Cloudera Manager 5 with CDH 5.0.2, and I've already installed 'GPLEXTRAS'.
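A minimal sketch of the usual fix, assuming the hadoop-lzo jar shipped with GPLEXTRAS (which provides com.hadoop.mapreduce.LzoTextInputFormat) is on the spark-shell classpath: sc.textFile knows nothing about LZO, so read through the LZO input format explicitly.

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // hadoop-lzo's input format decompresses the records for us
    val lines = sc
      .newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat]("s3n://path/xx.lzo")
      .map(_._2.toString) // drop the byte-offset keys, keep the text
    lines.first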

Re: MLLib : Math on Vector and Matrix

2014-07-03 Thread Xiangrui Meng
Hi Thunder, Please understand that both MLlib and breeze are in active development. Before v1.0, we used jblas but in the public APIs we only exposed Array[Double]. In v1.0, we introduced Vector that supports both dense and sparse data and switched the backend to breeze/netlib-java (except ALS).
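For reference, a short sketch of the two v1.0 representations (both in org.apache.spark.mllib.linalg):

    import org.apache.spark.mllib.linalg.Vectors

    val dense = Vectors.dense(1.0, 0.0, 3.0)
    // sparse form: vector size, then parallel arrays of indices and values
    val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))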

Re: MLLib : Math on Vector and Matrix

2014-07-03 Thread Xiangrui Meng
Hi Dmitriy, It is sweet to have the bindings, but it is very easy to downgrade the performance with them. The BLAS/LAPACK APIs have been there for more than 20 years and they are still the top choice for high-performance linear algebra. I'm thinking about whether it is possible to make the

Re: Integrate Spark Editor with Hue for source compiled installation of spark/spark-jobServer

2014-07-03 Thread Sunita Arvind
That's good to know. I will try it out. Thanks Romain On Friday, June 27, 2014, Romain Rigaux romain.rig...@gmail.com wrote: So far Spark Job Server does not work with Spark 1.0: https://github.com/ooyala/spark-jobserver So this works only with Spark 0.9 currently:

Re: Run spark unit test on Windows 7

2014-07-03 Thread Konstantin Kudryavtsev
It sounds really strange... I guess it is a bug, a critical bug, and must be fixed... at least some flag should be added (unable.hadoop). I found the following workaround: 1) download compiled winutils.exe from
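The rest of the workaround is cut off above; a commonly cited variant, assuming winutils.exe has been placed under C:\winutils\bin (the path is an assumption), is to point Hadoop's shell utilities at it before the SparkContext starts:

    // must run before the SparkContext is created; the path is hypothetical
    System.setProperty("hadoop.home.dir", "C:\\winutils")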

Re: installing spark 1 on hadoop 1

2014-07-03 Thread Akhil Das
Do you have an sbt directory inside your spark directory? Thanks Best Regards On Wed, Jul 2, 2014 at 10:17 PM, Imran Akbar im...@infoscoutinc.com wrote: Hi, I'm trying to install Spark 1 on my Hadoop cluster running on EMR. I didn't have any problem installing the previous versions, but

Re: Spark SQL - groupby

2014-07-03 Thread Subacini B
Hi, Can someone provide me pointers for this issue? Thanks Subacini On Wed, Jul 2, 2014 at 3:34 PM, Subacini B subac...@gmail.com wrote: Hi, The code below throws a compilation error, not found: *value Sum*. Can someone help me with this? Do I need to add any jars or imports? Even for Count

Re: installing spark 1 on hadoop 1

2014-07-03 Thread Akhil Das
If you have downloaded the pre-compiled binary, it will not have an sbt directory inside it. Thanks Best Regards On Thu, Jul 3, 2014 at 12:35 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Do you have an sbt directory inside your spark directory? Thanks Best Regards On Wed, Jul 2, 2014 at

Re: RDD join: composite keys

2014-07-03 Thread Andrew Ash
Hi Sameer, If you set those two IDs to be a Tuple2 in the key of the RDD, then you can join on that tuple. Example: val rdd1: RDD[Tuple3[Int, Int, String]] = ... val rdd2: RDD[Tuple3[Int, Int, String]] = ... val resultRDD = rdd1.map(k => ((k._1, k._2), k._3)).join(rdd2.map(k => ((k._1, k._2), k._3)))

Re: Spark SQL - groupby

2014-07-03 Thread Takuya UESHIN
Hi, You need to import Sum and Count, like: import org.apache.spark.sql.catalyst.expressions.{Sum, Count} // or with wildcard _. Or, if you use a current master branch build, you can use sum('colB) instead of Sum('colB). Thanks. 2014-07-03 16:09 GMT+09:00 Subacini B subac...@gmail.com: Hi, Can
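A sketch of the import plus the Spark 1.0 Catalyst DSL usage it enables, assuming a SQLContext named sqlContext and a SchemaRDD named orders with columns colA and colB (all hypothetical names):

    import org.apache.spark.sql.catalyst.expressions.{Sum, Count}

    import sqlContext._ // brings in the Symbol-to-column implicits
    val totals = orders.groupBy('colA)(Sum('colB) as 'totalB, Count('colB) as 'cnt)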

Re: Shark Vs Spark SQL

2014-07-03 Thread 田毅
Add MASTER=yarn-client and the JDBC / Thrift server will run on YARN. 2014-07-02 16:57 GMT-07:00 田毅 tia...@asiainfo.com: hi, Matei Do you know how to run the JDBC / Thrift server on YARN? I did not find any suggestion in the docs. 2014-07-02 16:06 GMT-07:00 Matei Zaharia

Case class in java

2014-07-03 Thread Kevin Jung
Hi, I'm trying to convert a Scala Spark job into Java. In Scala, I typically use a 'case class' to apply a schema to an RDD. It can be converted into a POJO class in Java, but what I really want to do is dynamically create POJO classes like the Scala REPL does. For this reason, I import javassist to

Re: write event logs with YARN

2014-07-03 Thread Christophe Préaud
Hi Andrew, This does not work (the application failed); I get the following error when I put 3 slashes in the hdfs scheme: (...) Caused by: java.lang.IllegalArgumentException: Pathname

Re: java options for spark-1.0.0

2014-07-03 Thread Wanda Hawk
With spark-1.0.0 this is the cmdline from /proc/#pid: (with the export line export _JAVA_OPTIONS=...)

Which version of Hive support Spark Shark

2014-07-03 Thread Ravi Prasad
Hi, can anyone please help me understand which version of Hive supports Spark and Shark? -- -- Regards, RAVI PRASAD. T

Re: Case class in java

2014-07-03 Thread Kevin Jung
I found a web page with a hint: http://ardoris.wordpress.com/2014/03/30/how-spark-does-class-loading/ I learned that SparkIMain has an internal HTTP server to publish class objects, but I can't figure out how to use it in Java. Any ideas? Thanks, Kevin

hdfs short circuit

2014-07-03 Thread Jahagirdar, Madhu
Can I enable Spark to use the dfs.client.read.shortcircuit property to improve performance and read natively on local nodes instead of going through the HDFS API?
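A sketch of the client-side knob, set on the Hadoop configuration Spark uses; note that short-circuit reads also require datanode-side setup, and the socket path below is only an example value:

    sc.hadoopConfiguration.setBoolean("dfs.client.read.shortcircuit", true)
    // must match the datanodes' dfs.domain.socket.path (example value)
    sc.hadoopConfiguration.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn")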

Re: Enable Parsing Failed or Incompleted jobs on HistoryServer (YARN mode)

2014-07-03 Thread Surendranauth Hiraman
I've had some odd behavior with jobs showing up in the history server in 1.0.0. Failed jobs do show up but it seems they can show up minutes or hours later. I see in the history server logs messages about bad task ids. But then eventually the jobs show up. This might be your situation.

Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-03 Thread Wanda Hawk
I have given this a try in a spark-shell and I still get many Allocation Failures On Thursday, July 3, 2014 9:51 AM, Xiangrui Meng men...@gmail.com wrote: The SparkKMeans is just an example code showing a barebone implementation of k-means. To run k-means on big datasets, please use the

Reading text file vs streaming text files

2014-07-03 Thread M Singh
Hi: I am working on a project where a few thousand text files (~20M in size) will be dropped into an HDFS directory every 15 minutes. Data from the files will be used to update counters in Cassandra (a non-idempotent operation). I was wondering what is the best way to deal with this: * Use text

matchError:null in ALS.train

2014-07-03 Thread Honey Joshi
Hi All, We are using ALS.train to generate a model for predictions. We are using DStream[] to collect the predicted output and then trying to dump it to a text file using these two approaches, dstream.saveAsTextFiles() and dstream.foreachRDD(rdd => rdd.saveAsTextFile(...)). But both these approaches are

Re: Reading text file vs streaming text files

2014-07-03 Thread Akhil Das
Hi Singh! For this use case it's better to have a StreamingContext listening to the HDFS directory where the files are being dropped. You can set the streaming interval to 15 minutes and let the driver program run continuously, so as soon as new files arrive they are taken for
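A minimal sketch of that setup, with a hypothetical directory name; textFileStream only picks up files newly created in the directory, so producers should move files in atomically:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf().setAppName("FileCounters")
    val ssc = new StreamingContext(conf, Minutes(15))
    val lines = ssc.textFileStream("hdfs:///dropzone") // hypothetical path
    lines.foreachRDD { rdd =>
      // update the Cassandra counters here, e.g. via rdd.foreachPartition
    }
    ssc.start()
    ssc.awaitTermination()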

Re: Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-03 Thread Honey Joshi
On Wed, July 2, 2014 2:00 am, Mayur Rustagi wrote: two job contexts cannot share data; are you collecting the data to the master and then sending it to the other context? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-03 Thread Eustache DIEMERT
Printing the model shows the intercept is always 0 :( Should I open a bug for that? 2014-07-02 16:11 GMT+02:00 Eustache DIEMERT eusta...@diemert.fr: Hi list, I'm benchmarking MLlib for a regression task [1] and get strange results. Namely, using RidgeRegressionWithSGD it seems the

Running the BroadcastTest.scala with TorrentBroadcastFactory in a standalone cluster

2014-07-03 Thread jackxucs
Hello, I am running the BroadcastTest example in a standalone cluster using spark-submit. I have 8 host machines and made Host1 the master. Host2 to Host8 act as 7 workers to connect to the master. The connection was fine as I could see all 7 hosts on the master web ui. The BroadcastTest example

Re: Run spark unit test on Windows 7

2014-07-03 Thread Denny Lee
Hi Konstantin, Could you please create a jira item at:  https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked? Thanks, Denny On July 2, 2014 at 11:45:24 PM, Konstantin Kudryavtsev (kudryavtsev.konstan...@gmail.com) wrote: It sounds really strange... I guess it is a bug,

Re: Kafka - streaming from multiple topics

2014-07-03 Thread Sergey Malov
That’s an obvious workaround, yes, thank you Tobias. However, I’m prototyping a substitute for a real batch process, where I’d have to create six streams (and possibly more). It could be a bit messy. On the other hand, under the hood the KafkaInputDStream which is created with this KafkaUtils call,

Re: reduceByKey Not Being Called by Spark Streaming

2014-07-03 Thread Dan H.
Hi All, I was able to resolve this matter with a simple fix. It seems that in order to process the reduceByKey and flatMap operations at the same time, the only way to resolve it was to increase the number of threads to more than 1. Since I'm developing on my personal machine for speed, I simply updated
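The usual form of that fix, as a sketch (master and app name are assumptions): receiver-based streaming needs one thread for the receiver and at least one more for processing, so "local" or "local[1]" leaves reduceByKey starved.

    import org.apache.spark.SparkConf

    // at least two local threads: one for the receiver, one for the actual work
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingDev")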

Re: write event logs with YARN

2014-07-03 Thread Andrew Or
Hi Christophe, another Andrew speaking. Your configuration looks fine to me. From the stack trace it seems that we are in fact closing the file system prematurely elsewhere in the system, such that when it tries to write the APPLICATION_COMPLETE file it throws the exception you see. This does

spark text processing

2014-07-03 Thread M Singh
Hi: Is there a way to find out when Spark has finished processing a text file (both for streaming and non-streaming cases)? Also, after processing, can Spark copy the file to another directory? Thanks

Re: issue with running example code

2014-07-03 Thread Gurvinder Singh
Just to provide more information on this issue. It seems that the SPARK_HOME environment variable is causing the issue. If I unset the variable in the spark-class script and run in local mode, my code runs fine without the exception. But if I run with SPARK_HOME, I get the exception mentioned below. I

Re: MLLib : Math on Vector and Matrix

2014-07-03 Thread Dmitriy Lyubimov
On Wed, Jul 2, 2014 at 11:40 PM, Xiangrui Meng men...@gmail.com wrote: Hi Dmitriy, It is sweet to have the bindings, but it is very easy to downgrade the performance with them. The BLAS/LAPACK APIs have been there for more than 20 years and they are still the top choice for high-performance

Re: Run spark unit test on Windows 7

2014-07-03 Thread Kostiantyn Kudriavtsev
Hi Denny, just created https://issues.apache.org/jira/browse/SPARK-2356 On Jul 3, 2014, at 7:06 PM, Denny Lee denny.g@gmail.com wrote: Hi Konstantin, Could you please create a jira item at: https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked? Thanks, Denny

Re: Run spark unit test on Windows 7

2014-07-03 Thread Denny Lee
Thanks! will take a look at this later today. HTH! On Jul 3, 2014, at 11:09 AM, Kostiantyn Kudriavtsev kudryavtsev.konstan...@gmail.com wrote: Hi Denny, just created https://issues.apache.org/jira/browse/SPARK-2356 On Jul 3, 2014, at 7:06 PM, Denny Lee denny.g@gmail.com wrote:

Anaconda Spark AMI

2014-07-03 Thread Benjamin Zaitlen
Hi All, I'm a dev at Continuum and we are developing a fair amount of tooling around Spark. A few days ago someone expressed interest in numpy+pyspark, and Anaconda came up as a reasonable solution. I spent a number of hours yesterday trying to rework the base Spark AMI on EC2 but sadly was

Re: Spark Streaming Error Help - ERROR actor.OneForOneStrategy: key not found:

2014-07-03 Thread jschindler
I think I have found my answers but if anyone has thoughts please share. After testing for a while I think the error doesn't have any effect on the process. I think it is the case that there must be elements left in the window from last run otherwise my system is completely whack. Please let me

Spark logging strategy on YARN

2014-07-03 Thread Kostiantyn Kudriavtsev
Hi all, Could you please share your best practices on writing logs in Spark? I’m running it on YARN, so when I check logs I’m a bit confused… Currently, I’m writing System.err.println to put a message in the log and access it via the YARN history server. But I don’t like this way… I’d like to use
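One common alternative, sketched here: use a log4j logger (already on Spark's classpath) instead of System.err.println; messages then land in each container's logs, which the YARN history server serves per executor. The @transient lazy val pattern keeps the logger from being serialized with closures.

    import org.apache.log4j.Logger

    object JobLogging {
      // re-created in each JVM instead of being shipped with the closure
      @transient lazy val log = Logger.getLogger("MyApp")
    }

    JobLogging.log.info("batch finished")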

Re: Anaconda Spark AMI

2014-07-03 Thread Jey Kottalam
Hi Ben, Has the PYSPARK_PYTHON environment variable been set in spark/conf/spark-env.sh to the path of the new python binary? FYI, there's a /root/copy-dirs script that can be handy when updating files on an already-running cluster. You'll want to restart the spark cluster for the changes to

Re: LIMIT with offset in SQL queries

2014-07-03 Thread Michael Armbrust
Doing an offset is actually pretty expensive in a distributed query engine, so in many cases it probably makes sense to just collect and then perform the offset as you are doing now. This is unless the offset is very large. Another limitation here is that HiveQL does not support OFFSET. That
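The collect-then-skip approach described above, as a sketch; only viable when limit + offset is small enough to hold on the driver (the table and column names are hypothetical):

    val rows = sqlContext.sql("SELECT name FROM people ORDER BY name LIMIT 120").collect()
    val page = rows.drop(100).take(20) // offset 100, page size 20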

Re: Which version of Hive support Spark Shark

2014-07-03 Thread Michael Armbrust
Spark SQL is based on Hive 0.12.0. On Thu, Jul 3, 2014 at 2:29 AM, Ravi Prasad raviprasa...@gmail.com wrote: Hi , Can any one please help me to understand which version of Hive support Spark and Shark -- -- Regards, RAVI PRASAD. T

Re: Running the BroadcastTest.scala with TorrentBroadcastFactory in a standalone cluster

2014-07-03 Thread Mosharaf Chowdhury
Hi Jack, 1. Several previous instances of the 'key not valid?' error have been attributed to memory issues, either memory allocated per executor or per task, depending on the context. You can google it to see some examples. 2. I think your case is similar, even though it's happening due to

Sample datasets for MLlib and Graphx

2014-07-03 Thread AlexanderRiggers
Hello! I want to play around with several different cluster settings and measure performance for MLlib and GraphX, and was wondering if anybody here could hit me up with datasets for these applications from 5GB onwards? I am mostly interested in SVM and Triangle Count, but would be glad for any

Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread Nick Pentreath
Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions For SVM there are a couple of ad click prediction datasets of pretty large size. For graph stuff, SNAP has large network data: https://snap.stanford.edu/data/ On Thu, Jul 3, 2014 at

Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread AlexanderRiggers
Nick Pentreath wrote Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions I was looking for files in LIBSVM format and never found anything of bigger size on Kaggle. Most competitions I've seen need data processing and feature generation, but maybe I've to take a

Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread Nick Pentreath
The Kaggle data is not in libsvm format so you'd have to do some transformation. The Criteo and KDD Cup datasets are, if I recall, fairly large. Criteo ad prediction data is around 2-3GB compressed, I think. To my knowledge these are the largest binary classification datasets I've come across

Re: Case class in java

2014-07-03 Thread Kevin Jung
This will load the listed jars when the SparkContext is created. In the case of the REPL, we define and import classes after the SparkContext is created. According to the above-mentioned site, the Executor installs a class loader in the 'addReplClassLoaderIfNeeded' method using the spark.repl.class.uri configuration. Then I will try to

Re: Kafka - streaming from multiple topics

2014-07-03 Thread Tobias Pfeiffer
Sergey, On Fri, Jul 4, 2014 at 1:06 AM, Sergey Malov sma...@collective.com wrote: On the other hand, under the hood the KafkaInputDStream which is created with this KafkaUtils call calls ConsumerConnector.createMessageStreams, which returns a Map[String, List[KafkaStream]] keyed by topic. It is,
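For reference, a sketch of the multi-topic createStream call under discussion, with hypothetical host, group, and topic names; note the resulting (key, message) pairs do not carry the topic name, which is the limitation Sergey is pointing at:

    import org.apache.spark.streaming.kafka.KafkaUtils

    // topic name -> number of consumer threads, all on one receiver
    val topicMap = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1)
    val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", topicMap)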

[no subject]

2014-07-03 Thread Steven Cox
Folks, I have a program derived from the Kafka streaming wordcount example which works fine standalone. Running on Mesos is not working so well. For starters, I get the error below, 'No FileSystem for scheme: hdfs'. I've looked at lots of promising comments on this issue so now I have - *

No FileSystem for scheme: hdfs

2014-07-03 Thread Steven Cox
...and a real subject line. From: Steven Cox [s...@renci.org] Sent: Thursday, July 03, 2014 9:21 PM To: user@spark.apache.org Subject: Folks, I have a program derived from the Kafka streaming wordcount example which works fine standalone. Running on Mesos is

Re: No FileSystem for scheme: hdfs

2014-07-03 Thread Soren Macbeth
Are the hadoop configuration files on the classpath for your mesos executors? On Thu, Jul 3, 2014 at 6:45 PM, Steven Cox s...@renci.org wrote: ...and a real subject line. -- From: Steven Cox [s...@renci.org] Sent: Thursday, July 03, 2014 9:21 PM To:

RE: No FileSystem for scheme: hdfs

2014-07-03 Thread Steven Cox
They weren't. They are now and the logs look a bit better - like perhaps some serialization is completing that wasn't before. But I still get the same error periodically. Other thoughts? From: Soren Macbeth [so...@yieldbot.com] Sent: Thursday, July 03, 2014 9:54

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-03 Thread Ankur Dave
A common reason for the 'Joining ... is slow' message is that you're joining VertexRDDs without having cached them first. This will cause Spark to recompute unnecessarily, and as a side effect, the same index will get created twice and GraphX won't be able to do an efficient zip join. For example,
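a sketch of the cached version, with graphA and graphB standing in for whatever graphs are being joined:

    // cache and materialize both sides so neither VertexRDD is recomputed
    // and its index built twice during the join
    val left = graphA.vertices.cache()
    val right = graphB.vertices.cache()
    left.count(); right.count() // force materialization before joining
    val joined = left.innerJoin(right) { (id, a, b) => (a, b) }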

SparkSQL with Streaming RDD

2014-07-03 Thread Chang Lim
Would appreciate help on: 1. How to convert a streaming RDD into a JavaSchemaRDD 2. How to structure the driver program to do interactive SparkSQL Using Spark 1.0 with Java. I have streaming code that does updateStateByKey resulting in a JavaPairDStream. I am using JavaDStream::compute(time) to get
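A sketch of one way to wire this up (in Scala for brevity, though the thread uses Java), assuming Spark 1.0 and a DStream of (word, count) pairs named stateDStream produced by updateStateByKey:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD[Product] -> SchemaRDD

    case class WordCount(word: String, count: Int) // hypothetical schema

    stateDStream.foreachRDD { rdd =>
      val table = rdd.map { case (w, c) => WordCount(w, c) }
      table.registerAsTable("counts") // Spark 1.0 naming
      val hot = sqlContext.sql("SELECT word FROM counts WHERE count > 10").collect()
    }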

Re: No FileSystem for scheme: hdfs

2014-07-03 Thread Akhil Das
Most likely you are missing the hadoop configuration files (present in conf/*.xml). Thanks Best Regards On Fri, Jul 4, 2014 at 7:38 AM, Steven Cox s...@renci.org wrote: They weren't. They are now and the logs look a bit better - like perhaps some serialization is completing that wasn't
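If adjusting the executors' classpath is awkward, a sketch of another route, loading the conf files into the Hadoop configuration Spark carries with its jobs (the paths below are the usual CDH defaults, an assumption):

    import org.apache.hadoop.fs.Path

    sc.hadoopConfiguration.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
    sc.hadoopConfiguration.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))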