How to add jar with SparkSQL HiveContext?

2014-06-17 Thread Earthson
I have a problem with the add jar command hql("add jar /.../xxx.jar") Error: Exception in thread "main" java.lang.AssertionError: assertion failed: No plan for AddJar ... How could I do this job with HiveContext? I can't find any API to do it. Does Spark SQL with Hive support UDF/UDAF?
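
A minimal sketch of the failing call, assuming Spark 1.0's HiveContext and hypothetical jar/class names:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.hql("add jar /path/to/udfs.jar")  // fails with "No plan for AddJar" before the SPARK-2128 fix
    hiveContext.hql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDF'")  // registering a Hive UDF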

Re: How to add jar with SparkSQL HiveContext?

2014-06-17 Thread Michael Armbrust
Can you try this in master? You are likely running into SPARK-2128 https://issues.apache.org/jira/browse/SPARK-2128. Michael On Mon, Jun 16, 2014 at 11:41 PM, Earthson earthson...@gmail.com wrote: I have a problem with add jar command hql(add jar /.../xxx.jar) Error: Exception in

Contribution to Spark MLLib

2014-06-17 Thread Jayati
Hello, I wish to contribute some algorithms to Spark MLlib, but at the same time I want to make sure I don't try something redundant. Would it be okay to let me know which algorithms aren't on your roadmap for the near future? Also, can I use Java to write

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-06-17 Thread qingyang li
Hi Steven, have you resolved this problem? I have encountered the same problem, too. 2014-04-18 3:48 GMT+08:00 Sean Owen so...@cloudera.com: Oh dear I read this as a build problem. I can build with the latest Java 7, including those versions of Spark and Mesos, no problem. I did not deploy

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-06-17 Thread qingyang li
I am using Spark 0.9.1, Mesos 0.19.0 and Tachyon 0.4.1. Is Spark 0.9.1 compatible with Mesos 0.19.0? 2014-06-17 15:50 GMT+08:00 qingyang li liqingyang1...@gmail.com: Hi Steven, have you resolved this problem? I have encountered the same problem, too. 2014-04-18 3:48 GMT+08:00 Sean Owen

Re: Can't get Master Kerberos principal for use as renewer

2014-06-17 Thread Finamore A.
Update. I've reconfigured the environment to use Spark 1.0.0 and the example finally worked! :) The difference for me was that Spark 1.0.0 only requires specifying the Hadoop conf dir (HADOOP_CONF_DIR=/etc/hadoop/conf/). I guess that with 0.9 there were problems in spotting this dir...but I'm not

Re: pyspark regression results way off

2014-06-17 Thread jamborta
Thanks, will try normalising it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p7720.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Michael Cutler
Admittedly, getting Spark Streaming / Kafka working for the first time can be a bit tricky with the web of dependencies that get pulled in. I've taken the KafkaWordCount example from the Spark project and set up a simple standalone SBT project that shows you how to get it working and using
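
A minimal standalone build.sbt sketch for such a project (names here are illustrative, not Michael's actual project), assuming the Spark 1.0.0 artifacts:

    name := "kafka-wordcount"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.0.0"  // pulls in the Kafka client transitively
    )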

Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Jianshi Huang
Hi, I'm stuck using either yarn-client or standalone-client mode. Either one hangs when I submit jobs; the last messages printed were: ... 14/06/17 02:37:17 INFO spark.SparkContext: Added JAR file:/x/home/jianshuang/tmp/lib/commons-vfs2.jar at

Re: Spark sql unable to connect to db2 hive metastore

2014-06-17 Thread Michael Armbrust
First a clarification: Spark SQL does not talk to HiveServer2, as that JDBC interface is for retrieving results from queries that are executed using Hive. Instead Spark SQL will execute queries itself by directly accessing your data using Spark. Spark SQL's Hive module can use JDBC to connect

Re: Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Jianshi Huang
For standalone-cluster mode, there's a scala.MatchError. Also it looks like the --jars configurations are not passed to the driver/worker node? (Also, copying from file:/path doesn't seem correct; shouldn't it copy from http://master/path ?) ... 14/06/17 04:15:30 INFO Worker: Asked to launch

news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Makoto Yui
Hello, I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on Hadoop 0.20.2-cdh3u6 but it does not work for a sparse dataset though the number of training examples used in the evaluation is just 1,000. It works fine for the dataset *news20.binary.1000* that has 178,560 features.

Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Sguj
I didn't fix the issue so much as work around it. I was running my cluster locally, so using HDFS was just a preference. The code worked with the local file system, so that's what I'm using until I can get some help. -- View this message in context:

join operation is taking too much time

2014-06-17 Thread MEETHU MATHEW
Hi all, I want to do a recursive leftOuterJoin between an RDD (created from a file) with 9 million rows (the file is 100 MB) and 30 other RDDs (created from 30 different files, one per iteration of a loop) varying from 1 to 6 million rows. When I run it for 5 RDDs, it runs successfully in

Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Xusen Yin
Hi Sguj and littlebird, I'll try to fix it tomorrow evening and the day after tomorrow, because I am now busy preparing a talk (slides) for tomorrow. Sorry for the inconvenience. Would you mind filing an issue on the Spark JIRA? 2014-06-17 20:55 GMT+08:00 Sguj tpcome...@yahoo.com: I didn't

Execution stalls in LogisticRegressionWithSGD

2014-06-17 Thread Bharath Ravi Kumar
Hi, (Apologies for the long mail, but it's necessary to provide sufficient details considering the number of issues faced.) I'm running into issues testing LogisticRegressionWithSGD on a two-node cluster (each node with 24 cores and 16G available to slaves out of 24G on the system). Here's a

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Luis Ángel Vicente Sánchez
After playing a bit, I have been able to create a fatjar this way: lazy val rootDependencies = Seq( "org.apache.spark" %% "spark-core" % "1.0.0" % "provided", "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided", "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"

Re: spark with docker: errors with akka, NAT?

2014-06-17 Thread Jacob Eisinger
Long story [1] short, akka opens up dynamic, random ports for each job [2]. So, simple NAT fails. You might try some trickery with a DNS server and docker's --net=host . [1] http://apache-spark-user-list.1001560.n3.nabble.com/Comprehensive-Port-Configuration-reference-tt5384.html#none [2]

Spark 1.0.0 java.lang.outOfMemoryError: Java Heap Space

2014-06-17 Thread Sguj
I've been trying to figure out how to increase the heap space for my Spark environment in 1.0.0, and everything I've found tells me to export something in Java opts, which is deprecated in 1.0.0, or to increase spark.executor.memory, which is already at 6G. I'm only trying to process about

Re: Spark 1.0.0 java.lang.outOfMemoryError: Java Heap Space

2014-06-17 Thread abhiguruvayya
Try repartitioning the RDD using coalesce(int partitions) before performing any transforms. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-java-lang-outOfMemoryError-Java-Heap-Space-tp7735p7736.html Sent from the Apache Spark User List mailing

Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Sguj
I can write one if you'll point me to where I need to write it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7737.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

spark streaming questions

2014-06-17 Thread Chen Song
Hey, I am new to Spark Streaming and apologize if these questions have been asked. * In StreamingContext, reduceByKey() seems to only work on the RDDs of the current batch interval, not including RDDs of previous batches. Is my understanding correct? * If the above statement is correct, what

Re: Memory footprint of Calliope: Spark - Cassandra writes

2014-06-17 Thread Andrew Ash
Gerard, Strings in particular are very inefficient because they're stored in a two-byte format by the JVM. If you use the Kryo serializer together with StorageLevel.MEMORY_ONLY_SER, then Kryo stores Strings in UTF-8, which for ASCII-like strings takes half the space. Andrew On Tue, Jun 17,
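
A minimal sketch of that combination (the RDD name is hypothetical):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // store the serialized form instead of Java objects on the heap:
    lines.persist(StorageLevel.MEMORY_ONLY_SER)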

Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
I am creating around 10 executors with 12 cores and 7g memory each, but when I launch a job not all executors are used. For example, if my job has 9 tasks, only 3 executors are used with 3 tasks each, and I believe this is making my app slower than a MapReduce program for the same use case.

Re: Worker dies while submitting a job

2014-06-17 Thread Luis Ángel Vicente Sánchez
Ok... I was checking the wrong version of that file yesterday. My worker is sending a DriverStateChanged(_, DriverState.FAILED, _) but there is no case branch for that state and the worker is crashing. I still don't know why I'm getting a FAILED state but I'm sure that should kill the actor due to

Re: Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Andrew Or
Standalone-client mode is not officially supported at the moment. Standalone-cluster and yarn-client modes, however, should work. For both modes, are you running spark-submit from within the cluster, or outside of it? If the latter, could you try running it from within the cluster and

Re: join operation is taking too much time

2014-06-17 Thread Andrew Or
How long does it get stuck for? This is a common sign of the OS thrashing due to out-of-memory exceptions. If you keep it running longer, does it throw an error? Depending on how large your other RDD is (and your join operation), memory pressure may or may not be the problem at all. It could be

Re: Worker dies while submitting a job

2014-06-17 Thread Luis Ángel Vicente Sánchez
I have been able to submit a job successfully but I had to config my spark job this way: val sparkConf: SparkConf = new SparkConf() .setAppName("TwitterPopularTags") .setMaster("spark://int-spark-master:7077") .setSparkHome("/opt/spark")

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Luis Ángel Vicente Sánchez
I have been able to submit a job successfully but I had to config my spark job this way: val sparkConf: SparkConf = new SparkConf() .setAppName("TwitterPopularTags") .setMaster("spark://int-spark-master:7077") .setSparkHome("/opt/spark")

Re: Spark sql unable to connect to db2 hive metastore

2014-06-17 Thread Jenny Zhao
Thanks Michael! Since I run it using spark-shell, I added both jars through the bin/spark-shell --jars option. I noticed that if I don't pass these jars, it complains it couldn't find the driver; if I pass them through the --jars option, it complains there is no suitable driver. Regards. On Tue, Jun 17,

Re: What is the best way to handle transformations or actions that takes forever?

2014-06-17 Thread Peng Cheng
I've tried enabling speculative jobs; this seems to have partially solved the problem. However, I'm not sure it can handle large-scale situations, as speculation only starts when 75% of the job is finished. -- View this message in context:

Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
Can someone help me with this? Any help is appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7753.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: What is the best way to handle transformations or actions that takes forever?

2014-06-17 Thread Daniel Darabos
I think you need to implement a timeout in your code. As far as I know, Spark will not interrupt the execution of your code as long as the driver is connected. Might be an idea though. On Tue, Jun 17, 2014 at 7:54 PM, Peng Cheng pc...@uow.edu.au wrote: I've tried enabling the speculative jobs,

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Gino Bustelo
Luis' experience validates what I'm seeing. You still have to set the properties in the SparkConf for the context to work. For example, the master URL and jars are specified again in the app. Gino B. On Jun 17, 2014, at 12:05 PM, Luis Ángel Vicente Sánchez langel.gro...@gmail.com wrote: I

Re: Executors not utilized properly.

2014-06-17 Thread Sean Owen
It sounds like your job has 9 tasks and all are executing simultaneously in parallel. This is as good as it gets, right? Are you asking how to break the work into more tasks, like 120 to match your 10*12 cores? Make your RDD have more partitions. For example, the textFile method can override the
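
For example, a sketch of asking textFile for more partitions up front (the path and count are hypothetical; 120 would match 10 executors * 12 cores):

    val lines = sc.textFile("hdfs:///path/to/input", 120)  // second argument is the minimum number of partitions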

Re: join operation is taking too much time

2014-06-17 Thread Daniel Darabos
I've been wondering about this. Is there a difference in performance between these two? val rdd1 = sc.textFile(files.mkString(",")) val rdd2 = sc.union(files.map(sc.textFile(_))) I don't know about your use-case, Meethu, but it may be worth trying to see if reading all the files into one RDD

Problems running Spark job on mesos in fine-grained mode

2014-06-17 Thread Sébastien Rainville
Hi, I'm having trouble running spark on mesos in fine-grained mode. I'm running spark 1.0.0 and mesos 0.18.0. The tasks are failing randomly, which most of the time, but not always, cause the job to fail. The same code is running fine in coarse-grained mode. I see the following exceptions in the

Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
I did try creating more partitions by overriding the default number of partitions determined by HDFS splits. The problem is, in this case the program runs forever. I have the same set of inputs for MapReduce and Spark. Where MapReduce takes 2 mins, Spark takes 5 mins to complete the job. I

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Makoto Yui
Here is a follow-up to the previous evaluation. aggregate at GradientDescent.scala:178 never finishes at https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L178 We confirmed, by -verbose:gc, that GC is not happening during the

Spark streaming RDDs to Parquet records

2014-06-17 Thread maheshtwc
Hello, Is there an easy way to convert RDDs within a DStream into Parquet records? Here is some incomplete pseudo code: // Create streaming context val ssc = new StreamingContext(...) // Obtain a DStream of events val ds = KafkaUtils.createStream(...) // Get Spark context to get to the SQL

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
Hi Makoto, How many partitions did you set? If there are too many partitions, please do a coalesce before calling ML algorithms. Btw, could you try the tree branch in my repo? https://github.com/mengxr/spark/tree/tree I used tree aggregate in this branch. It should help with the scalability.

Spark SQL: No function to evaluate expression

2014-06-17 Thread Zuhair Khayyat
Dear all, I am trying to run the following query on Spark SQL using some custom TPC-H tables with standalone Spark cluster configuration: SELECT * FROM history a JOIN history b ON a.o_custkey = b.o_custkey WHERE a.c_address b.c_address; Unfortunately I get the following error during execution:

Re: Contribution to Spark MLLib

2014-06-17 Thread Xiangrui Meng
Hi Jayati, Thanks for asking! MLlib algorithms are all implemented in Scala. It makes it easier for us to maintain if we have the implementations in one place. For the roadmap, please visit http://www.slideshare.net/xrmeng/m-llib-hadoopsummit to see features planned for v1.1. Before contributing new

Re: Execution stalls in LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
Hi Bharath, Thanks for posting the details! Which Spark version are you using? Best, Xiangrui On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi, (Apologies for the long mail, but it's necessary to provide sufficient details considering the number of issues

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread DB Tsai
Hi Xiangrui, What's the difference between treeAggregate and aggregate? Why does treeAggregate scale better? What if we just use mapPartition; will it be as fast as treeAggregate? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Makoto Yui
Hi Xiangrui, (2014/06/18 4:58), Xiangrui Meng wrote: How many partitions did you set? If there are too many partitions, please do a coalesce before calling ML algorithms. The training data news20.random.1000 is small and thus only 2 partitions are used by default. val training =

Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread Krishna Sankar
Mahesh, - One direction could be: create a Parquet schema, then convert and save the records to HDFS. - This might help https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala Cheers k/ On Tue, Jun 17, 2014 at 12:52 PM,

Re: Executors not utilized properly.

2014-06-17 Thread Jey Kottalam
Hi Abhishek, Where mapreduce is taking 2 mins, spark is taking 5 min to complete the job. Interesting. Could you tell us more about your program? A code skeleton would certainly be helpful. Thanks! -Jey On Tue, Jun 17, 2014 at 3:21 PM, abhiguruvayya sharath.abhis...@gmail.com wrote: I did

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
Hi DB, treeReduce (treeAggregate) is a feature I'm testing now. It is a compromise between current reduce and butterfly allReduce. The former runs in linear time on the number of partitions, the latter introduces too many dependencies. treeAggregate with depth = 2 should run in O(sqrt(n)) time,
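
(Intuition for the O(sqrt(n)) figure: with depth = 2, the n partition results are first combined in groups of roughly sqrt(n), leaving about sqrt(n) partial results for the driver to merge, so no single step sequentially merges more than about sqrt(n) vectors.)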

Unit test failure: Address already in use

2014-06-17 Thread SK
Hi, I have 3 unit tests (independent of each other) in the /src/test/scala folder. When I run each of them individually using sbt test-only test, all 3 pass. But when I run them all using sbt test, they fail with the warning below. I am wondering if the binding exception
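
A common fix, assuming each suite creates its own SparkContext and the parallel suites race for the same ports, is to run the suites serially (and make sure each suite stops its context when it finishes), e.g. in build.sbt:

    parallelExecution in Test := false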

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
Hi Makoto, Are you using Spark 1.0 or 0.9? Could you go to the executor tab of the web UI and check the driver's memory? treeAggregate is not part of 1.0. Best, Xiangrui On Tue, Jun 17, 2014 at 2:00 PM, Xiangrui Meng men...@gmail.com wrote: Hi DB, treeReduce (treeAggregate) is a feature I'm

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread DB Tsai
Hi Xiangrui, Does it mean that mapPartition followed by reduce shares the same behavior as the aggregate operation, which is O(n)? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Jun 17,

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Makoto Yui
Hi Xiangrui, (2014/06/18 6:03), Xiangrui Meng wrote: Are you using Spark 1.0 or 0.9? Could you go to the executor tab of the web UI and check the driver's memory? I am using Spark 1.0. 588.8 MB is allocated for driver RDDs. I am setting SPARK_DRIVER_MEMORY=2g in the conf/spark-env.sh. The

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Debasish Das
Xiangrui, Could you point to the JIRA related to tree aggregate ? ...sounds like the allreduce idea... I would definitely like to try it on our dataset... Makoto, I did run pretty big sparse dataset (20M rows, 3M sparse features) and I got 100 iterations of SGD running in 200 seconds...10

Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread contractor
Thanks Krishna. Seems like you have to use Avro and then convert that to Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll look into this some more. Thanks, Mahesh From: Krishna Sankar ksanka...@gmail.commailto:ksanka...@gmail.com Reply-To:

Re: Spark sql unable to connect to db2 hive metastore

2014-06-17 Thread Jenny Zhao
Finally got it to work: I mimicked how Spark adds the datanucleus jars in compute-classpath.sh and added the db2jcc*.jar to the classpath, and it works now. Thanks! On Tue, Jun 17, 2014 at 10:50 AM, Jenny Zhao linlin200...@gmail.com wrote: Thanks Michael! Since I run it using spark-shell, I added

Best practices for removing lineage of a RDD or Graph object?

2014-06-17 Thread dash
If an RDD object has non-empty .dependencies, does that mean it has lineage? How could I remove it? I'm doing iterative computing and each iteration depends on the result computed in the previous iteration. After several iterations, it will throw a StackOverflowError. At first I tried to use
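
A hedged sketch of the usual checkpoint-based workaround (the checkpoint dir, step function and interval are all illustrative):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // should be reliable storage on a real cluster
    var rdd = initialRDD
    for (i <- 1 to numIterations) {
      rdd = step(rdd).cache()  // step: RDD[T] => RDD[T], hypothetical iteration body
      if (i % 10 == 0) {
        rdd.checkpoint()
        rdd.count()  // force materialization so the lineage is actually truncated
      }
    }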

Why MLLib classes are so badly organized?

2014-06-17 Thread frol
Can anybody explain WHY: 1) LabeledPoint is in regression/LabeledPoint.scala? This causes classification modules to import from regression modules. 2) Vector and SparseVector are in linalg? OK. GeneralizedLinearModel is in regression/GeneralizedLinearAlgorithm.scala? Really? 3) LinearModel is in

Re: spark with docker: errors with akka, NAT?

2014-06-17 Thread Mohit Jaggi
I am using cutting edge code from git but doing my own sbt assembly. On Mon, Jun 16, 2014 at 10:28 PM, Andre Schumacher schum...@icsi.berkeley.edu wrote: Hi, are you using the amplab/spark-1.0.0 images from the global registry? Andre On 06/17/2014 01:36 AM, Mohit Jaggi wrote: Hi

Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Jeremy Lee
Some people (me included) might have wondered why all our m1.large spot instances (in us-west-1) shut down a few hours ago... Simple reason: the EC2 spot price for Spark's default m1.large instances just jumped from $0.016 per hour to about $0.750. Yes, fifty times. Probably something to do with

Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread Michael Armbrust
If you convert the data to a SchemaRDD you can save it as Parquet: http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet On Tue, Jun 17, 2014 at 11:47 PM, Padmanabhan, Mahesh (contractor) mahesh.padmanab...@twc-contractor.com wrote: Thanks Krishna. Seems like you have
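
A hedged sketch of that suggestion for the streaming case, assuming Spark 1.0 and a hypothetical Event case class and parseEvent function:

    import org.apache.spark.sql.SQLContext

    case class Event(id: String, value: Int)
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[case class] => SchemaRDD

    ds.foreachRDD { (rdd, time) =>
      rdd.map(parseEvent)  // parseEvent: (String, String) => Event, hypothetical
         .saveAsParquetFile(s"hdfs:///events/batch-${time.milliseconds}.parquet")  // one output dir per batch
    }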

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
DB, Yes, reduce and aggregate are linear. Makoto, dense vectors are used in aggregation. If you have 32 partitions and each one sends a dense vector of size 1,354,731 to the master, then the driver needs 300M+. That may be the problem. Which deploy mode are you using, standalone or local?
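
(For reference, the arithmetic behind that estimate: 32 partitions × 1,354,731 doubles × 8 bytes/double ≈ 347 MB of result vectors arriving at the driver, consistent with the 300M+ figure.)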

Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
I found the main reason to be that I was using coalesce instead of repartition. coalesce was shrinking the partitioning, so the number of tasks was too small to be spread across all of the executors. Can you help me understand when to use coalesce and when to use repartition? In application

Re: spark with docker: errors with akka, NAT?

2014-06-17 Thread Aaron Davidson
I remember having to do a similar thing in the spark docker scripts for testing purposes. Were you able to modify the /etc/hosts directly? I remember issues with that as docker apparently mounts it as part of its read-only filesystem. On Tue, Jun 17, 2014 at 4:36 PM, Mohit Jaggi

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
Makoto, please use --driver-memory 8G when you launch spark-shell. -Xiangrui On Tue, Jun 17, 2014 at 4:49 PM, Xiangrui Meng men...@gmail.com wrote: DB, Yes, reduce and aggregate are linear. Makoto, dense vectors are used in aggregation. If you have 32 partitions and each one sends a
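
For reference, the flag goes on the command line when launching the shell:

    bin/spark-shell --driver-memory 8G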

Re: Executors not utilized properly.

2014-06-17 Thread Aaron Davidson
repartition() is actually just an alias of coalesce(), but with the shuffle flag set to true. This shuffle is probably what you're seeing as taking longer, but it is required when you go from a smaller number of partitions to a larger. When actually decreasing the number of partitions,
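
In short (names hypothetical):

    val fewer = rdd.coalesce(10)      // narrow dependency, no shuffle; only useful for reducing the partition count
    val more  = rdd.repartition(120)  // same as coalesce(120, shuffle = true); required to increase it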

Issue while trying to aggregate with a sliding window

2014-06-17 Thread Hatch M
Trying to aggregate over a sliding window, playing with the slide duration. Playing around with the slide interval, I can see the aggregation works but it mostly fails with the below error. The stream has records coming in at 100ms. JavaPairDStream<String, AggregateObject> aggregatedDStream =

Re: Issue while trying to aggregate with a sliding window

2014-06-17 Thread onpoq l
There is a bug: https://github.com/apache/spark/pull/961#issuecomment-45125185 On Tue, Jun 17, 2014 at 8:19 PM, Hatch M hatchman1...@gmail.com wrote: Trying to aggregate over a sliding window, playing with the slide duration. Playing around with the slide interval I can see the aggregation

Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
Perfect!! That makes so much sense to me now. Thanks a ton -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7793.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Execution stalls in LogisticRegressionWithSGD

2014-06-17 Thread Bharath Ravi Kumar
Hi Xiangrui, I'm using 1.0.0. Thanks, Bharath On 18-Jun-2014 1:43 am, Xiangrui Meng men...@gmail.com wrote: Hi Bharath, Thanks for posting the details! Which Spark version are you using? Best, Xiangrui On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar reachb...@gmail.com wrote:

Re: Spark SQL: No function to evaluate expression

2014-06-17 Thread Tobias Pfeiffer
The error message *means* that there is no column called c_address. However, maybe it's a bug with Spark SQL not understanding the a.c_address syntax. Can you double-check the column name is correct? Thanks Tobias On Wed, Jun 18, 2014 at 5:02 AM, Zuhair Khayyat zuhair.khay...@gmail.com wrote:

Spark Streaming Example with CDH5

2014-06-17 Thread manas Kar
Hi Spark Gurus, I am trying to compile a Spark Streaming example with CDH5 and am having problems compiling it. Has anyone created an example Spark Streaming app using CDH5 (preferably Spark 0.9.1) and would be kind enough to share the build.sbt(.scala) file (or point to their example on GitHub)? I know

Re: NullPointerExceptions when using val or broadcast on a standalone cluster.

2014-06-17 Thread bdamos
Hi, I think this is a bug in Spark, because changing my program to using a main method instead of using the App trait fixes this problem. I've filed this as SPARK-2175, apologies if this turns out to be a duplicate. https://issues.apache.org/jira/browse/SPARK-2175 Regards, Brandon. -- View

Re: Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Jianshi Huang
Hi Andrew, I submitted it from within the cluster. It looks like standalone-cluster mode didn't put the jars on its HTTP server, and passed the file:/... URLs to the driver node. That's why the driver node couldn't find the jars. However, even after I copied my files to all slaves, it still didn't work, see my

Re: Execution stalls in LogisticRegressionWithSGD

2014-06-17 Thread Bharath Ravi Kumar
A couple more points: 1) The inexplicable stalling of execution with large feature sets appears similar to that reported with the news-20 dataset: http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3c53a03542.1010...@gmail.com%3E 2) The NPE occurs trying to call mapToPair to convert an RDD<Long,

Re: spark with docker: errors with akka, NAT?

2014-06-17 Thread Mohit Jaggi
I used --privileged to start the container and then unmounted /etc/hosts. Then I created a new /etc/hosts file On Tue, Jun 17, 2014 at 4:58 PM, Aaron Davidson ilike...@gmail.com wrote: I remember having to do a similar thing in the spark docker scripts for testing purposes. Were you able to

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Makoto Yui
Hi Xiangrui, (2014/06/18 8:49), Xiangrui Meng wrote: Makoto, dense vectors are used in aggregation. If you have 32 partitions and each one sends a dense vector of size 1,354,731 to the master, then the driver needs 300M+. That may be the problem. It seems that it could cause certain problems

Re: spark with docker: errors with akka, NAT?

2014-06-17 Thread Aaron Davidson
Yup, alright, same solution then :) On Tue, Jun 17, 2014 at 7:39 PM, Mohit Jaggi mohitja...@gmail.com wrote: I used --privileged to start the container and then unmounted /etc/hosts. Then I created a new /etc/hosts file On Tue, Jun 17, 2014 at 4:58 PM, Aaron Davidson ilike...@gmail.com

Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Patrick Wendell
Hey Jeremy, This is patched in the 1.0 and 0.9 branches of Spark. We're likely to make a 1.0.1 release soon (this patch being one of the main reasons), but if you are itching for this sooner, you can just checkout the head of branch-1.0 and you will be able to use r3.XXX instances. - Patrick On

rdd.cache() is not faster?

2014-06-17 Thread Wei Tan
Hi, I have a 40G file which is a concatenation of multiple documents. I want to extract two features (title and tables) from each doc, so the program is like this: - val file = sc.textFile("/path/to/40G/file") //file.cache() //to

Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Patrick Wendell
By the way, in case it's not clear, I mean our maintenance branches: https://github.com/apache/spark/tree/branch-1.0 On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Jeremy, This is patched in the 1.0 and 0.9 branches of Spark. We're likely to make a 1.0.1

Wildcard support in input path

2014-06-17 Thread Jianshi Huang
It would be convenient if Spark's textFile, parquetFile, etc. could support paths with wildcards, such as: hdfs://domain/user/jianshuang/data/parquet/table/month=2014* Or is there already a way to do it now? Jianshi -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog:

Re: Wildcard support in input path

2014-06-17 Thread MEETHU MATHEW
Hi Jianshi, I have used wildcard characters (*) in my program and it worked. My code was like this: b = sc.textFile("hdfs:///path to file/data_file_2013SEP01*") Thanks Regards, Meethu M On Wednesday, 18 June 2014 9:29 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: It would be

question about setting SPARK_CLASSPATH IN spark_env.sh

2014-06-17 Thread santhoma
Hi, This is about Spark 0.9. I have a 3-node Spark cluster. I want to add a locally available jar file (present on all nodes) to the SPARK_CLASSPATH variable in /etc/spark/conf/spark-env.sh so that all nodes can access it. The question is: should I edit 'spark-env.sh' on all nodes to add the jar?
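
For reference, the kind of line in question, assuming the jar sits at the same local path on every node:

    export SPARK_CLASSPATH=/path/to/local-library.jar:$SPARK_CLASSPATH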

Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Jeremy Lee
I am about to spin up some new clusters, so I may give that a go... any special instructions for making them work? I assume I use the --spark-git-repo= option on the spark-ec2 command. Is it as easy as concatenating your string as the value? On cluster management GUIs... I've been looking around

Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Patrick Wendell
Actually you'll just want to clone the 1.0 branch then use the spark-ec2 script in there to launch your cluster. The --spark-git-repo flag is if you want to launch with a different version of Spark on the cluster. In your case you just need a different version of the launch script itself, which

Re: Wildcard support in input path

2014-06-17 Thread Patrick Wendell
These paths get passed directly to the Hadoop FileSystem API, and I think they support globbing out of the box. So AFAIK it should just work. On Tue, Jun 17, 2014 at 9:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi Jianshi, I have used wildcard characters (*) in my program and it

Re: Wildcard support in input path

2014-06-17 Thread Andrew Ash
In Spark you can use the normal globs supported by Hadoop's FileSystem, which are documented here: http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path) On Wed, Jun 18, 2014 at 12:09 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
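
For example (path hypothetical):

    val logs = sc.textFile("hdfs:///user/jianshuang/logs/2014-06-*")  // ?, *, [abc] and {a,b} globs are all supported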

Re: Spark SQL: No function to evaluate expression

2014-06-17 Thread Michael Armbrust
Yeah, sorry that error message is not very intuitive. There is already a JIRA open to make it better: SPARK-2059 https://issues.apache.org/jira/browse/SPARK-2059 Also, a bug has been fixed in master regarding attributes that contain _. So if you are running 1.0 you might try upgrading. On

Re: Un-serializable 3rd-party classes (Spark, Java)

2014-06-17 Thread Matei Zaharia
There are a few options: - Kryo might be able to serialize these objects out of the box, depending on what's inside them. Try turning it on as described at http://spark.apache.org/docs/latest/tuning.html. - If that doesn't work, you can create your own “wrapper” objects that implement
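
A hedged sketch of the wrapper idea, assuming a third-party class that isn't Serializable but can be rebuilt from a string (ThirdPartyThing, toSpec and fromSpec are hypothetical):

    import java.io.{ObjectInputStream, ObjectOutputStream}

    class ThirdPartyWrapper(@transient var inner: ThirdPartyThing) extends Serializable {
      private def writeObject(out: ObjectOutputStream): Unit = {
        out.writeUTF(inner.toSpec)                       // write a description we can rebuild from
      }
      private def readObject(in: ObjectInputStream): Unit = {
        inner = ThirdPartyThing.fromSpec(in.readUTF())   // reconstruct the object on deserialization
      }
    }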

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-17 Thread Patrick Wendell
Out of curiosity - are you guys using speculation, shuffle consolidation, or any other non-default option? If so that would help narrow down what's causing this corruption. On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Matt/Ryan, Did you make any headway

Re: Issue while trying to aggregate with a sliding window

2014-06-17 Thread Hatch M
Thanks! Will try to get the fix and retest. On Tue, Jun 17, 2014 at 5:30 PM, onpoq l onpo...@gmail.com wrote: There is a bug: https://github.com/apache/spark/pull/961#issuecomment-45125185 On Tue, Jun 17, 2014 at 8:19 PM, Hatch M hatchman1...@gmail.com wrote: Trying to aggregate over a