Re: Kafka client - specify offsets?

2014-06-16 Thread Tobias Pfeiffer
Hi, there are apparently helpers to tell you the offsets https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example#id-0.8.0SimpleConsumerExample-FindingStartingOffsetforReads, but I have no idea how to pass that to the Kafka stream consumer. I am interested in that as well.
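For what it's worth, the high-level consumer that Spark's Kafka stream uses (as of Spark 0.9/1.0) does not accept arbitrary starting offsets; the closest knob is Kafka 0.8's `auto.offset.reset` consumer property, which can be passed in the `kafkaParams` map. A minimal sketch (host and group names are placeholders):

```python
# Kafka 0.8 high-level consumer properties (per the Kafka consumer config docs).
# Arbitrary start offsets are not supported here; auto.offset.reset only picks
# between the earliest retained offset ("smallest") and the latest ("largest").
kafka_params = {
    "zookeeper.connect": "zk-host:2181",   # placeholder ZooKeeper host
    "group.id": "my-consumer-group",       # hypothetical consumer group
    "auto.offset.reset": "smallest",       # start from earliest retained offset
}
```

Reading from an exact offset requires the SimpleConsumer API linked above, outside the stream receiver.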

Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Wei Da
Hi guys, We are making a choice between C++ MPI and Spark. Is there any official comparison between them? Thanks a lot! Wei

Spark streaming with Redis? Working with large number of model objects at spark compute nodes.

2014-06-16 Thread tnegi
We are creating a real-time stream processing system with Spark Streaming which applies a large number (millions) of analytic models to RDDs in many different types of streams. Since we do not know which Spark node will process a specific RDD, we need to make these models available at each
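Two common patterns for this are a broadcast variable (when the models fit in memory and are shipped once per job) or a per-worker lazy singleton that fetches models from an external store such as Redis on first use. A minimal sketch of the singleton pattern, with hypothetical names and a stand-in for the actual Redis/disk load:

```python
# Sketch: build the model table once per worker process, on first access,
# instead of shipping it inside every task closure.
_models = None

def get_models():
    """Lazily construct/fetch the model table (stand-in for a Redis load)."""
    global _models
    if _models is None:
        _models = {"m1": (lambda x: x * 2)}  # hypothetical model id -> scorer
    return _models

def score(model_id, x):
    """Look up one model and apply it to a record value."""
    return get_models()[model_id](x)
```

Inside a `mapPartitions` call this keeps the load cost to once per executor rather than once per record.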

What is the best way to handle transformations or actions that takes forever?

2014-06-16 Thread Peng Cheng
My transformations or actions have some external toolset dependencies, and sometimes they just get stuck somewhere and there is no way I can fix them. If I don't want the job to run forever, do I need to implement several monitor threads to throw an exception when they get stuck? Or the framework can
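One way to bound a hanging external call without framework support is to wrap it in a worker thread with a hard deadline. A minimal sketch (helper name is hypothetical; note the hung thread itself is abandoned, not killed):

```python
import concurrent.futures

def with_timeout(seconds, fn, *args):
    """Run fn(*args) in a worker thread; return its result, or None if it
    does not finish within `seconds`. The stuck thread is left behind."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        return None
    finally:
        pool.shutdown(wait=False)  # do not block waiting for the stuck call
```

Raising an exception instead of returning None would let the task fail fast and be retried by the scheduler.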

Re: wholeTextFiles not working with HDFS

2014-06-16 Thread littlebird
Hi, I have the same exception. Can you tell me how you fixed it? Thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7665.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Michael Cutler
Hello Wei, I speak from the experience of writing many HPC distributed applications using Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel Virtual Machine (PVM) way before that, back in the 90's. I can say with absolute certainty: *Any gains you believe there are because C++ is

Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-16 Thread Jianshi Huang
Hi, I'm trying to use Accumulo with Spark by writing to AccumuloOutputFormat. It went all well on my laptop (Accumulo MockInstance + Spark Local mode). But when I try to submit it to the yarn cluster, the yarn logs shows the following error message: 14/06/16 02:01:44 INFO

Re: Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-16 Thread Akhil Das
Hi Check your driver program's Environment page (e.g. http://192.168.1.39:4040/environment/). If you don't see the commons-codec-1.7.jar jar there, then that's the issue. Thanks Best Regards On Mon, Jun 16, 2014 at 5:07 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I'm trying to use Accumulo

Re: guidance on simple unit testing with Spark

2014-06-16 Thread Daniel Siegmann
If you don't want to refactor your code, you can put your input into a test file. After the test runs, read the data from the output file you specified (you probably want this to be a temp file deleted on exit). Of course, that is not really a unit test - Matei's suggestion is preferable (this is
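The file-based approach above can be sketched as follows; `run_job` is a hypothetical entry point standing in for the real Spark job, and the transformation shown is a stand-in:

```python
import os
import tempfile

def run_job(in_path, out_path):
    """Hypothetical job entry point: read input path, write output path."""
    with open(in_path) as f, open(out_path, "w") as g:
        g.write(f.read().upper())  # stand-in for the real transformation

def test_job_via_files():
    # Create temp input/output files and clean them up afterwards.
    fd_in, in_path = tempfile.mkstemp(suffix=".txt")
    fd_out, out_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd_in)
    os.close(fd_out)
    try:
        with open(in_path, "w") as f:
            f.write("a\nb\n")
        run_job(in_path, out_path)
        with open(out_path) as f:
            assert f.read() == "A\nB\n"
    finally:
        os.remove(in_path)
        os.remove(out_path)
```

As noted, this is an integration-style test; factoring the transformation out so it can be called on plain collections makes a true unit test possible.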

Re: long GC pause during file.cache()

2014-06-16 Thread Wei Tan
Thank you all for the advice, including (1) using CMS GC, (2) using multiple worker instances, and (3) using Tachyon. I will try (1) and (2) first and report back what I find. I will also try JDK 7 with G1 GC. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T.

Re: long GC pause during file.cache()

2014-06-16 Thread Wei Tan
BTW: nowadays a single machine with huge RAM (200G to 1T) is really common. With virtualization you lose some performance. It would be ideal to see some best practices on how to use Spark on these state-of-the-art machines... Best regards, Wei - Wei Tan, PhD

pyspark regression results way off

2014-06-16 Thread jamborta
Hi all, I am testing the regression methods (SGD) using pyspark. I tried to tune the parameters, but the results are far off from those obtained using R. Is there some way to set these parameters more effectively? thanks,

Re: pyspark regression results way off

2014-06-16 Thread jamborta
forgot to mention that I'm running spark 1.0

Memory footprint of Calliope: Spark - Cassandra writes

2014-06-16 Thread Gerard Maas
Hi, I've been doing some testing with Calliope as a way to do batch load from Spark into Cassandra. My initial results are promising on the performance area, but worrisome on the memory footprint side. I'm generating N records of about 50 bytes each and using the UPDATE mutator to insert them

RE: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
Hi Xiangrui, Thank you for the reply! I have tried customizing LogisticRegressionSGD.optimizer as in the example you mentioned, but the source code reveals that the intercept is also penalized if one is included, which is usually inappropriate. The developer should fix this problem. Best,

Need some Streaming help

2014-06-16 Thread Yana Kadiyska
Like many people, I'm trying to do hourly counts. The twist is that I don't want to count per hour of streaming, but per hour of the actual occurrence of the event (wall clock, say yyyy-MM-dd HH). My thought is to make the streaming window large enough that a full hour of streaming data would fit
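One way to count by event time rather than arrival time is to derive the key from the event's own timestamp, so records can be reduced by (hour bucket, event key) regardless of which batch they arrive in. A minimal sketch of the bucketing (function name hypothetical; UTC assumed):

```python
from datetime import datetime, timezone

def hour_key(event_time_millis):
    """Bucket an event by the wall-clock hour it occurred in,
    e.g. 1402938000000 -> '2014-06-16 17' (UTC)."""
    dt = datetime.fromtimestamp(event_time_millis / 1000.0, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H")
```

Late-arriving events then land in the correct bucket as long as the window (or state) is kept open long enough to receive them.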

Fwd: spark streaming questions

2014-06-16 Thread Chen Song
Hey I am new to spark streaming and apologize if these questions have been asked. * In StreamingContext, reduceByKey() seems to only work on the RDDs of the current batch interval, not including RDDs of previous batches. Is my understanding correct? * If the above statement is correct, what
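The understanding in the first question is correct: `reduceByKey` on a DStream operates on each batch's RDDs independently. Carrying state across batches is what `updateStateByKey` (or windowed operations) is for. A sketch of the shape of an update function, which Spark Streaming calls once per key per batch with that batch's new values and the previous state:

```python
def update_running_count(new_values, state):
    """Per-key state update: previous state (or None on first sight of the
    key) plus this batch's values; the return value becomes the new state."""
    return (state or 0) + sum(new_values)
```

In pyspark this would be passed as `pairs.updateStateByKey(update_running_count)` with checkpointing enabled.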

Worker dies while submitting a job

2014-06-16 Thread Luis Ángel Vicente Sánchez
I'm playing with a modified version of the TwitterPopularTags example and when I tried to submit the job to my cluster, workers keep dying with this message: 14/06/16 17:11:16 INFO DriverRunner: Launch Command: java -cp

Re: pyspark regression results way off

2014-06-16 Thread DB Tsai
Is your data normalized? Sometimes GD doesn't work well if the data has a wide range. If you are willing to write Scala code, you can try the LBFGS optimizer, which converges better than GD. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com
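The normalization suggested above is typically standardization: scale each feature column to zero mean and unit variance before running SGD. A minimal sketch (helper name is an assumption, not an MLlib API):

```python
import math

def standardize(rows):
    """Scale each feature column of a row-major matrix to zero mean and
    unit variance (columns with zero variance are mapped to 0.0)."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(dims)]
    return [[0.0 if stds[j] == 0 else (r[j] - means[j]) / stds[j]
             for j in range(dims)]
            for r in rows]
```

The same means and standard deviations must be applied to any held-out data, and the learned weights interpreted on the scaled features.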

Accessing the per-key state maintained by updateStateByKey for transformation of JavaPairDStream

2014-06-16 Thread Gaurav Jain
Hello Spark Streaming Experts I have a use-case, where I have a bunch of log-entries coming in, say every 10 seconds (Batch-interval). I create a JavaPairDStream[K,V] from these log-entries. Now, there are two things I want to do with this JavaPairDStream: 1. Use key-dependent state (updated by

Re: pyspark serializer can't handle functions?

2014-06-16 Thread Matei Zaharia
It’s true that it can’t. You can try to use the CloudPickle library instead, which is what we use within PySpark to serialize functions (see python/pyspark/cloudpickle.py). However I’m also curious, why do you need an RDD of functions? Matei On Jun 15, 2014, at 4:49 PM, madeleine

Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread Xiangrui Meng
Someone is working on weighted regularization. Stay tuned. -Xiangrui On Mon, Jun 16, 2014 at 9:36 AM, FIXED-TERM Yi Congrui (CR/RTC1.3-NA) fixed-term.congrui...@us.bosch.com wrote: Hi Xiangrui, Thank you for the reply! I have tried customizing LogisticRegressionSGD.optimizer as in the

Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread DB Tsai
Hi Congrui, We're working on weighted regularization, so for the intercept, you can just set its regularization weight to 0. It's also useful when the data is normalized but you want to solve the regularization with the original data. Sincerely, DB Tsai --- My Blog:

RE: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
Thank you! I'm really looking forward to that. Best, Congrui -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Monday, June 16, 2014 11:19 AM To: user@spark.apache.org Subject: Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

Re: MLlib-a problem of example code for L-BFGS

2014-06-16 Thread DB Tsai
Hi Congrui, I mean create your own TrainMLOR.scala with all the code provided in the example, and have it under package org.apache.spark.mllib Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

RE: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread Congrui Yi
Hi DB, Thank you for the reply! I'm looking forward to this change, which surely adds much more flexibility to the optimizers, including whether or not the intercept should be penalized. Sincerely, Congrui Yi From: DB Tsai-2 [via Apache Spark User List]

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Bertrand Dechoux
I guess you have to understand the difference in architecture. I don't know much about C++ MPI, but it is basically MPI, whereas Spark is inspired by Hadoop MapReduce and optimised for reading/writing large amounts of data with a smart caching and locality strategy. Intuitively, if you have a high

RE: MLlib-a problem of example code for L-BFGS

2014-06-16 Thread Congrui Yi
Thank you! I'll try it out. From: DB Tsai-2 [via Apache Spark User List] [mailto:ml-node+s1001560n7686...@n3.nabble.com] Sent: Monday, June 16, 2014 11:32 AM To: FIXED-TERM Yi Congrui (CR/RTC1.3-NA) Subject: Re: MLlib-a problem of example code for L-BFGS Hi Congrui, I mean create your own

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-16 Thread Luis Ángel Vicente Sánchez
Did you manage to make it work? I'm facing similar problems and this is a serious blocker issue. spark-submit seems kind of broken to me if you can't use it for spark-streaming. Regards, Luis 2014-06-11 1:48 GMT+01:00 lannyripple lanny.rip...@gmail.com: I am using Spark 1.0.0 compiled with Hadoop

pyspark-Failed to run first

2014-06-16 Thread Congrui Yi
Hi All, I am just trying to compare the Scala and Python APIs on my local machine. I tried to import a local matrix (1000 by 10, created in R) stored in a text file via textFile in pyspark. When I run data.first() it fails to present the line and gives error messages including the following: Then I did

No Intercept for Python

2014-06-16 Thread Naftali Harris
Hi everyone, The Python LogisticRegressionWithSGD does not appear to estimate an intercept. When I run the following, the returned weights and intercept are both 0.0: from pyspark import SparkContext from pyspark.mllib.regression import LabeledPoint from pyspark.mllib.classification import

Re: Master not seeing recovered nodes(Got heartbeat from unregistered worker ....)

2014-06-16 Thread Piotr Kołaczkowski
We are having the same problem. We're running Spark 0.9.1 in standalone mode and on some heavy jobs workers become unresponsive and marked by master as dead, even though the worker process is still running. Then they never join the cluster again and cluster becomes essentially unusable until we

Re: pyspark serializer can't handle functions?

2014-06-16 Thread madeleine
Interesting! I'm curious why you use cloudpickle internally, but then use standard pickle to serialize RDDs? I'd like to create an RDD of functions because (I think) it's the most natural way to express my problem. I have a matrix of functions; I'm trying to find a low rank matrix that minimizes

Can't get Master Kerberos principal for use as renewer

2014-06-16 Thread Finamore A.
Hi, I'm a new user of Spark and I'm trying to integrate it into my cluster. It's a small set of nodes running CDH 4.7 with Kerberos. The other services are fine with the authentication, but I have some trouble with Spark. First, I used the parcel available in Cloudera Manager (SPARK

Set comparison

2014-06-16 Thread SK
Hi, I have a Spark method that returns RDD[String], which I am converting to a set and then comparing it to the expected output as shown in the following code. 1. val expected_res = Set("ID1", "ID2", "ID3") // expected output 2. val result: RDD[String] = getData(input) // method returns RDD[String]

Re: spark master UI does not keep detailed application history

2014-06-16 Thread Andrew Or
Are you referring to accessing a SparkUI for an application that has finished? First you need to enable event logging while the application is still running. In Spark 1.0, you set this by adding a line to $SPARK_HOME/conf/spark-defaults.conf: spark.eventLog.enabled true Other than that, the
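For reference, a minimal spark-defaults.conf fragment for Spark 1.0 event logging might look as follows (the HDFS path is a placeholder; `spark.eventLog.dir` chooses where the logs are written):

```properties
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://namenode/spark-events
```

With logging enabled, the standalone Master (or the history server) can reconstruct the UI for finished applications from these logs.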

Re: Set comparison

2014-06-16 Thread Sean Owen
On Mon, Jun 16, 2014 at 10:09 PM, SK skrishna...@gmail.com wrote: The value returned by the method is almost same as expected output, but the verification is failing. I am not sure why the expected_res in Line 5 does not print the quotes even though Line 1 has them. Could that be the reason

Re: Set comparison

2014-06-16 Thread SK
In Line 1, I had expected_res as a set of strings with quotes. So I thought it would include the quotes during comparison. Anyway, I modified it to expected_res = Set("\"ID1\"", "\"ID2\"", "\"ID3\"") and that seems to work. thanks.

Re: pyspark serializer can't handle functions?

2014-06-16 Thread Matei Zaharia
Ah, I see, interesting. CloudPickle is slower than the cPickle library, so that’s why we didn’t use it for data, but it should be possible to write a Serializer that uses it. Another thing you can do for this use case though is to define a class that represents your functions: class

Re: Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-16 Thread Jianshi Huang
With the help from the Accumulo guys, I probably know why. I'm using the binary distro of Spark and Base64 is from spark-assembly.jar and it probably uses an older version of commons-codec. I'll need to reinstall spark from source. Jianshi On Mon, Jun 16, 2014 at 9:18 PM, Akhil Das

spark with docker: errors with akka, NAT?

2014-06-16 Thread Mohit Jaggi
Hi Folks, I am having trouble getting spark driver running in docker. If I run a pyspark example on my mac it works but the same example on a docker image (Via boot2docker) fails with following logs. I am pointing the spark driver (which is running the example) to a spark cluster (driver is not

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Tom Vacek
Spark gives you four of the classical collectives: broadcast, reduce, scatter, and gather. There are also a few additional primitives, mostly based on a join. Spark is certainly less optimized than MPI for these, but maybe that isn't such a big deal. Spark has one theoretical disadvantage

Spark sql unable to connect to db2 hive metastore

2014-06-16 Thread Jenny Zhao
Hi, my Hive configuration uses DB2 as its metastore database. I have built Spark with the extra step sbt/sbt assembly/assembly to include the dependency jars, and copied HIVE_HOME/conf/hive-site.xml under spark/conf. When I ran: hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") I got

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-16 Thread Gino Bustelo
+1 for this issue. The documentation for spark-submit is misleading. Among many issues, the jar support is bad. HTTP URLs do not work. This is because Spark is using Hadoop's FileSystem class. You have to specify the jars twice to get things to work: once for the DriverWrapper to load your classes

Re: Set comparison

2014-06-16 Thread Ye Xianjin
If you want a string with quotes, you have to escape them with '\'. That's exactly what you did in the modified version. Sent from my iPhone On June 17, 2014, at 5:43, SK skrishna...@gmail.com wrote: In Line 1, I have expected_res as a set of strings with quotes. So I thought it would include the
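The distinction in this thread is between strings and strings that contain literal quote characters. The thread's code is Scala, but the point is language-independent; in Python terms:

```python
# A set of plain strings: the quote marks are syntax, not part of the values.
a = {"ID1", "ID2"}

# A set of strings whose values contain literal double-quote characters,
# i.e. the values are "ID1" and "ID2" including the quotes.
b = {'"ID1"', '"ID2"'}

# The two sets are different, which is why the original comparison failed:
# the RDD's strings carried quote characters, the expected set's did not.
```

In Scala the second form is written with escapes, e.g. `Set("\"ID1\"")`, as in the fix above.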

akka.FrameSize

2014-06-16 Thread Chen Jin
Hi all, I have run into a very interesting bug which is not exactly the same as SPARK-1112. Here is how to reproduce the bug: I have one input csv file and use the partitionBy function to create an RDD, say repartitionedRDD. The partitionBy function takes the number of partitions as a parameter, such
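If the failure is a task result exceeding the Akka frame size (the condition SPARK-1112 concerns), the limit in Spark 0.9/1.0 is controlled by `spark.akka.frameSize`, given in MB with a default of 10. Raising it is a workaround rather than a fix, e.g. in spark-defaults.conf:

```properties
spark.akka.frameSize  64
```

The value must be set consistently on the driver and executors, since both sides enforce the frame limit.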

Re: spark with docker: errors with akka, NAT?

2014-06-16 Thread Andre Schumacher
Hi, are you using the amplab/spark-1.0.0 images from the global registry? Andre On 06/17/2014 01:36 AM, Mohit Jaggi wrote: Hi Folks, I am having trouble getting spark driver running in docker. If I run a pyspark example on my mac it works but the same example on a docker image (Via