Reg: Reading a csv file with a String label into LabeledPoint

2016-03-15 Thread Dharmin Siddesh J
Hi, I am trying to read a CSV with a few double attributes and a String label. How can I convert it to a LabeledPoint RDD so that I can run it with the Spark MLlib classification algorithms? I have tried the LabeledPoint constructor (it lives in the regression package), but it accepts only labels in double format.
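A minimal sketch of that conversion for Spark 1.x MLlib, assuming the string label is the last column of each row (file name and column layout are assumptions):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // assumes rows like "1.0,2.5,3.1,spam" with the string label last
    val raw = sc.textFile("data.csv").map(_.split(","))
    // build a string-label -> double-index lookup on the driver
    val labelIndex = raw.map(_.last).distinct().collect().zipWithIndex
      .map { case (label, idx) => (label, idx.toDouble) }.toMap
    val labeled = raw.map { parts =>
      LabeledPoint(labelIndex(parts.last), Vectors.dense(parts.init.map(_.toDouble)))
    }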

Re: How to add an accumulator for a Set in Spark

2016-03-15 Thread pppsunil
Have you looked at using the Accumulable interface? Take a look at the Spark documentation at http://spark.apache.org/docs/latest/programming-guide.html#accumulators; it gives an example of how to use a vector type for an accumulator, which might be very close to what you need.
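For the Set case specifically, a sketch using the Spark 1.x Accumulable API (the RDD and element type are assumptions):

    import org.apache.spark.AccumulableParam

    object StringSetParam extends AccumulableParam[Set[String], String] {
      def addAccumulator(acc: Set[String], elem: String): Set[String] = acc + elem
      def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
      def zero(initial: Set[String]): Set[String] = Set.empty[String]
    }

    val seen = sc.accumulable(Set.empty[String])(StringSetParam)
    rdd.foreach(x => seen += x)   // merged per task; read on the driver via seen.value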

convert row to map of key as int and values as arrays

2016-03-15 Thread Divya Gehlot
Hi, as I can't add columns from another DataFrame, I am planning to convert my row columns to a map of key and arrays. As I am new to Scala and Spark, I am trying like below // create an empty map import scala.collection.mutable.{ArrayBuffer => mArrayBuffer} var map = Map[Int,mArrayBuffer[Any]]() def
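A sketch of one way to finish that helper (the method name addToMap is hypothetical):

    import scala.collection.mutable.{ArrayBuffer => mArrayBuffer}

    var map = Map[Int, mArrayBuffer[Any]]()
    // append a value to the buffer for a key, creating the buffer on first use
    def addToMap(key: Int, value: Any): Map[Int, mArrayBuffer[Any]] = {
      map.get(key) match {
        case Some(buf) => buf += value
        case None      => map += (key -> mArrayBuffer(value))
      }
      map
    }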

Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-15 Thread Imre Nagi
Hi, I'm just trying to process the data that comes from the Kafka source in my Spark Streaming application. What I want to do is get the pair of topic and message in a tuple from the message stream. Here are my streams: val streams = KafkaUtils.createDirectStream[String, Array[Byte], >
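A sketch of the createDirectStream overload that takes a messageHandler, which exposes the topic for each record (ssc, kafkaParams and the fromOffsets map of type Map[TopicAndPartition, Long] are assumed to be set up already):

    import kafka.message.MessageAndMetadata
    import kafka.serializer.{DefaultDecoder, StringDecoder}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // each element of the stream is a (topic, message) pair
    val streams = KafkaUtils.createDirectStream[
        String, Array[Byte], StringDecoder, DefaultDecoder, (String, Array[Byte])](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, Array[Byte]]) => (mmd.topic, mmd.message))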

Re: Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
Thanks Mark and Jeff On Wed, Mar 16, 2016 at 7:11 AM, Mark Hamstra wrote: > Looks to me like the one remaining Stage would execute 19788 Task if all > of those Tasks succeeded on the first try; but because of retries, 19841 > Tasks were actually executed. Meanwhile,

Re: Job failed while submitting python to yarn programatically

2016-03-15 Thread sychungd
Hi Jeff, sorry, I forgot to mention that the same Java code works fine if we replace the Python pi.py file with the jar version of the pi example.

Re: Job failed while submitting python to yarn programatically

2016-03-15 Thread Saisai Shao
You cannot directly invoke a Spark application by using yarn#client like you mentioned; it is deprecated and not supported. You have to use spark-submit to submit a Spark application to YARN. Also, the specific problem here is that you're invoking yarn#client to run the Spark app as yarn-client

Re: Job failed while submitting python to yarn programatically

2016-03-15 Thread Jeff Zhang
Could you try yarn-cluster mode? Make sure your cluster nodes can reach your client machine and there is no firewall in between. On Wed, Mar 16, 2016 at 10:54 AM, wrote: > > Hi all, > > We're trying to submit a python file, pi.py in this case, to yarn from java > code but this kept

Fwd: Connection failure followed by bad shuffle files during shuffle

2016-03-15 Thread Eric Martin
Hi, I'm running into consistent failures during a shuffle read while trying to do a group-by followed by a count aggregation (using the DataFrame API on Spark 1.5.2). The shuffle read (in stage 1) fails with org.apache.spark.shuffle.FetchFailedException: Failed to send RPC 7719188499899260109

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
It's the same as the Hive thrift server. I believe Kerberos is supported. On Wed, Mar 16, 2016 at 10:48 AM, ayan guha wrote: > so, how about implementing security? Any pointer will be helpful > > On Wed, Mar 16, 2016 at 1:44 PM, Jeff Zhang wrote: > >> The

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
The Spark thrift server allows you to run Hive queries on the Spark engine. It can be used as a JDBC server. On Wed, Mar 16, 2016 at 10:42 AM, ayan guha wrote: > Sorry to be dumb-head today, but what is the purpose of spark thriftserver > then? In other words, should I view spark

Does parallelize and collect preserve the original order of list?

2016-03-15 Thread JoneZhang
Step 1: List items = new ArrayList(); items.addAll(XXX); javaSparkContext.parallelize(items).saveAsTextFile(output); Step 2: final List items2 = ctx.textFile(output).collect(); Do items and items2 have the same order? Best wishes. Thanks.
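A small Scala sketch of the same round trip; parallelize keeps list order, saveAsTextFile writes part files in partition order, and textFile reads them back ordered by file name, so the order is preserved here as long as nothing repartitions in between (the output path is a placeholder):

    val items = (1 to 100).map(_.toString)
    sc.parallelize(items).saveAsTextFile("/tmp/order-check")
    val items2 = sc.textFile("/tmp/order-check").collect()
    assert(items2.toSeq == items)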

PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-03-15 Thread craigiggy
I am having trouble with my standalone Spark cluster and I can't seem to find a solution anywhere. I hope that maybe someone can figure out what is going wrong so this issue might be resolved and I can continue with my work. I am currently attempting to use Python and the pyspark library to do

Re: S3 Zip File Loading Advice

2016-03-15 Thread Benjamin Kim
Hi Xinh, I tried to wrap it, but it still didn't work. I got a "java.util.ConcurrentModificationException". All, I have been trying and trying with some help of a coworker, but it's slow going. I have been able to gather a list of the S3 files I need to download. ### S3 Lists ### import

Re: Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
Okay, so out of 164 stages, why are 163 skipped? And how are 41405 tasks skipped if the total is only 19788? On Wed, Mar 16, 2016 at 6:31 AM, Mark Hamstra wrote: > It's not just if the RDD is explicitly cached, but also if the map outputs > for stages have been

RE: sparkR issues ?

2016-03-15 Thread Sun, Rui
I have submitted https://issues.apache.org/jira/browse/SPARK-13905 and a PR for it. From: Alex Kozlov [mailto:ale...@gmail.com] Sent: Wednesday, March 16, 2016 12:52 AM To: roni Cc: Sun, Rui ; user@spark.apache.org Subject: Re: sparkR issues ? Hi Roni,

Re: Spark UI Completed Jobs

2016-03-15 Thread Mark Hamstra
It's not just if the RDD is explicitly cached, but also if the map outputs for stages have been materialized into shuffle files and are still accessible through the map output tracker. Because of that, explicitly caching RDD actions often gains you little or nothing, since even without a call to

Re: How to add an accumulator for a Set in Spark

2016-03-15 Thread Ted Yu
Please take a look at: core/src/test/scala/org/apache/spark/AccumulatorSuite.scala FYI On Tue, Mar 15, 2016 at 4:29 PM, SRK wrote: > Hi, > > How do I add an accumulator for a Set in Spark? > > Thanks! > > > > -- > View this message in context: >

Re: Streaming app consume multiple kafka topics

2016-03-15 Thread Imre Nagi
Hi Cody, can you give a brief example of how to use mapPartitions with a switch on topic? I've tried, yet it still didn't work. On Tue, Mar 15, 2016 at 9:45 PM, Cody Koeninger wrote: > The direct stream gives you access to the topic. The offset range for > each partition contains

Re: Spark Thriftserver

2016-03-15 Thread Jeff Zhang
spark thrift server is very similar with hive thrift server. You can use hive jdbc driver to access spark thrift server. AFAIK, all the features of hive thrift server are also available in spark thrift server. On Wed, Mar 16, 2016 at 8:39 AM, ayan guha wrote: > Hi All > >
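A minimal sketch of connecting to it from Scala with the plain Hive JDBC driver (host, port, database and table are assumptions):

    import java.sql.DriverManager

    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    val rs = conn.createStatement().executeQuery("SELECT count(*) FROM some_table")
    while (rs.next()) println(rs.getLong(1))
    conn.close()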

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
Right, it is a little confusing here. dropTempTable actually means unregister here. It only deletes the metadata of this table from catalog. But you can still operate this table by using its dataframe. On Wed, Mar 16, 2016 at 8:27 AM, Andy Davidson < a...@santacruzintegration.com> wrote: >

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
that should read anything.sbt Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 16 March 2016 at 00:04,

Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
Hi All, the Spark UI Completed Jobs section shows the information below; what is the skipped value shown for Stages and Tasks? Job_ID | Description | Submitted | Duration | Stages (Succeeded/Total) | Tasks (for all stages): Succeeded/Total 11 count

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
In mvn, the build "mvn package" will look for a file called pom.xml. In sbt, the build "sbt package" will look for a file called anything.smt. It works. Keep it simple. I will write a ksh script that will create both generic and sbt files on the fly in the correct directory (at the top of the tree) and

Re: what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Jeff Zhang
>>> sqlContext.registerDataFrameAsTable(df, "table1") >>> sqlContext.dropTempTable("table1") On Wed, Mar 16, 2016 at 7:40 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > Thanks > > Andy > -- Best Regards Jeff Zhang

what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Andy Davidson
Thanks Andy

Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
The artifactId in maven basically (in a simple case) corresponds to name in sbt. Note however that you will manually need to append the _scalaBinaryVersion to the artifactId in case you would like to build against multiple scala versions (otherwise maven will overwrite the generated jar with the

Re: Error building spark app with Maven

2016-03-15 Thread Ted Yu
Feel free to adjust artifact Id and version in maven. They're under your control. > On Mar 15, 2016, at 4:27 PM, Mich Talebzadeh > wrote: > > ok Ted > > In sbt I have > > name := "ImportCSV" > version := "1.0" > scalaVersion := "2.10.4" > > which ends up in

How to add an accumulator for a Set in Spark

2016-03-15 Thread SRK
Hi, How do I add an accumulator for a Set in Spark? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-an-accumulator-for-a-Set-in-Spark-tp26510.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

what is the best practice to read configure file in spark streaming

2016-03-15 Thread yaoxiaohua
Hi guys, I'm using Kafka + Spark Streaming to do log analysis. Now my requirement is that the log alarm rules may change sometimes. Rules may be like this: App=Hadoop,keywords=oom|Exception|error,threshold=10 The

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
ok Ted. In sbt I have name := "ImportCSV" version := "1.0" scalaVersion := "2.10.4" which ends up in importcsv_2.10-1.0.jar, as target/scala-2.10/importcsv_2.10-1.0.jar. In mvn I have 1.0 scala Does it matter? Dr Mich Talebzadeh LinkedIn *

Re: Error building spark app with Maven

2016-03-15 Thread Ted Yu
1.0 ... scala On Tue, Mar 15, 2016 at 4:14 PM, Mich Talebzadeh wrote: > An observation > > Once compiled with MVN the job submit works as follows: > > + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages > com.databricks:spark-csv_2.11:1.3.0 --class

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
An observation Once compiled with MVN the job submit works as follows: + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master spark:// 50.140.197.217:7077 --executor-memory=12G --executor-cores=12 --num-executors=2

Re: Get output of the ALS algorithm.

2016-03-15 Thread Bryan Cutler
Jacek is correct for using org.apache.spark.ml.recommendation.ALSModel If you are trying to save org.apache.spark.mllib.recommendation.MatrixFactorizationModel, then it is similar, but just a little different, see the example here

spark.ml : eval model outside sparkContext

2016-03-15 Thread Emmanuel
Hello, in MLlib with Spark 1.4, I was able to eval a model by loading it and using `predict` on a vector of features. I would train on Spark but use my model in my workflow. In `spark.ml` it seems like the only way to eval is to use `transform`, which only takes a DataFrame. To build a DataFrame
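One workaround sketch: wrap the single feature vector in a one-row DataFrame and call transform on it (the "features" column name and the loaded model are assumptions):

    import org.apache.spark.mllib.linalg.Vectors

    val single = sqlContext.createDataFrame(Seq(Tuple1(Vectors.dense(0.5, 1.2, 3.4))))
      .toDF("features")
    val scored = model.transform(single)
    scored.select("prediction").show()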

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
Many thanks Ted, and thanks for the heads up Jakob. Just these two changes to dependencies: org.apache.spark spark-core_2.10 1.5.1 org.apache.spark spark-sql_2.10 1.5.1 [DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0 [INFO]

Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
Hi Mich, probably unrelated to the current error you're seeing, however the following dependencies will bite you later: spark-hive_2.10 spark-csv_2.11 the problem here is that you're using libraries built for different Scala binary versions (the numbers after the underscore). The simple fix here
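A build.sbt sketch of the fix Jakob describes, for reference: with %% sbt appends the matching Scala binary suffix automatically, so every artifact stays on the same _2.10 line (versions below are the ones from this thread):

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.1",
      "org.apache.spark" %% "spark-sql"  % "1.5.1",
      "org.apache.spark" %% "spark-hive" % "1.5.1",
      "com.databricks"   %% "spark-csv"  % "1.3.0"
    )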

Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
Hi, I normally use sbt and using this sbt file works fine for me cat ImportCSV.sbt name := "ImportCSV" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"

Re: Microsoft SQL dialect issues

2016-03-15 Thread Suresh Thalamati
You should be able to register your own dialect if the default mappings are not working for your scenario. import org.apache.spark.sql.jdbc JdbcDialects.registerDialect(MyDialect) Please refer to JdbcDialects to find an example of an existing default dialect for your database or another
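A sketch of what such a registered dialect can look like (the type mappings are assumptions; adjust them to whatever the default dialect gets wrong for SQL Server):

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
    import org.apache.spark.sql.types._

    object MyMSSQLDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")
      override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        case BooleanType => Some(JdbcType("BIT", java.sql.Types.BIT))
        case StringType  => Some(JdbcType("NVARCHAR(MAX)", java.sql.Types.NVARCHAR))
        case _           => None
      }
    }

    JdbcDialects.registerDialect(MyMSSQLDialect)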

Re: How to convert Parquet file to a text file.

2016-03-15 Thread Kevin Mellott
I'd recommend reading the parquet file into a DataFrame object, and then using spark-csv to write to a CSV file. On Tue, Mar 15, 2016 at 3:34 PM, Shishir Anshuman wrote: > I need to convert the parquet file generated by the
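A minimal sketch of that, assuming spark-csv is on the classpath (paths are placeholders):

    val df = sqlContext.read.parquet("hdfs:///path/to/input.parquet")
    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("hdfs:///path/to/output_csv")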

Re: newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
Hi Frank, we have thousands of small files. Each file is between 6K and maybe 100K. Conductor looks interesting. Andy From: Frank Austin Nothaft Date: Tuesday, March 15, 2016 at 11:59 AM To: Andrew Davidson Cc: "user @spark"

Re: Spark and KafkaUtils

2016-03-15 Thread Vinti Maheshwari
Hi Cody, I wanted to update my build.sbt which was working with kafka without giving any error, it may help other user if they face similar issue. name := "NetworkStreaming" version := "1.0" scalaVersion:= "2.10.5" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-streaming-kafka" %

How to convert Parquet file to a text file.

2016-03-15 Thread Shishir Anshuman
I need to convert the parquet file generated by the spark to a text (csv preferably) file. I want to use the data model outside spark. Any suggestion on how to proceed?

Re: Docker configuration for akka spark streaming

2016-03-15 Thread David Gomez Saavedra
The issue is related to this https://issues.apache.org/jira/browse/SPARK-13906 .set("spark.rpc.netty.dispatcher.numThreads","2") seem to fix the problem On Tue, Mar 15, 2016 at 6:45 AM, David Gomez Saavedra wrote: > I have updated the config since I realized the actor

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manas kar
You are quite right. I am getting this error while profiling my module to see what the minimum resources are that I can use to achieve my SLA. My point is that if a resource constraint creates this problem, then this issue is just waiting to happen in a larger scenario (though the probability of happening

Re: Microsoft SQL dialect issues

2016-03-15 Thread Mich Talebzadeh
Hi, can you please clarify what you are trying to achieve? I guess you mean Transact-SQL for MSSQL? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Microsoft SQL dialect issues

2016-03-15 Thread Andrés Ivaldi
Hello, I'm trying to use MSSQL, storing data on MSSQL, but I'm having dialect problems. I found this https://mail-archives.apache.org/mod_mbox/spark-issues/201510.mbox/%3cjira.12901078.1443461051000.34556.1444123886...@atlassian.jira%3E That is what is happening to me. Is it possible to define the

How to select from table name using IF(condition, tableA, tableB)?

2016-03-15 Thread Rex X
I want to do a query based on a logic condition to query between two tables. select * from if(A>B, tableA, tableB) But "if" function in Hive cannot work within FROM above. Any idea how?
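From the Spark side, one sketch is to resolve the condition on the driver first, since FROM cannot take an IF() (a and b stand in for however A and B are computed):

    val tableName = if (a > b) "tableA" else "tableB"
    val df = sqlContext.table(tableName)
    // or: sqlContext.sql(s"SELECT * FROM $tableName")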

Re: newbie HDFS S3 best practices

2016-03-15 Thread Frank Austin Nothaft
Hard to say with #1 without knowing your application’s characteristics; for #2, we use conductor with IAM roles, .boto/.aws/credentials files. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 > On Mar 15, 2016, at

Re: bug? using withColumn with colName with dot can't replace column

2016-03-15 Thread Jan Štěrba
First off, I would advise against having dots in column names; that's just playing with fire. Second, the exception is really strange since Spark is complaining about a completely unrelated column. I would like to see the df schema before the exception was thrown. -- Jan Sterba

newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
We use the spark-ec2 script to create AWS clusters as needed (we do not use AWS EMR) 1. will we get better performance if we copy data to HDFS before we run instead of reading directly from S3? 2. What is a good way to move results from HDFS to S3? It seems like there are many ways to bulk copy

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
The data in the title is different, so correcting the data in the column requires finding out what the correct data is and then replacing it. Finding the correct data could be tedious, but if some mechanism is in place which can help to group the partially matched data, then it might help to do the

RE: [MARKETING] Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Iain Cundy
Hi Manas I saw a very similar problem while using mapWithState. Timeout on BlockManager remove leading to a stall. In my case it only occurred when there was a big backlog of micro-batches, combined with a shortage of memory. The adding and removing of blocks between new and old tasks was

Best way to process values for key in sorted order

2016-03-15 Thread James Hammerton
Hi, I need to process some events in a specific order based on a timestamp, for each user in my data. I had implemented this by using the dataframe sort method to sort by user id and then sort by the timestamp secondarily, then do a groupBy().mapValues() to process the events for each user.

Re: Parition RDD by key to create DataFrames

2016-03-15 Thread Davies Liu
I think you could create a DataFrame with schema (mykey, value1, value2), then partition it by mykey when saving as parquet. r2 = rdd.map((k, v) => Row(k, v._1, v._2)) df = sqlContext.createDataFrame(r2, schema) df.write.partitionBy("myKey").parquet(path) On Tue, Mar 15, 2016 at 10:33 AM,
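A slightly fuller Scala sketch of the same idea (column names, rdd and path are assumptions):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("mykey", StringType),
      StructField("v1", StringType),
      StructField("v2", StringType)))
    val rows = rdd.map { case (k, (v1, v2)) => Row(k, v1, v2) }
    val df = sqlContext.createDataFrame(rows, schema)
    // one Parquet directory per distinct mykey value
    df.write.partitionBy("mykey").parquet(path)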

bug? using withColumn with colName with dot can't replace column

2016-03-15 Thread Emmanuel
In Spark 1.6 if I do (column name has dot in it, but is not a nested column):
df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`"))
scala> df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`"))
org.apache.spark.sql.AnalysisException: cannot resolve 'raw.minOfDay' given input

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
Is it always the case that one title is a substring of another ? -- Not always. One title can have values like D.O.C, doctor_{areacode}, doc_{dep,areacode} On Mon, Mar 14, 2016 at 10:39 PM, Wail Alkowaileet wrote: > I think you need some sort of fuzzy join ? > Is it always

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Sea
Hi manas, maybe you can look at this bug: https://issues.apache.org/jira/browse/SPARK-13566

Parition RDD by key to create DataFrames

2016-03-15 Thread Mohamed Nadjib MAMI
Hi, I have a pair RDD of the form: (mykey, (value1, value2)) How can I create a DataFrame having the schema [V1 String, V2 String] to store [value1, value2] and save it into a Parquet table named "mykey"? The createDataFrame() method takes an RDD and a schema (StructType) as parameters. The

Questions about Spark On Mesos

2016-03-15 Thread Shuai Lin
Hi list, We (scrapinghub) are planning to deploy spark in a 10+ node cluster, mainly for processing data in HDFS and kafka streaming. We are thinking of using mesos instead of yarn as the cluster resource manager so we can use docker container as the executor and makes deployment easier. But

Re: create hive context in spark application

2016-03-15 Thread Antonio Si
Thanks Akhil. Yes, spark-shell works fine. In my app, I have a Restful service and from the Restful service, I am calling the spark-api to do some hiveql. That's why I am not using spark-submit. Thanks. Antonio. On Tue, Mar 15, 2016 at 12:02 AM, Akhil Das wrote:

Re: sparkR issues ?

2016-03-15 Thread Alex Kozlov
Hi Roni, you can probably rename the as.data.frame in $SPARK_HOME/R/pkg/R/DataFrame.R and re-install SparkR by running install-dev.sh On Tue, Mar 15, 2016 at 8:46 AM, roni wrote: > Hi , > Is there a work around for this? > Do i need to file a bug for this? > Thanks > -R

Re: Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi, this is an interesting point of view. I thought the HashPartitioner works completely differently. Here's my understanding: the HashPartitioner defines how keys are distributed within a dataset between the different partitions, but plays no role in assigning each partition for processing by

Release Announcement: XGBoost4J - Portable Distributed XGBoost in Spark, Flink and Dataflow

2016-03-15 Thread Nan Zhu
Dear Spark Users and Developers, We (Distributed (Deep) Machine Learning Community (http://dmlc.ml/)) are happy to announce the release of XGBoost4J (http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html), a Portable Distributed XGBoost in Spark,

Re: sparkR issues ?

2016-03-15 Thread roni
Hi, is there a workaround for this? Do I need to file a bug for this? Thanks -R On Tue, Mar 15, 2016 at 12:28 AM, Sun, Rui wrote: > It seems as.data.frame() defined in SparkR convers the versions in R base > package. > > We can try to see if we can change the

Re: sparkR issues ?

2016-03-15 Thread roni
Alex, no, I have not defined the "dataframe"; it's the default Spark DataFrame. That line is just casting Factor as a dataframe to send to the function. Thanks -R On Mon, Mar 14, 2016 at 11:58 PM, Alex Kozlov wrote: > This seems to be a very unfortunate name collision. SparkR

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Once again, please use roles; there is no situation in which you have to specify the access keys in the URI. Please read the Amazon documentation and they will say the same. The only situation when you use the access keys in the URI is when you have not read the Amazon documentation :) Regards,

Re: Spark work distribution among execs

2016-03-15 Thread manasdebashiskar
Your input is skewed in terms of the default hash partitioner that is used. Your options are to use a custom partitioner that can re-distribute the data evenly among your executors. I think you will see the same behaviour when you use more executors. It is just that the data skew appears to be

Re: reading file from S3

2016-03-15 Thread Sabarish Sasidharan
There are many solutions to a problem. Also understand that sometimes your situation might be such. For example, what if you are accessing S3 from your Spark job running in your continuous-integration server sitting in your data center, or maybe a box under your desk. And sometimes you are just trying

Re: how to set log level of spark executor on YARN(using yarn-cluster mode)

2016-03-15 Thread jkukul
Hi Eric (or rather: anyone who's experiencing a similar situation), I think your problem was that the --files parameter was provided after the application jar. Your command should have looked like this instead: ./bin/spark-submit --class edu.bjut.spark.SparkPageRank --master yarn-cluster

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Oh!!! What the hell. Please never use the URI s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY. That is a major cause of pain, security issues, and code maintenance issues, and of course something that Amazon strongly suggests that we do not use. Please use roles and you will not have to worry about

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manas kar
I am using spark 1.6. I am not using any broadcast variable. This broadcast variable is probably used by the state management of mapwithState ...Manas On Tue, Mar 15, 2016 at 10:40 AM, Ted Yu wrote: > Which version of Spark are you using ? > > Can you show the code snippet

Re: Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi, yes, I'm running the executors with 8 cores each. I have also properly configured executor memory, driver memory, num execs and so on in the submit cmd. I'm a long-time Spark user, so please let's skip the dummy cmd configuration stuff and dive into the interesting stuff :) Another strange thing I've

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Ted Yu
Which version of Spark are you using ? Can you show the code snippet w.r.t. broadcast variable ? Thanks On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar wrote: > Hi, > I have a streaming application that takes data from a kafka topic and uses > mapwithstate. > After

Re: reading file from S3

2016-03-15 Thread Sabarish Sasidharan
You have a slash before the bucket name. It should be @. Regards Sab On 15-Mar-2016 4:03 pm, "Yasemin Kaya" wrote: > Hi, > > I am using Spark 1.6.0 standalone and I want to read a txt file from S3 > bucket named yasemindeneme and my file name is deneme.txt. But I am getting >

Re: Spark work distribution among execs

2016-03-15 Thread Chitturi Padma
By default spark uses 2 executors with one core each, have you allocated more executors using the command line args as - --num-executors 25 --executor-cores x ??? What do you mean by the difference between the nodes is huge ? Regards, Padma Ch On Tue, Mar 15, 2016 at 6:57 PM, bkapukaranov [via

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Hi, try starting your clusters with roles, and you will not have to configure or hard-code anything at all. Let me know in case you need any help with this. Regards, Gourav Sengupta On Tue, Mar 15, 2016 at 11:32 AM, Yasemin Kaya wrote: > Hi Safak, > > I changed the Keys but

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Thanks. The Maven structure is identical to sbt; I will just have to replace the sbt file with pom.xml. I will use your pom.xml to start with. Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
Yes, sbt uses the same structure as maven for source files. > On Mar 15, 2016, at 1:53 PM, Mich Talebzadeh > wrote: > > Thanks the maven structure is identical to sbt. just sbt file I will have to > replace with pom.xml > > I will use your pom.xml to start with it.

Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi, I'm running a Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster. I observe a very strange issue. I run a simple job that reads about 1TB of json logs from a remote HDFS cluster and converts them to parquet, then saves them to the local HDFS of the Hadoop cluster. I run it with 25 executors

Re: Can we use spark inside a web service?

2016-03-15 Thread Andrés Ivaldi
Thanks Evan for the points. I had supposed what you said, but as I don't have enough experience maybe I was missing something, thanks for the answer!! On Mon, Mar 14, 2016 at 7:22 PM, Evan Chan wrote: > Andres, > > A couple points: > > 1) If you look at my post, you can

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
You can build using Maven from the command line as well. This layout should give you an idea, and here are some resources - http://www.scala-lang.org/old/node/345
project/
  pom.xml - Defines the project
  src/
    main/
      java/ - Contains

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Sounds like the layout is basically the same as the sbt layout, with the sbt file replaced by pom.xml? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Installing Spark on Mac

2016-03-15 Thread Aida Tefera
Hi Jakob, sorry for my late reply. I tried to run the below; it came back with "netstat: lunt: unknown or uninstrumented protocol". I also tried uninstalling version 1.6.0 and installing version 1.5.2 with Java 7 and Scala version 2.10.6; got the same error messages. Do you think it would be worth me

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Thanks again. Is there any way one can set this up without Eclipse, much like what I did with sbt? I need to know the directory structure for an MVN project. Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manasdebashiskar
Hi, I have a streaming application that takes data from a kafka topic and uses mapwithstate. After couple of hours of smooth running of the application I see a problem that seems to have stalled my application. The batch seems to have been stuck after the following error popped up. Has anyone

Improvement of the cube with the Fast Cubing in Apache Kylin

2016-03-15 Thread licl
Hi, I tried to build a cube on a 100-million-row data set. When I set 9 fields to build the cube with 10 cores, it cost me nearly a whole day to finish the job. At the same time, it generated almost 1 TB of data in the "/tmp" folder. Could we refer to the "fast cube" algorithm in Apache Kylin to make

Re: Compress individual RDD

2016-03-15 Thread Nirav Patel
Thanks Sabarish, I thought of the same; will try that. Hi Ted, good question. I guess one way is to have an API like `rdd.persist(storageLevel, compress)` where 'compress' can be true or false. On Tue, Mar 15, 2016 at 5:18 PM, Sabarish Sasidharan wrote: > It will compress

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
Do you have the Eclipse Maven plugin setup? http://www.eclipse.org/m2e/ Once you have it setup, File -> New -> Other -> MavenProject -> Next / Finish. You’ll see a default POM.xml which you can modify / replace. Here is some documentation that should help:

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Great, Chandeep. I also have the Eclipse Scala IDE below: Scala IDE build of Eclipse SDK, Build id: 4.3.0-vfinal-2015-12-01T15:55:22Z-Typesafe. I am no expert on Eclipse, so if I create a project called ImportCSV, where do I need to put the pom file, or how do I reference it, please? My Eclipse runs on a

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
Btw, just to add to the confusion ;) I use Maven as well since I moved from Java to Scala, but everyone I talk to has been recommending SBT for Scala. I use the Eclipse Scala IDE to build. http://scala-ide.org/ Here is my sample POM. You can add dependencies based on

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Ok. Sounds like opinion is divided :) I will try to build a Scala app with Maven. When I build with SBT I follow this directory structure: the high-level directory is the package name, like ImportCSV; under ImportCSV I have a directory src and the sbt file ImportCSV.sbt; in directory src I have main

Re: Compress individual RDD

2016-03-15 Thread Sabarish Sasidharan
It will compress only rdds with serialization enabled in the persistence mode. So you could skip _SER modes for your other rdds. Not perfect but something. On 15-Mar-2016 4:33 pm, "Nirav Patel" wrote: > Hi, > > I see that there's following spark config to compress an RDD.
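A sketch of that per-RDD choice (RDD names are placeholders): with spark.rdd.compress=true, only the serialized storage levels are compressed.

    import org.apache.spark.storage.StorageLevel

    bigRdd.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized, so compressed
    hotRdd.persist(StorageLevel.MEMORY_ONLY)       // deserialized, left uncompressed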

Re: Hive Query on Spark fails with OOM

2016-03-15 Thread Sabarish Sasidharan
Yes, I suggested increasing shuffle partitions to address this problem. The other suggestion to increase shuffle fraction was not for this but makes sense given that you are reserving all that memory and doing nothing with it. By diverting more of it for shuffles you can help improve your shuffle

Re: reading file from S3

2016-03-15 Thread Yasemin Kaya
Hi Safak, I changed the Keys but there is no change. Best, yasemin 2016-03-15 12:46 GMT+02:00 Şafak Serdar Kapçı : > Hello Yasemin, > Maybe your key id or access key has special chars like backslash or > something. You need to change it. > Best Regards, > Safak. > >

Re: Compress individual RDD

2016-03-15 Thread Ted Yu
Looks like there is no such capability yet. How would you specify which rdd's to compress ? Thanks > On Mar 15, 2016, at 4:03 AM, Nirav Patel wrote: > > Hi, > > I see that there's following spark config to compress an RDD. My guess is it > will compress all RDDs of

Re: Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Ted Yu
I did a quick search but haven't found JIRA in this regard. If configuration is separate from checkpoint data, more use cases can be accommodated. > On Mar 15, 2016, at 2:21 AM, Saisai Shao wrote: > > Currently configuration is a part of checkpoint data, and when

Compress individual RDD

2016-03-15 Thread Nirav Patel
Hi, I see that there's following spark config to compress an RDD. My guess is it will compress all RDDs of a given SparkContext, right? If so, is there a way to instruct spark context to only compress some rdd and leave others uncompressed ? Thanks spark.rdd.compress false Whether to compress
