Re: quick question

2016-08-25 Thread kant kodali
@Sivakumaran when you say create a web socket object in your spark code, I assume you mean a Spark "task" opening a websocket connection from one of the worker machines to some node.js server. In that case the websocket connection terminates after the Spark task is completed, right? And when new data

Re: Best way to calculate intermediate column statistics

2016-08-25 Thread Richard Siebeling
Hi Mich, thanks for the suggestion, I hadn't thought of that. We'll need to gather the statistics in two ways: incrementally when new data arrives, and over the complete set when aggregating or filtering (because I think it's difficult to gather statistics while aggregating or filtering). The

Re: Sqoop vs spark jdbc

2016-08-25 Thread Bhaskar Dutta
Which RDBMS are you using here, and what is the data volume and frequency of pulling data off the RDBMS? Specifying these would help in giving better answers. Sqoop has a direct mode (non-JDBC) support for Postgres, MySQL and Oracle, so you can use that for better performance if using one of

Re: Sqoop vs spark jdbc

2016-08-25 Thread Sean Owen
Sqoop is probably the more mature tool for the job. It also just does one thing. The argument for doing it in Spark would be wanting to integrate it with a larger workflow. I imagine Sqoop would be more efficient and flexible for just the task of ingest, including continuously pulling deltas which

Re: Sqoop vs spark jdbc

2016-08-25 Thread Bhaskar Dutta
This constant was added in Hadoop 2.3. Maybe you are using an older version? ~bhaskar On Thu, Aug 25, 2016 at 3:04 PM, Mich Talebzadeh wrote: > Actually I started using Spark to import data from RDBMS (in this case > Oracle) after upgrading to Hive 2, running an

Re: Sqoop vs spark jdbc

2016-08-25 Thread Mich Talebzadeh
Actually I started using Spark to import data from RDBMS (in this case Oracle) after upgrading to Hive 2, running an import like the one below:
  sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12" --username scratchpad -P \
  --query "select * from scratchpad.dummy2 where \
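
For comparison, a Spark-side read of the same table could look roughly like the sketch below (the connection details follow the sqoop command above; the partition column "object_id" and the bounds are illustrative assumptions, not from the thread):

  val df = sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@rhes564:1521:mydb12")
    .option("dbtable", "(select * from scratchpad.dummy2) t")   // subquery aliased as a table
    .option("user", "scratchpad")
    .option("password", "***")
    .option("partitionColumn", "object_id")   // hypothetical numeric column for parallel reads
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "4")
    .load()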

Is there anyway Spark UI is set to poll and refreshes itself

2016-08-25 Thread Mich Talebzadeh
Hi, This may be already there. A Spark job opens up a UI on the port specified by --conf "spark.ui.port=${SP}", which defaults to 4040. However, on the UI one needs to refresh the page to see the progress. Can this be polled so it is refreshed automatically? Thanks, Dr Mich Talebzadeh LinkedIn

Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-25 Thread Marek Wiewiorka
Hi you can take a look at: https://github.com/hammerlab/spree it's a bit outdated but maybe it's still possible to use with some more recent Spark version. M. 2016-08-25 11:55 GMT+02:00 Mich Talebzadeh : > Hi, > > This may be already there. > > A spark job opens up a

Re: quick question

2016-08-25 Thread kant kodali
yes for now it will be Spark Streaming Job but later it may change. On Thu, Aug 25, 2016 2:37 AM, Sivakumaran S siva.kuma...@me.com wrote: Is this a Spark Streaming job? Regards, Sivakumaran S @Sivakumaran when you say create a web socket object in your spark code I assume you meant a spark

Re: quick question

2016-08-25 Thread Sivakumaran S
Is this a Spark Streaming job? Regards, Sivakumaran S > @Sivakumaran when you say create a web socket object in your spark code I > assume you meant a spark "task" opening websocket > connection from one of the worker machines to some node.js server in that > case the websocket connection

Re: Best way to calculate intermediate column statistics

2016-08-25 Thread Mich Talebzadeh
Hi Richard, Windowing/Analytics for stats are pretty simple. Example import org.apache.spark.sql.expressions.Window val wSpec = Window.partitionBy('transactiontype).orderBy(desc("transactiondate"))
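
A minimal sketch extending the window spec in the snippet above (the value column "amount" is an assumption, not named in the thread):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  val wSpec = Window.partitionBy(col("transactiontype")).orderBy(desc("transactiondate"))

  // Per-type running statistics, computed without collapsing the rows.
  val withStats = df
    .withColumn("running_avg", avg(col("amount")).over(wSpec))
    .withColumn("running_max", max(col("amount")).over(wSpec))
    .withColumn("recency_rank", rank().over(wSpec))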

Re: Sqoop vs spark jdbc

2016-08-25 Thread Mich Talebzadeh
Hi, I am using Hadoop 2.6:
  hduser@rhes564: /home/hduser/dba/bin> hadoop version
  Hadoop 2.6.0
Thanks Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

UDF on lpad

2016-08-25 Thread Mich Talebzadeh
Hi, This UDF on substring works scala> val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) } SubstrUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(,StringType,Some(List(StringType, IntegerType, IntegerType))) I want something
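
If the goal is plain left-padding of a column, the built-in lpad in org.apache.spark.sql.functions may already cover it without a UDF; a minimal sketch (the column name "figure" is assumed):

  import org.apache.spark.sql.functions.{col, lpad}

  // Left-pad the string form of the column to 10 characters with "0".
  val padded = df.withColumn("padded", lpad(col("figure").cast("string"), 10, "0"))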

Kafka message metadata with Dstreams

2016-08-25 Thread Pradeep
Hi All, I am using Dstreams to read Kafka topics. While I can read the messages fine, I also want to get metadata on the message such as offset, time it was put to topic etc.. Is there any Java Api to get this info. Thanks, Pradeep

Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Takeshi Yamamuro
Hi, Seems this just prevents writers from leaving partial data in a destination dir when jobs fail. In previous versions of Spark there was a way to write data directly to a destination, though; Spark v2.0+ has no way to do that because of the critical issue on S3 (See: SPARK-10063). //

Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-25 Thread Felix Cheung
Hmm, <<- wouldn't work in cluster mode. Are you running Spark in local mode? In any case, I tried running your earlier code and it worked for me on a 250MB csv: scoreModel <- function(parameters){ library(data.table) # I assume this should be data.table dat <-

Latest Release of Receiver based Kafka Consumer for Spark Streaming.

2016-08-25 Thread Dibyendu Bhattacharya
Hi, Released the latest version of the Receiver based Kafka Consumer for Spark Streaming. The receiver is compatible with Kafka versions 0.8.x, 0.9.x and 0.10.x and all Spark versions. Available at Spark Packages : https://spark-packages.org/package/dibbhatt/kafka-spark-consumer Also at github :

Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Tal Grynbaum
Is/was there an option similar to DirectParquetOutputCommitter to write json files to S3 ? On Thu, Aug 25, 2016 at 2:56 PM, Takeshi Yamamuro wrote: > Hi, > > Seems this just prevents writers from leaving partial data in a > destination dir when jobs fail. > In the

Re: quick question

2016-08-25 Thread Sivakumaran S
I am assuming that you are doing some calculations over a time window. At the end of the calculations (using RDDs or SQL), once you have collected the data back to the driver program, you format the data in the way your client (dashboard) requires it and write it to the websocket. Is your

Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Takeshi Yamamuro
afaik no. // maropu On Thu, Aug 25, 2016 at 9:16 PM, Tal Grynbaum wrote: > Is/was there an option similar to DirectParquetOutputCommitter to write > json files to S3 ? > > On Thu, Aug 25, 2016 at 2:56 PM, Takeshi Yamamuro > wrote: > >> Hi, >> >>

Re: namespace quota not take effect

2016-08-25 Thread Ted Yu
This question should have been posted to user@ Looks like you were using wrong config. See: http://hbase.apache.org/book.html#quota See 'Setting Namespace Quotas' section further down. Cheers On Tue, Aug 23, 2016 at 11:38 PM, W.H wrote: > hi guys > I am testing the hbase

RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-25 Thread Cinquegrana, Piero
I tested both in local and cluster mode and the '<<-' seemed to work at least for small data. Or am I missing something? Is there a way for me to test? If that does not work, can I use something like this? sc <- SparkR:::getSparkContext() bcStack <- SparkR:::broadcast(sc,stack) I realized that

Re: Latest Release of Receiver based Kafka Consumer for Spark Streaming.

2016-08-25 Thread mdkhajaasmath
Hi Dibyendu, Looks like it is available for 2.0; we are using an older version of Spark, 1.5. Could you please let me know how to use this with older versions. Thanks, Asmath Sent from my iPhone > On Aug 25, 2016, at 6:33 AM, Dibyendu Bhattacharya > wrote: > >

Re: Latest Release of Receiver based Kafka Consumer for Spark Streaming.

2016-08-25 Thread Dibyendu Bhattacharya
Hi, This package is not dependent on any specific Spark release and can be used with 1.5. Please refer to the "How To" section here : https://spark-packages.org/package/dibbhatt/kafka-spark-consumer Also you will find more information in the readme file on how to use this package. Regards, Dibyendu On

Re: UDF on lpad

2016-08-25 Thread Mich Talebzadeh
Ok I tried this def padString(s: String, chars: String, length: Int): String = | (0 until length).map(_ => chars(Random.nextInt(chars.length))).mkString + s padString: (s: String, chars: String, length: Int)String And use it like below: Example left pad the figure 12345.87 with 10

Re: Kafka message metadata with Dstreams

2016-08-25 Thread Cody Koeninger
http://spark.apache.org/docs/latest/api/java/index.html messageHandler receives a kafka MessageAndMetadata object. Alternatively, if you just need metadata information on a per-partition basis, you can use HasOffsetRanges
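
A minimal sketch of the per-partition approach with the 0.8 direct stream API referenced above (note that record timestamps are not part of the 0.8 metadata; those require the Kafka 0.10 integration):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)   // kafkaParams: Map[String, String], topics: Set[String]

  stream.foreachRDD { rdd =>
    // Offset metadata is exposed by the underlying KafkaRDD.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsetRanges.foreach { o =>
      println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
    }
  }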

Pyspark SQL 1.6.0 write problem

2016-08-25 Thread Ethan Aubin
Hi, I'm having problems writing dataframes with pyspark 1.6.0. If I create a small dataframe like: sqlContext.createDataFrame(pandas.DataFrame.from_dict([{'x': 1}])).write.orc('test-orc') Only the _SUCCESS file in the output directory is written. The executor log shows the saved output of

Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Steve Loughran
With Hadoop 2.7 or later, set spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true This switches to a no-rename version of the file output committer, which is faster all round. You are still at risk of things going wrong
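
A sketch of one way to pass those settings (spark-submit --conf works equally well); the "spark.hadoop." prefix is stripped by Spark before the keys are handed to the Hadoop configuration:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "true")
    .getOrCreate()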

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mich Talebzadeh
You can use Spark on Oracle as a query tool. It all depends on the mode of the operation. If you are running Spark in yarn-client/cluster mode then you will need YARN. It comes as part of Hadoop core (HDFS, MapReduce and YARN). I have not gone and installed YARN without installing Hadoop. What is

RE: How to output RDD to one file with specific name?

2016-08-25 Thread Stefan Panayotov
You can do something like: dbutils.fs.cp("/foo/part-0","/foo/my-data.csv") Stefan Panayotov, PhD spanayo...@outlook.com spanayo...@comcast.net Cell: 610-517-5586 Home: 610-355-0919 From: Gavin Yue
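
dbutils.fs.cp is Databricks-specific; a more portable sketch uses the Hadoop FileSystem API from the driver (the part file name below is an assumption — check what saveAsTextFile actually produced, and "sc" is assumed to be your SparkContext):

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(sc.hadoopConfiguration)
  // Rename the single part file to the desired name.
  fs.rename(new Path("/foo/part-00000"), new Path("/foo/my-data.csv"))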

Perform an ALS with TF-IDF output (spark 2.0)

2016-08-25 Thread Pasquinell Urbani
Hi there, I am building a product recommendation system for retail. I have been able to compute the TF-IDF of a user-item data frame in Spark 2.0. Now I need to transform the TF-IDF output into a data frame with columns (user_id, item_id, TF_IDF_ratings) in order to perform an ALS. But I have no

Re: UDF on lpad

2016-08-25 Thread Mich Talebzadeh
Thanks Mike. Can one turn the first example into a generic UDF similar to the output from below where 10 "0" are padded to the left of 123 def padString(id: Int, chars: String, length: Int): String = (0 until length).map(_ => chars(Random.nextInt(chars.length))).mkString + id.toString

Re: UDF on lpad

2016-08-25 Thread Mike Metzger
Are you trying to always add x number of digits / characters or are you trying to pad to a specific length? If the latter, try using format strings: // Pad to 10 characters with 0's val c = 123 f"$c%010d" // Result is 0000000123 // Pad to 10 total characters with 0's val c = 123.87 f"$c%010.2f" //

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Peter Figliozzi
Spark is a parallel computing framework. There are many ways to give it data to chomp down on. If you don't know why you would need HDFS, then you don't need it. Same goes for Zookeeper. Spark works fine without either. Much of what we read online comes from people with specialized problems

Caching broadcasted DataFrames?

2016-08-25 Thread Jestin Ma
I have a DataFrame d1 that I would like to join with two separate DataFrames. Since d1 is small enough, I broadcast it. What I understand about cache vs broadcast is that cache leads to each executor storing the partitions its assigned in memory (cluster-wide in-memory). Broadcast leads to each

How to output RDD to one file with specific name?

2016-08-25 Thread Gavin Yue
I am trying to output an RDD to disk with rdd.coalesce(1).saveAsTextFile("/foo") It outputs to the foo folder with a file named Part-0. Is there a way I could directly save the file as /foo/somename ? Thanks.

Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-25 Thread Felix Cheung
The reason your second example works is because of a closure capture behavior. It should be ok for a small amount of data. You could also use SparkR:::broadcast but please keep in mind that is not public API we actively support. Thank you for the information on formula - I will test that out.

SparkStreaming + Flume: org.jboss.netty.channel.ChannelException: Failed to bind to: master60/10.0.10.60:31001

2016-08-25 Thread luohui20001
Hi there, I have a flume cluster sending messages to SparkStreaming. I got an exception like below: 16/08/25 23:00:54 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - org.jboss.netty.channel.ChannelException: Failed to bind to: master60/10.0.10.60:31001

Re: Incremental Updates and custom SQL via JDBC

2016-08-25 Thread Mich Talebzadeh
As far as I can tell Spark does not support updates to ORC tables. This is because Spark needs to send heartbeats to the Hive metastore and maintain them throughout the DML transaction (deletes and updates here), and that is not implemented. By the same token, if you have performed DML on an ORC table in

Re: Please assist: Building Docker image containing spark 2.0

2016-08-25 Thread Michael Gummelt
You have a space between "build" and "mvn" On Thu, Aug 25, 2016 at 1:31 PM, Marco Mistroni wrote: > HI all > sorry for the partially off-topic, i hope there's someone on the list who > has tried the same and encountered similar issuse > > Ok so i have created a Docker file

Re: Converting DataFrame's int column to Double

2016-08-25 Thread Jestin Ma
How about this: df.withColumn("doubles", col("ints").cast("double")).drop("ints") On Thu, Aug 25, 2016 at 2:09 PM, Marco Mistroni wrote: > hi all > i might be stuck in old code, but this is what i am doing to convert a > DF int column to Double > > val

Re: Converting DataFrame's int column to Double

2016-08-25 Thread Marco Mistroni
many tx Jestin! On Thu, Aug 25, 2016 at 10:13 PM, Jestin Ma wrote: > How about this: > > df.withColumn("doubles", col("ints").cast("double")).drop("ints") > > On Thu, Aug 25, 2016 at 2:09 PM, Marco Mistroni > wrote: > >> hi all >> i might be

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Ofir Manor
Just to add one concrete example regarding the HDFS dependency. Have a look at checkpointing: https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing For example, for Spark Streaming, you cannot do any window operation in a cluster without checkpointing to HDFS (or S3).
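
A minimal sketch of what that looks like in code ("pairs" is assumed to be a DStream[(String, Int)], and the checkpoint path is illustrative):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(10))
  // The checkpoint directory must live on a fault-tolerant store (HDFS, S3, ...).
  ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

  // Window operations such as reduceByKeyAndWindow require the checkpoint above.
  val windowedCounts = pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))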

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
@Ofir @Sean very good points. @Mike We don't use Kafka or Hive, and I understand that Zookeeper can do many things, but for our use case all we need is high availability, and given the frustrations of the DevOps people here in our company, who have extensive experience managing large clusters in the past

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Michael Gummelt
Mesos also uses ZK for leader election. There seems to be some effort in supporting etcd, but it's in progress: https://issues.apache.org/jira/browse/MESOS-1806 On Thu, Aug 25, 2016 at 1:55 PM, kant kodali wrote: > @Ofir @Sean very good points. > > @Mike We dont use Kafka

Re: Please assist: Building Docker image containing spark 2.0

2016-08-25 Thread Marco Mistroni
No, I won't accept that :) I can't believe I have wasted 3 hrs on a space! Many thanks Michael! kr On Thu, Aug 25, 2016 at 10:01 PM, Michael Gummelt wrote: > You have a space between "build" and "mvn" > > On Thu, Aug 25, 2016 at 1:31 PM, Marco Mistroni

Re: Using spark to distribute jobs to standalone servers

2016-08-25 Thread Igor Berman
imho, you'll need to implement a custom RDD with your locality settings (i.e. a custom implementation of discovering where each partition is located) plus the spark.locality.wait setting. On 24 August 2016 at 03:48, Mohit Jaggi wrote: > It is a bit hacky but possible. A lot depends
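
A rough sketch of the custom-RDD part, under the assumption that you know up front which host should handle each partition (all names here are hypothetical):

  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // Each partition remembers the host that should process it.
  case class HostPartition(index: Int, host: String) extends Partition

  class HostAwareRDD(sc: SparkContext, hosts: Seq[String]) extends RDD[String](sc, Nil) {

    override protected def getPartitions: Array[Partition] =
      hosts.zipWithIndex.map { case (h, i) => HostPartition(i, h) }.toArray[Partition]

    // The scheduler prefers to run each partition on the host returned here,
    // subject to spark.locality.wait.
    override protected def getPreferredLocations(split: Partition): Seq[String] =
      Seq(split.asInstanceOf[HostPartition].host)

    override def compute(split: Partition, context: TaskContext): Iterator[String] =
      Iterator(s"work for ${split.asInstanceOf[HostPartition].host}")
  }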

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
Yeah, so it seems like it's a work in progress. At the very least Mesos took the initiative to provide alternatives to ZK. I am just really looking forward to this. https://issues.apache.org/jira/browse/MESOS-3797 On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote: Mesos

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
@Mich I understand why I would need Zookeeper. It is there for fault tolerance given that Spark is a master-slave architecture, and when a master goes down Zookeeper will run a leader election algorithm to elect a new leader; however, DevOps hate Zookeeper and would be much happier to go with etcd &

Re: quick question

2016-08-25 Thread kant kodali
Your assumption is right (thats what I intend to do). My driver code will be in Java. The link sent by Kevin is a API reference to websocket. I understand how websockets works in general but my question was more geared towards seeing the end to end path on how front end dashboard gets updated in

Re: Caching broadcasted DataFrames?

2016-08-25 Thread Takeshi Yamamuro
Hi, you need to cache df1 to prevent re-computation (including disk reads) because Spark re-broadcasts the data on every SQL execution. // maropu On Fri, Aug 26, 2016 at 2:07 AM, Jestin Ma wrote: > I have a DataFrame d1 that I would like to join with two separate >
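
Following that advice, a minimal sketch (d2, d3 and the join key "id" are assumptions for illustration):

  import org.apache.spark.sql.functions.broadcast

  val d1Cached = d1.cache()   // keeps d1's partitions in memory so each broadcast reuses them

  val joinedA = d2.join(broadcast(d1Cached), "id")
  val joinedB = d3.join(broadcast(d1Cached), "id")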

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mich Talebzadeh
Hi Kant, I trust the following would be of use. Big Data depends on Hadoop Ecosystem from whichever angle one looks at it. In the heart of it and with reference to points you raised about HDFS, one needs to have a working knowledge of Hadoop Core System including HDFS, Map-reduce algorithm and

Converting DataFrame's int column to Double

2016-08-25 Thread Marco Mistroni
Hi all, I might be stuck in old code, but this is what I am doing to convert a DF int column to Double: val intToDoubleFunc:(Int => Double) = lbl => lbl.toDouble val labelToDblFunc = udf(intToDoubleFunc) val convertedDF = df.withColumn("SurvivedDbl", labelToDblFunc(col("Survived"))) is there a

Re: UDF on lpad

2016-08-25 Thread Mike Metzger
Is this what you're after? def padString(id: Int, chars: String, length: Int): String = chars * length + id.toString padString(123, "0", 10) res4: String = 0000000000123 Mike On Thu, Aug 25, 2016 at 12:39 PM, Mich Talebzadeh wrote: > Thanks Mike. > > Can one
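
To apply this to a DataFrame column (as asked earlier in the thread), the same logic can be wrapped in a UDF; a minimal sketch with an assumed column name "id":

  import org.apache.spark.sql.functions.{col, lit, udf}

  // Prepend `length` copies of `chars` to the id, as in padString above.
  val padUDF = udf((id: Int, chars: String, length: Int) => chars * length + id.toString)

  val padded = df.withColumn("padded_id", padUDF(col("id"), lit("0"), lit(10)))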

Running yarn with spark not working with Java 8

2016-08-25 Thread Anil Langote
Hi All, I have a cluster with 1 master and 6 slaves which uses the pre-built version of hadoop 2.6.0 and spark 1.6.2. I was running hadoop MR and spark jobs without any problem with openjdk 7 installed on all the nodes. However when I upgraded openjdk 7 to openjdk 8 on all nodes, spark submit and

Re: quick question

2016-08-25 Thread Sivakumaran S
A Spark job using a streaming context is an endless "while" loop till you kill it or specify when to stop. Initiate a TCP Server before you start the stream processing (https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_a_WebSocket_server_in_Java
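
A sketch of the driver-side pattern being described ("windowedStream" and "sendToDashboard" are placeholders for your own stream and for whatever websocket/TCP server you start before ssc.start()):

  // Placeholder: pushes a JSON string to connected dashboard clients.
  def sendToDashboard(json: String): Unit = ???

  windowedStream.foreachRDD { rdd =>
    val results = rdd.collect()                  // results are gathered back on the driver
    val json = results.mkString("[", ",", "]")   // format however the dashboard expects
    sendToDashboard(json)
  }

  ssc.start()
  ssc.awaitTermination()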

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
One way you can start to make this make more sense, Sean, is if you exploit the code/data duality so that the non-distributed data that you are sending out from the driver is actually paying a role more like code (or at least parameters.) What is sent from the driver to an Executer is then used

Re: quick question

2016-08-25 Thread kant kodali
@Sivakumaran Thanks a lot and this should conclude the thread! On Thu, Aug 25, 2016 12:39 PM, Sivakumaran S siva.kuma...@me.com wrote: A Spark job using a streaming context is an endless “while" loop till you kill it or specify when to stop. Initiate a TCP Server before you start the stream

Insert non-null values from dataframe

2016-08-25 Thread Selvam Raman
Hi, Dataframe:
  colA colB colC colD colE
  1    2    3    4    5
  1    2    3    null null
  1    null null null 5
  null null 3    4    5
I want to insert the dataframe into a NoSQL database (Cassandra), where nulls would otherwise occupy values, so I have to insert only the columns which have non-null values in each row. Expected: Record 1: (1,2,3,4,5)
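
One way to build the per-row "non-null columns only" view before handing the rows to the Cassandra writer; a minimal sketch over the dataframe above (the actual write is omitted):

  // Pair each value with its column name and drop the nulls, row by row.
  val fieldNames = df.schema.fieldNames

  val nonNullRows = df.rdd.map { row =>
    fieldNames.zip(row.toSeq)
      .filter { case (_, value) => value != null }
      .toMap
  }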

Please assist: Building Docker image containing spark 2.0

2016-08-25 Thread Marco Mistroni
Hi all, sorry for being partially off-topic; I hope there's someone on the list who has tried the same and encountered similar issues. Ok so I have created a Docker file to build an ubuntu container which includes spark 2.0, but somehow when it gets to the point where it has to kick off ./build/mvn

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
s/paying a role/playing a role/ On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra wrote: > One way you can start to make this make more sense, Sean, is if you > exploit the code/data duality so that the non-distributed data that you are > sending out from the driver is

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
That's often not as important as you might think. It really only affects the loading of data by the first Stage. Subsequent Stages (in the same Job or even in other Jobs if you do it right) will use the map outputs, and will do so with good data locality. On Thu, Aug 25, 2016 at 3:36 PM, ayan

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Michael Gummelt
> You would lose the ability to process data closest to where it resides if you do not use hdfs. This isn't true. Many other data sources (e.g. Cassandra) support locality. On Thu, Aug 25, 2016 at 3:36 PM, ayan guha wrote: > At the core of it map reduce relies heavily on

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread ayan guha
At the core of it, MapReduce relies heavily on data locality. You would lose the ability to process data closest to where it resides if you do not use HDFS. S3 or NFS will not be able to provide that. On 26 Aug 2016 07:49, "kant kodali" wrote: > yeah so its seems like its work

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Sean Owen
Without a distributed storage system, your application can only create data on the driver and send it out to the workers, and collect data back from the workers. You can't read or write data in a distributed way. There are use cases for this, but pretty limited (unless you're running on 1

How to do this pairing in Spark?

2016-08-25 Thread Rex X
1. Given the following CSV file
  $cat data.csv
  ID,City,Zip,Flag
  1,A,95126,0
  2,A,95126,1
  3,A,95126,1
  4,B,95124,0
  5,B,95124,1
  6,C,95124,0
  7,C,95127,1
  8,C,95127,0
  9,C,95127,1
(a) where "ID" above is a primary key (unique), (b) for

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
How about using ZFS? On Thu, Aug 25, 2016 3:48 PM, Mark Hamstra m...@clearstorydata.com wrote: That's often not as important as you might think. It really only affects the loading of data by the first Stage. Subsequent Stages (in the same Job or even in other Jobs if you do it right) will

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
The ZFS Linux port has become very stable these days, given that LLNL maintains it and also uses it as the filesystem for their supercomputer (one of the top supercomputers in the nation, from what I heard). On Thu, Aug 25, 2016 4:58 PM, kant kodali kanth...@gmail.com wrote: How about