Re: How to specify file

2016-09-23 Thread Hemant Bhanawat
Check out the README on the following page. This is the CSV connector that you are using. I think you need to specify the delimiter option. https://github.com/databricks/spark-csv Hemant Bhanawat www.snappydata.io On Fri, Sep 23, 2016 at 12:
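For reference, a minimal sketch of setting the delimiter with spark-csv (assuming Spark 1.x with the package on the classpath; the path is hypothetical):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")      // treat the first line as a header
      .option("delimiter", "\t")     // read tab-separated input
      .load("/path/to/file.tsv")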

Re: How to specify file

2016-09-23 Thread Aditya
Hi Sea, For using Spark SQL you will need to create a DataFrame from the file and then execute select * on the DataFrame. In your case you will need to do something like this: JavaRDD<String> DF = context.textFile("path"); JavaRDD<Row> rowRDD3 = DF.map(new Function<String, Row>() { public Row call(
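A Scala sketch of the same approach (the column names and comma delimiter are assumptions for illustration):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val lines = sc.textFile("path")                          // one record per line
    val rows = lines.map(_.split(",")).map(a => Row(a: _*))  // split each line into columns
    val schema = StructType(Seq(StructField("c0", StringType), StructField("c1", StringType)))
    val df = sqlContext.createDataFrame(rows, schema)
    df.registerTempTable("mytable")                          // expose it to SQL
    sqlContext.sql("SELECT * FROM mytable").show()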

Re: How to specify file

2016-09-23 Thread Sea
Hi Hemant, Aditya: I don't want to create a temp table and write code; I just want to run SQL directly on files: "select * from csv.`/path/to/file`" ------ Original ------ From: "Hemant Bhanawat"; Date: 2016-09-23 (Fri) 3:32; To: "Sea"<261810...@qq
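For what it's worth, Spark 2.0 does support querying a file directly without registering a temp table; a sketch:

    // requires Spark 2.0+; `spark` is the SparkSession
    spark.sql("SELECT * FROM csv.`/path/to/file`").show()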

Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
Hello there, I have a Spark application which refers to an external file ‘abc.drl’ containing unstructured data. The application is able to find this reference file if I run the app in local mode, but in YARN cluster mode it is not able to find the file in the specified path. I tried with both local a

Re: How to specify file

2016-09-23 Thread Mich Talebzadeh
You can do the following with option("delimiter"): val df = spark.read.option("header", false).option("delimiter","\t").csv("hdfs://rhes564:9000/tmp/nw_10124772.tsv") HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread Aditya
Hi Abhishek, From your spark-submit it seems you're passing the file as a parameter to the driver program. So now it depends on what exactly you are doing with that parameter. Using the --files option it will be available to all the worker nodes, but if in your code you are referencing it using the spe

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread Steve Loughran
On 23 Sep 2016, at 08:33, ABHISHEK <abhi...@gmail.com> wrote: at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: hdfs:/abc.com:8020/user/abhietc/abc.drl (No such file or directory) at java.io.FileI
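The stack trace suggests the path is being opened with java.io, which treats the hdfs: URL as a local file name. A sketch of reading an HDFS path with the Hadoop FileSystem API instead (assuming the process has the cluster's Hadoop configuration):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val in = fs.open(new Path("hdfs://abc.com:8020/user/abhietc/abc.drl"))  // returns an InputStream
    // ... read the stream, then close it
    in.close()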

UDAF collect_list: Hive Query or spark sql expression

2016-09-23 Thread Jason Mop
Hi Spark Team, I see most Hive functions have been implemented as Spark SQL expressions, but collect_list is still using the Hive implementation. Will it also be implemented as an expression in the future? Any update? Cheers, Ming
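For context, a small sketch of collect_list through the DataFrame API (the column names are made up for illustration):

    import org.apache.spark.sql.functions.collect_list

    // gather all items per user into an array column
    // (in Spark 1.6 this needs a HiveContext; Spark 2.0 supports it natively)
    val grouped = df.groupBy("user").agg(collect_list("item").as("items"))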

Re: Apache Spark JavaRDD pipe() need help

2016-09-23 Thread शशिकांत कुलकर्णी
Thank you Jakob. I will try as suggested. Regards, Shashi On Fri, Sep 23, 2016 at 12:14 AM, Jakob Odersky wrote: > Hi Shashikant, > > I think you are trying to do too much at once in your helper class. > Spark's RDD API is functional; it is meant to be used by writing many > little transformati
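As an aside, since this thread is about pipe(): a minimal sketch of piping an RDD through an external command (assuming `sort` is on the PATH of every worker):

    // each partition's elements are written to the process's stdin, one per line;
    // lines of the process's stdout become the elements of the result RDD
    val piped = sc.parallelize(Seq("3", "1", "2")).pipe("sort")
    piped.collect()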

Tuning Spark memory

2016-09-23 Thread tan shai
Hi, I am working with Spark 2.0; the job starts by sorting the input data and storing the output on HDFS. I am getting out-of-memory errors; the solution was to increase the value of spark.shuffle.memoryFraction from 0.2 to 0.8, and this solved the problem. But in the documentation I have found th

ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.

2016-09-23 Thread muhammet pakyürek
I tried to connect to Cassandra via spark-cassandra-connector 2.0.0 on PySpark, but I get the error below. I think it's related to pyspark/context.py, but I don't know how.

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
Thanks for your response, Aditya and Steve. Steve: I have tried specifying both /tmp/filename in HDFS and a local path, but it didn't work. You may be right that the Kie session is configured to access files from a local path. I have attached code here for your reference and if you find something wrong, p

Re: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-23 Thread Cody Koeninger
For Spark 2.0 there are two Kafka artifacts, spark-streaming-kafka-0-10 (0.10 and later brokers only) and spark-streaming-kafka-0-8 (should work with 0.8 and later brokers). The docs explaining this were merged to master just after 2.0 was released, so they haven't been published yet. There are usage
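For example, the sbt coordinate for the 0.10 integration would look like this (a sketch; match the version to your Spark release):

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"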

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread Aditya
Hi Abhishek, Try below spark submit. spark-submit --master yarn --deploy-mode cluster --files hdfs://abc.com:8020/tmp/abc.drl --class com.abc.StartMain abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar abc.drl On Friday 23 Septe
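When a file is shipped with --files, it is copied into each container's working directory and can be resolved with SparkFiles; a sketch:

    import org.apache.spark.SparkFiles

    // resolves to the node-local copy of the distributed file
    val drlPath = SparkFiles.get("abc.drl")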

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
I have tried with hdfs/tmp location but it didn't work. Same error. On 23 Sep 2016 19:37, "Aditya" wrote: > Hi Abhishek, > > Try below spark submit. > spark-submit --master yarn --deploy-mode cluster --files hdfs:// > abc.com:8020/tmp/abc.drl --class com.abc.StartMain > abc-0.0.1-SNAPSHOT-jar-w

Spark MLlib ALS algorithm

2016-09-23 Thread Roshani Nagmote
Hello, I was working on the Spark MLlib ALS matrix factorization algorithm and came across the following blog post: https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html Can anyone help me understand what the "s" scaling factor does, and does it really give bett

Re: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-23 Thread sagarcasual .
Hi, thanks for the response. The issue I am facing is only with the clustered Kafka 2.11-based version 0.10.0.1 and Spark 1.6.1, with the following dependencies: org.apache.spark:spark-core_2.10:1.6.1 compile group: 'org.apache.spark', name: 'spark-streaming_2.10', version:'1.6.1' compile group: 'org.apa

Re: 答复: 答复: it does not stop at breakpoints which is in an anonymous function

2016-09-23 Thread Dirceu Semighini Filho
Hi Felix, I just ran your code and it prints "Pi is roughly 4.0". Here is the code that I used; as you didn't show what `random` is, I used nextInt(): val n = math.min(10L * slices, Int.MaxValue).toInt // avoid overflow val count = context.sparkContext.parallelize(1 until n, slices).map

Can somebody remove this guy?

2016-09-23 Thread Dirceu Semighini Filho
Can somebody remove this guy from the list: tod...@yahoo-inc.com? I just sent a message to the list and received an email from Yahoo saying that this address doesn't exist anymore: "This is an automatically generated message. tod...@yahoo-inc.com is no longer with Yahoo! Inc. Your message will not be fo

Optimal/Expected way to run demo spark-scala scripts?

2016-09-23 Thread Dan Bikle
hello spark-world, I am new to Spark and want to learn how to use it. I come from the Python world. I see an example at the URL below: http://spark.apache.org/docs/latest/ml-pipeline.html#example-estimator-transformer-and-param What would be an optimal way to run the above example? In the Pyt

Re: Open source Spark based projects

2016-09-23 Thread manasdebashiskar
Check out Spark Packages at https://spark-packages.org/ and you will find a few awesome and a lot of super-awesome projects.

Re: Is executor computing time affected by network latency?

2016-09-23 Thread Peter Figliozzi
See the reference on shuffles: "Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors

databricks spark-csv: linking coordinates are what?

2016-09-23 Thread Dan Bikle
hello world-of-spark, I am learning spark today. I want to understand the spark code in this repo: https://github.com/databricks/spark-csv In the README.md I see this info: Linking You can link against this library in your program at the following coordinates: Scala 2.10 groupId: com.databri

Re: Is executor computing time affected by network latency?

2016-09-23 Thread Mich Talebzadeh
Does this assume that Spark is running on the same hosts as HDFS? Hence, does increasing the latency affect the network latency on the Hadoop nodes as well in your tests? The best network results are achieved when Spark nodes share the same hosts as Hadoop, or when they happen to be on the same subnet. HT

Re: databricks spark-csv: linking coordinates are what?

2016-09-23 Thread Holden Karau
So the good news is that the CSV library has been integrated into Spark 2.0, so you don't need to use that package. On the other hand, if you're on an older version, you can include it using the standard sbt or Maven package configuration. On Friday, September 23, 2016, Dan Bikle wrote: > hello world-of
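For instance, the sbt coordinate would look like this on a pre-2.0 build (a sketch; check the README for the current version number):

    libraryDependencies += "com.databricks" %% "spark-csv" % "1.4.0"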

Re: Optimal/Expected way to run demo spark-scala scripts?

2016-09-23 Thread Kevin Mellott
You can run Spark code using the command line or by creating a JAR file (via IntelliJ or other IDE); however, you may wish to try a Databricks Community Edition account instead. They offer Spark as a managed service, and you can run Spark commands one at a time via interactive notebooks. There are
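For example, with a local Spark 2.x install you can launch ./bin/spark-shell and paste snippets directly; a sketch of the start of the linked example (the `spark` session is pre-built in the shell):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors

    // a tiny training set of (label, features) rows
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0))
    )).toDF("label", "features")

    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)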

Re: Is executor computing time affected by network latency?

2016-09-23 Thread Mark Hamstra
> > The best network results are achieved when Spark nodes share the same > hosts as Hadoop or they happen to be on the same subnet. > That's only true for those portions of a Spark execution pipeline that are actually reading from HDFS. If you're re-using an RDD for which the needed shuffle file

Running Spark master/slave instances in non Daemon mode

2016-09-23 Thread Jeff Puro
Hi, I recently tried deploying Spark master and slave instances to container-based environments such as Docker, Nomad, etc. There are two issues that I've found with how the startup scripts work. The sbin/start-master.sh and sbin/start-slave.sh scripts start a daemon by default, but this isn't as compatibl
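One workaround (a sketch, not an official recommendation; the hostname is a placeholder) is to bypass the sbin wrappers and launch the underlying classes directly, which keeps them in the foreground:

    ./bin/spark-class org.apache.spark.deploy.master.Master --host master-host
    ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master-host:7077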

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ayan guha
You may try copying the file to same location on all nodes and try to read from that place On 24 Sep 2016 00:20, "ABHISHEK" wrote: > I have tried with hdfs/tmp location but it didn't work. Same error. > > On 23 Sep 2016 19:37, "Aditya" wrote: > >> Hi Abhishek, >> >> Try below spark submit. >> sp

With spark DataFrame, how to write to existing folder?

2016-09-23 Thread Dan Bikle
spark-world, I am walking through the example here: https://github.com/databricks/spark-csv#scala-api The example complains if I try to write a DataFrame to an existing folder: val selectedData = df.select("year", "model") selectedData.write .format("com.databricks.spark.csv").option("h

Re: With spark DataFrame, how to write to existing folder?

2016-09-23 Thread Yong Zhang
df.write.format(source).mode("overwrite").save(path) Yong From: Dan Bikle Sent: Friday, September 23, 2016 6:45 PM To: user@spark.apache.org Subject: With spark DataFrame, how to write to existing folder? spark-world, I am walking through the example here: h
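Spelled out against the spark-csv example, a sketch (the output path is hypothetical):

    import org.apache.spark.sql.SaveMode

    selectedData.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .mode(SaveMode.Overwrite)   // replace the existing folder instead of failing
      .save("newcars.csv")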

Error in run multiple unit test that extends DataFrameSuiteBase

2016-09-23 Thread Jinyuan Zhou
After I created two test cases, each a FlatSpec with DataFrameSuiteBase, I got errors when running sbt test. I was able to run each of them separately. My test cases do use sqlContext to read files. Here is the exception stack. Judging from the exception, I may need to unregister the RpcEndpoint after ea
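A common fix (assuming the failures come from two suites running in parallel and fighting over a single SparkContext) is to serialize test execution in build.sbt:

    // run test suites one at a time so each gets the context to itself
    parallelExecution in Test := false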

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Have been playing around with configs to crack this. Adding them here where it would be helpful to others :) Number of executors and timeout seemed like the core issue. {code}
--driver-memory 4G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=500 \
--con

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Is there anywhere I can help fix this? I can see the requests being made in the YARN allocator. What should be the upper limit of the requests made? https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222 On Sat, Sep 24, 2016 at 10:2

Re: Tuning Spark memory

2016-09-23 Thread Takeshi Yamamuro
Hi, currently the memory fractions for shuffle and storage are automatically tuned by a memory manager, so you do not need to care about that parameter in most cases. See https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L24 // maropu On
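If you do need to nudge the unified manager, the spark-submit knobs look like this (a sketch; the values are illustrative, not recommendations):

    --conf spark.memory.fraction=0.6 \
    --conf spark.memory.storageFraction=0.5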

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Hi Dhruve, thanks. I've solved the issue by adding max executors. I wanted to find some place where I could add this behavior in Spark so that users do not have to worry about the max executors. Cheers - Thanks, via mobile, excuse brevity. On Sep 24, 2016 1:15 PM, "dhruve ashar" wrote: > F

ideas on de duplication for spark streaming?

2016-09-23 Thread kant kodali
Hi Guys, I have a bunch of data coming into my Spark Streaming cluster from a message queue (not Kafka). This message queue guarantees at-least-once delivery only, so there is potential that some of the messages coming into the Spark Streaming cluster are actually duplicates, and I am trying t
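One pattern for this (a sketch, assuming each message carries a unique id and that checkpointing is enabled, as mapWithState requires) is to keep a seen-set in Spark Streaming state:

    import org.apache.spark.streaming.{Minutes, State, StateSpec}

    // messages: DStream[(String, String)] of (messageId, payload)
    val dedupSpec = StateSpec.function(
      (id: String, payload: Option[String], seen: State[Boolean]) => {
        if (seen.exists) None                 // id seen before: drop the duplicate
        else { seen.update(true); payload }   // first occurrence: pass it through
      }).timeout(Minutes(30))                 // forget ids after a quiet period

    val deduped = messages.mapWithState(dedupSpec).flatMap(_.toList)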

Re: Spark MLlib ALS algorithm

2016-09-23 Thread Nick Pentreath
The scale factor was only to scale up the number of ratings in the dataset for performance testing purposes, to illustrate the scalability of Spark ALS. It is not something you would normally do on your training dataset. On Fri, 23 Sep 2016 at 20:07, Roshani Nagmote wrote: > Hello, > > I was wor