About broadcast join of base table in Spark SQL

2017-06-28 Thread paleyl
Hi All, recently I met a problem with broadcast join: I want to left join tables A and B, where A is the smaller one and the left table, so I wrote A = A.join(B, A("key1") === B("key2"), "left"), but I found that A is not broadcast, as the shuffle size is still very large. I guess this is a designed
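
This is indeed by design: Spark only broadcasts the non-preserved side of an outer join, so the left (preserved) table of a left join is never broadcast, however small it is. A minimal Scala sketch of what is and is not supported, reusing the names from the message:

    import org.apache.spark.sql.functions.broadcast

    // Supported: broadcast B, the right (non-preserved) side of a left join.
    val joined = A.join(broadcast(B), A("key1") === B("key2"), "left")

    // Not supported: every row of the preserved side A must be emitted,
    // so the hint below is ignored and Spark falls back to a shuffle join.
    // val joined2 = broadcast(A).join(B, A("key1") === B("key2"), "left")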

Re: Spark Project Build Issues (IntelliJ)

2017-06-28 Thread satyajit vegesna
Hi, I was able to successfully build the project (source code) from IntelliJ. But when I try to run any of the examples present in the $SPARK_HOME/examples folder, I am getting different errors for different example jobs. For example, for the StructuredKafkaWordCount example: Exception in thread "main"

SparkSQL to read XML Blob data to create multiple rows

2017-06-28 Thread Talap, Amol
Hi: We are trying to parse XML data to get the below output from the given input sample. Can someone suggest a way to pass one DataFrame's output into the load() function, or any other alternative, to get this output? Input data from Oracle table XMLBlob: SequenceID Name City XMLComment 1 Amol Kolhapur

Building Kafka 0.10 Source for Structured Streaming Error.

2017-06-28 Thread satyajit vegesna
Hi All, I am trying to build the kafka-0-10-sql module under the external folder in the Apache Spark source code. Once I generate the jar file using build/mvn package -DskipTests -pl external/kafka-0-10-sql, I get the jar file created under external/kafka-0-10-sql/target, and try to run spark-shell with jars

Re: Spark job profiler results showing high TCP CPU time

2017-06-28 Thread Reth RM
I am using VisualVM: https://github.com/krasa/VisualVMLauncher @Marcelo, thank you for the reply, that was helpful. On Fri, Jun 23, 2017 at 12:48 PM, Eduardo Mello wrote: > What program do you use to profile Spark? > > On Fri, Jun 23, 2017 at 3:07 PM, Marcelo Vanzin

Re: Building Kafka 0.10 Source for Structured Streaming Error.

2017-06-28 Thread Shixiong(Ryan) Zhu
"--package" will add transitive dependencies that are not "$SPARK_HOME/external/kafka-0-10-sql/target/*.jar". > i have tried building the jar with dependencies, but still face the same error. What's the command you used? On Wed, Jun 28, 2017 at 12:00 PM, satyajit vegesna <

Re: Building Kafka 0.10 Source for Structured Streaming Error.

2017-06-28 Thread satyajit vegesna
I have updated the pom.xml in the external/kafka-0-10-sql folder (the changes were highlighted in yellow in the original message) as below, and have run the command build/mvn package -DskipTests -pl external/kafka-0-10-sql, which generated spark-sql-kafka-0-10_2.11-2.3.0-SNAPSHOT-jar-with-dependencies.jar

Spark Project Build Issues (IntelliJ)

2017-06-28 Thread satyajit vegesna
Hi All, when I try to build the Apache Spark source code from https://github.com/apache/spark.git, I am getting the below errors: Error:(9, 14) EventBatch is already defined as object EventBatch public class EventBatch extends org.apache.avro.specific.SpecificRecordBase implements

Re: Structured Streaming Questions

2017-06-28 Thread Tathagata Das
Answers inline. On Wed, Jun 28, 2017 at 10:27 AM, Revin Chalil wrote: > I am using Structured Streaming with Spark 2.1 and have some basic questions. > > · Is there a way to automatically refresh the Hive Partitions when using the Parquet Sink with partitioning?

Re: Building Kafka 0.10 Source for Structured Streaming Error.

2017-06-28 Thread ayan guha
--jars does not do wildcard expansion; list the jars out comma-separated. On Thu, 29 Jun 2017 at 5:17 am, satyajit vegesna wrote: > I have updated the pom.xml in the external/kafka-0-10-sql folder, as > below, and have run the command > build/mvn package
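
In other words, enumerate the jars explicitly instead of using a glob (paths illustrative; the second jar stands for any extra dependency):

    spark-shell --jars external/kafka-0-10-sql/target/spark-sql-kafka-0-10_2.11-2.3.0-SNAPSHOT-jar-with-dependencies.jar,/path/to/extra-dependency.jar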

Re: Spark Project Build Issues (IntelliJ)

2017-06-28 Thread Dongjoon Hyun
Did you follow the guide in `IDE Setup` -> `IntelliJ` section of http://spark.apache.org/developer-tools.html ? Bests, Dongjoon. On Wed, Jun 28, 2017 at 5:13 PM, satyajit vegesna < satyajit.apas...@gmail.com> wrote: > Hi All, > > When i try to build source code of apache spark code from >

Re: HDP 2.5 - Python - Spark-On-Hbase

2017-06-28 Thread ayan guha
Hi, thanks to all of you, I got the HBase connector working. There are still some details around namespaces pending, but overall it is working well. Now, as usual, I would like to use the same concept in Structured Streaming. Is there any similar way I can use writeStream.format and use
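
Structured Streaming in Spark 2.1 has no built-in HBase sink, so one route is a custom ForeachWriter. A sketch only: streamingDF stands for the streaming DataFrame, and the HBase client calls are elided.

    import org.apache.spark.sql.{ForeachWriter, Row}

    val hbaseWriter = new ForeachWriter[Row] {
      override def open(partitionId: Long, version: Long): Boolean = {
        true // open an HBase connection here
      }
      override def process(row: Row): Unit = {
        // convert the row to a Put and write it to HBase
      }
      override def close(errorOrNull: Throwable): Unit = {
        // close the connection
      }
    }

    val query = streamingDF.writeStream.foreach(hbaseWriter).start()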

RE: IDE for Python

2017-06-28 Thread Sotola, Radim
PyCharm is a good choice. I buy a monthly subscription and can see that PyCharm development continues (I mean that this is not a tool which somebody develops and then leaves without any upgrades). From: Abhinay Mehta [mailto:abhinay.me...@gmail.com] Sent: Wednesday, June 28, 2017 11:06 AM To: ayan

Re: [PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Nick Pentreath
You will need to use PySpark vectors to store in a DataFrame. They can be created from NumPy arrays as follows: from pyspark.ml.linalg import Vectors df = spark.createDataFrame([("src1", "pkey1", 1, Vectors.dense(np.array([0, 1, 2])))]) On Wed, 28 Jun 2017 at 12:23 Judit Planas

Re: [ML] Stop conditions for RandomForest

2017-06-28 Thread Yan Facai
It seems that a split will always stop when the count of nodes is less than max(X, Y). Hence, are they different? On Tue, Jun 27, 2017 at 11:07 PM, OBones wrote: > Hello, > > Reading around on the theory behind tree-based regression, I concluded > that there are various reasons to

[PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Judit Planas
Dear all, I am trying to store a NumPy array (loaded from an HDF5 dataset) into one cell of a DataFrame, but I am having problems. In short, my data layout is similar to a database, where I have a few columns with metadata (source of information, primary key,

RE: IDE for Python

2017-06-28 Thread Md. Rezaul Karim
By the way, PyCharm from JetBrains also has a Community Edition, which is free and open source. Moreover, if you are a student, you can use the Professional Edition as well. For more, see here: https://www.jetbrains.com/student/ On Jun 28, 2017 11:18 AM, "Sotola, Radim"

(Spark-ml) java.util.NoSuchElementException: key not found exception on doing prediction and computing test error.

2017-06-28 Thread neha nihal
Thanks, it's working now. My test data had some labels which were not in the training set. On Wednesday, June 28, 2017, Pralabh Kumar wrote: > Hi Neha > > This generally occurs when your training data set has

RE: IDE for Python

2017-06-28 Thread Sotola, Radim
I know. But I pay around 20 euros per month for all products from JetBrains and I think this is not so much – in the Czech Republic it is one evening in a pub. From: Md. Rezaul Karim [mailto:rezaul.ka...@insight-centre.org] Sent: Wednesday, June 28, 2017 12:55 PM To: Sotola, Radim Cc:

Re: [ML] Stop conditions for RandomForest

2017-06-28 Thread OBones
To me, they are. Y is used to control whether a split is a valid candidate when deciding which one to follow. X is used to make a node a leaf if it has too few elements to even consider candidate splits. 颜发才(Yan Facai) wrote: It seems that a split will always stop when the count of nodes is less than
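
For concreteness, the two stopping parameters in Spark ML that this X/Y discussion most plausibly maps onto (my reading, not stated explicitly in the thread) are minInstancesPerNode and minInfoGain:

    import org.apache.spark.ml.regression.RandomForestRegressor

    val rf = new RandomForestRegressor()
      .setMinInstancesPerNode(20) // split invalid if either child gets fewer rows
      .setMinInfoGain(0.01)       // split invalid if it gains too little information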

How to propagate Non-Empty Value in Spark SQL Dataset

2017-06-28 Thread carloallocca
Dear All, I am trying to propagate the last valid observation (i.e., not null) to the null values in a dataset. Below I report the partial solution: Dataset tmp800=tmp700.select("uuid", "eventTime", "Washer_rinseCycles"); WindowSpec wspec=
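
Completing the idea in Scala (the column and key names come from the snippet; the window bounds are my assumption), the usual last-observation-carried-forward pattern is last(..., ignoreNulls = true) over a running window:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.last

    val wspec = Window.partitionBy("uuid")
      .orderBy("eventTime")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    val filled = tmp800.withColumn(
      "Washer_rinseCycles_filled",
      last("Washer_rinseCycles", ignoreNulls = true).over(wspec))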

Re: [PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Judit Planas
Dear Nick, thanks for your quick reply. I quickly implemented your proposal, but I do not see any improvement. In fact, the test data set of around 3 GB occupies a total of 10 GB in worker memory, and the execution time of queries is about 4 times slower

How to Fill Sparse Data With the Previous Non-Empty Value in a Spark SQL Dataset

2017-06-28 Thread Carlo Allocca
Dear All, I am trying to propagate the last valid observation (i.e., not null) to the null values in a dataset. Below I report the partial solution: Dataset tmp800=tmp700.select("uuid", "eventTime", "Washer_rinseCycles"); WindowSpec wspec=

Using Apache Spark standalone on a server for a class/multiple users, db.lck does not get removed

2017-06-28 Thread Robert Kudyba
We have a Big Data class planned and we'd like students to be able to start spark-shell or pyspark as their own user. However, the Derby database lock prevents the process from starting as another user: -rw-r--r-- 1 myuser staff 38 Jun 28 10:40 db.lck And these errors appear: ERROR PoolWatchThread:
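
One possible workaround (my suggestion, not from the thread): give each user their own warehouse and Derby home so the db.lck files do not collide, e.g.:

    spark-shell \
      --conf spark.sql.warehouse.dir=$HOME/spark-warehouse \
      --driver-java-options "-Dderby.system.home=$HOME/derby"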

Re: What is the equivalent of mapPartitions in Spark SQL?

2017-06-28 Thread jeff saremi
I have to read up on the writer. But would the writer get records back from somewhere? I want to do a bulk operation and continue with the results in the form of a DataFrame. Currently the UDF does this: 1 scalar -> 1 scalar; the UDAF does this: M records -> 1 scalar; I want this: M records ->
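
For what it's worth, Dataset.mapPartitions has exactly this M-records-in, N-records-out shape. A self-contained Scala sketch, where Rec and bulkLookup are illustrative stand-ins for the real row type and bulk operation:

    import org.apache.spark.sql.SparkSession

    case class Rec(key: String, value: Double)

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val ds = Seq(Rec("a", 1.0), Rec("b", 2.0), Rec("c", 3.0)).toDS()

    // stand-in for an expensive bulk operation over many records at once
    def bulkLookup(batch: Seq[Rec]): Seq[Rec] =
      batch.map(r => r.copy(value = r.value * 2))

    // M records in, N records out, processed a partition at a time
    val out = ds.mapPartitions(rows => rows.grouped(1000).flatMap(bulkLookup))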

Re: PySpark 2.1.1 Can't Save Model - Permission Denied

2017-06-28 Thread Yanbo Liang
It looks like your Spark job was running under user root, but your file system operation was running under user jomernik. Since Spark will call the corresponding file system (such as HDFS, S3) to commit the job (renaming the temporary file to a persistent one), it should have correct authorization for both Spark and

Re: IDE for Python

2017-06-28 Thread Xiaomeng Wan
Thanks to all of you. I will give PyCharm a try. Regards, Shawn On 28 June 2017 at 06:07, Sotola, Radim wrote: > I know. But I pay around 20 euros per month for all products from JetBrains > and I think this is not so much – in the Czech Republic it is one evening in a pub.

Structured Streaming Questions

2017-06-28 Thread Revin Chalil
I am using Structured Streaming with Spark 2.1 and have some basic questions. * Is there a way to automatically refresh the Hive Partitions when using the Parquet Sink with partitioning? My query looks like the below: val queryCount = windowedCount
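
For context, a partitioned Parquet sink in 2.1 looks roughly like this (the paths and the partition column are illustrative; the thread's own query is truncated above):

    val query = windowedCount.writeStream
      .format("parquet")
      .option("path", "/data/event_counts")
      .option("checkpointLocation", "/checkpoints/event_counts")
      .partitionBy("date")
      .start()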

Re: IDE for Python

2017-06-28 Thread Abhinay Mehta
I use PyCharm and it works a treat. The big advantage I find is that I can use the same command shortcuts that I do in IntelliJ IDEA when developing in Scala or Java. On 27 June 2017 at 23:29, ayan guha wrote: > Depends on the need. For data exploration, I use

Re: How do I find the time taken by each step in a stage in a Spark Job

2017-06-28 Thread ??????????
You can find the information in the Spark UI. ---Original--- From: "SRK" Date: 2017/6/28 02:36:37 To: "user"; Subject: How do I find the time taken by each step in a stage in a Spark Job Hi, How do I find the time taken by each step in a

Re: (Spark-ml) java.util.NoSuchElementException: key not found exception on doing prediction and computing test error.

2017-06-28 Thread Pralabh Kumar
Hi Neha, this generally occurs when your training data set has some value of a categorical variable which is not in your testing data. For example, you have a column DAYS with values M, T, W in the training data, but your test data contains F; then it throws a key-not-found exception. Please look
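
One common guard for this (a sketch; the DAYS column comes from the example above) is StringIndexer's handleInvalid option, which can skip unseen categories instead of throwing:

    import org.apache.spark.ml.feature.StringIndexer

    val indexer = new StringIndexer()
      .setInputCol("DAYS")
      .setOutputCol("DAYS_indexed")
      // drop rows whose category was never seen during fit, rather than
      // failing with a key-not-found error at transform time
      .setHandleInvalid("skip")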

Fwd: (Spark-ml) java.util.NoSuchElementException: key not found exception on doing prediction and computing test error.

2017-06-28 Thread neha nihal
Hi, I am using Apache Spark 2.0.2 RandomForest ML (standalone mode) for text classification. A TF-IDF feature extractor is also used. The training part runs without any issues and returns 100% accuracy. But when I try to do prediction using the trained model and compute the test error, it fails with