Re: HDP 2.5 - Python - Spark-On-Hbase

2017-06-23 Thread Weiqing Yang
Yes. Which SHC version are you using? If you hit any issues, you can post them in the SHC GitHub issues. There are some threads about this. On Fri, Jun 23, 2017 at 5:46 AM, ayan guha wrote: > Hi > > Is it possible to use SHC from Hortonworks with pyspark? If so, any > working

Re: How does HashPartitioner distribute data in Spark?

2017-06-23 Thread Vadim Semenov
This is the code that chooses the partition for a key: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L85-L88 It's basically `math.abs(key.hashCode % numberOfPartitions)`. On Fri, Jun 23, 2017 at 3:42 AM, Vikash Pareek <
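
[Editorial note] For illustration, a minimal Scala sketch (hypothetical keys and partition count) of how HashPartitioner assigns keys; the actual implementation uses Utils.nonNegativeMod(key.hashCode, numPartitions), which handles negative hash codes slightly differently from math.abs.

    import org.apache.spark.HashPartitioner

    // Hypothetical keys and partition count, purely for illustration.
    val partitioner = new HashPartitioner(4)
    val keys = Seq("apple", "banana", "cherry", "date", "elderberry")

    keys.foreach { k =>
      // getPartition applies nonNegativeMod(key.hashCode, numPartitions) internally.
      println(s"key=$k hashCode=${k.hashCode} partition=${partitioner.getPartition(k)}")
    }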

Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-23 Thread Anton Kravchenko
OK, this one is doing what I want: SparkConf conf = new SparkConf() .set("spark.sql.warehouse.dir", "hdfs://localhost:9000/user/hive/warehouse") .setMaster("local[*]") .setAppName("TestApp"); JavaSparkContext sc = new JavaSparkContext(conf); SparkSession session =

Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-23 Thread Keith Chapman
Hi, I have code that does the following using RDDs, val outputPartitionCount = 300 val part = new MyOwnPartitioner(outputPartitionCount) val finalRdd = myRdd.repartitionAndSortWithinPartitions(part) where myRdd is correctly formed as key, value pairs. I am looking to convert this to use
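
[Editorial note] An approximate Dataset/DataFrame sketch (assuming a DataFrame myDf with a "key" column and the same 300 output partitions; custom partitioners are not supported on Datasets, so hash partitioning on the column stands in for MyOwnPartitioner):

    import org.apache.spark.sql.functions.col

    val outputPartitionCount = 300
    val finalDf = myDf
      .repartition(outputPartitionCount, col("key"))   // shuffle rows by hash of "key"
      .sortWithinPartitions(col("key"))                // sort inside each partition, no extra shuffle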

Spark job profiler results showing high TCP cpu time

2017-06-23 Thread Reth RM
Running a Spark job on a local machine, and profiler results indicate that the highest time is spent in *sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.* A screenshot of the profiler result can be seen here: https://jpst.it/10i-V The Spark job (program) is performing IO (the sc.wholeTextFiles method of the Spark APIs),

Re: Spark job profiler results showing high TCP cpu time

2017-06-23 Thread Eduardo Mello
What program do you use to profile Spark? On Fri, Jun 23, 2017 at 3:07 PM, Marcelo Vanzin wrote: > That thread looks like the connection between the Spark process and > jvisualvm. It's expected to show high up when doing sampling if the > app is not doing much else. > > On

Re: Spark job profiler results showing high TCP cpu time

2017-06-23 Thread Marcelo Vanzin
That thread looks like the connection between the Spark process and jvisualvm. It's expected to show high up when doing sampling if the app is not doing much else. On Fri, Jun 23, 2017 at 10:46 AM, Reth RM wrote: > Running a spark job on local machine and profiler results

Re: gfortran runtime library for Spark

2017-06-23 Thread Yanbo Liang
The gfortran runtime library is still required for better performance with Spark 2.1. If it's not present on your nodes, you will see a warning message and a pure JVM implementation will be used instead, but you will not get the best performance. Thanks Yanbo On Wed, Jun 21, 2017 at 5:30 PM, Saroj C

Re: spark higher order functions

2017-06-23 Thread Yanbo Liang
See reply here: http://apache-spark-developers-list.1001551.n3.nabble.com/Will-higher-order-functions-in-spark-SQL-be-pushed-upstream-td21703.html On Tue, Jun 20, 2017 at 10:02 PM, AssafMendelson wrote: > Hi, > > I have seen that databricks have higher order functions

Re: RowMatrix: tallSkinnyQR

2017-06-23 Thread Yanbo Liang
Since this function is used to compute the QR decomposition of a RowMatrix with a tall and skinny shape, the output R is always a small (numCols x numCols) matrix. On Fri, Jun 9, 2017 at 10:33 PM, Arun wrote: > hi > > *def tallSkinnyQR(computeQ: Boolean = false):
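
[Editorial note] A minimal Scala sketch (assuming an existing SparkContext sc and spark-mllib on the classpath) showing that R comes back as a small numCols x numCols local matrix while Q stays distributed:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Tall-and-skinny matrix: many rows, few columns.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0),
      Vectors.dense(3.0, 4.0),
      Vectors.dense(5.0, 6.0),
      Vectors.dense(7.0, 8.0)
    ))
    val mat = new RowMatrix(rows)

    val qr = mat.tallSkinnyQR(computeQ = true)
    println(qr.R)                       // 2 x 2 upper-triangular local matrix
    qr.Q.rows.take(2).foreach(println)  // Q is a distributed RowMatrix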

Re: Help in Parsing 'Categorical' type of data

2017-06-23 Thread Yanbo Liang
Please consider using other classification models such as logistic regression or GBT. Naive Bayes usually treats features as counts, which makes it unsuitable for features generated by a one-hot encoder. Thanks Yanbo On Wed, May 31, 2017 at 3:58 PM, Amlan Jyoti wrote:
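
[Editorial note] A hedged sketch of that suggestion (the column names "category" and "label" and the DataFrame trainingDf are assumptions, not from the original message): index and one-hot encode the categorical feature, then fit logistic regression instead of Naive Bayes.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

    val indexer   = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
    val encoder   = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
    val assembler = new VectorAssembler().setInputCols(Array("categoryVec")).setOutputCol("features")
    val lr        = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

    val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
    val model    = pipeline.fit(trainingDf)   // trainingDf: DataFrame with "category" and "label" columns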

Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Mu Kong
Thanks for your prompt responses! @Steve I actually put my keytabs on all the nodes already, and I used them to kinit on each server. But how can I make Spark use my keytab and principal when I start the cluster or submit the job? Or is there a way to let Spark use the ticket cache on each node? I

Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Mu Kong
Hi, all! I was trying to read from a Kerberized Hadoop cluster from a standalone Spark cluster. Right now, I am encountering some authentication issues with Kerberos: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client

Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Saisai Shao
Spark running with the standalone cluster manager currently doesn't support accessing secure Hadoop. Basically the problem is that standalone mode Spark doesn't have the facility to distribute delegation tokens. Currently only Spark on YARN or local mode supports secure Hadoop. Thanks Jerry On

Re: Using YARN w/o HDFS

2017-06-23 Thread Steve Loughran
You'll need a filesystem with: consistency; accessibility everywhere; support for a binding through one of the Hadoop FS connectors. NFS-style distributed filesystems work with file://; things like GlusterFS need their own connectors. You can use Azure's wasb:// as a drop-in replacement for
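
[Editorial note] A hedged illustration of the wasb:// route (the account, container, and key below are placeholders, and hadoop-azure plus the Azure storage SDK must be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("wasb-example")
      // spark.hadoop.* entries are copied into the Hadoop Configuration.
      .config("spark.hadoop.fs.azure.account.key.myaccount.blob.core.windows.net", "<access-key>")
      .getOrCreate()

    // Read directly from Azure Blob Storage instead of HDFS.
    val lines = spark.read.textFile("wasb://mycontainer@myaccount.blob.core.windows.net/data/input.txt")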

Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Steve Loughran
On 23 Jun 2017, at 10:22, Saisai Shao wrote: Spark running with the standalone cluster manager currently doesn't support accessing secure Hadoop. Basically the problem is that standalone mode Spark doesn't have the facility to distribute

Container exited with a non-zero exit code 1

2017-06-23 Thread Link Qian
Hello, I submit a Spark job to a YARN cluster with the spark-submit command. The environment is CDH 5.4 with Spark 1.3.0, which has 6 compute nodes with 64G memory per node. YARN sets a maximum of 16G of memory for every container. The job requests 6 executors of 8G memory each, and 8G for the driver.

Spark Memory Optimization

2017-06-23 Thread Tw UxTLi51Nus
Hi, I have a Spark SQL DataFrame (read from Parquet) with some 20 columns. The data is divided into chunks of about 50 million rows each. Among the columns is a "GROUP_ID", which is basically a string of 32 hexadecimal characters. Following the guide [0] I thought to improve on
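
[Editorial note] One hedged direction for such a column (an assumption about where this message was going, not its actual content): pack the 32-character hexadecimal GROUP_ID into two 64-bit longs to avoid per-row String overhead.

    import java.lang.Long.parseUnsignedLong
    import org.apache.spark.sql.functions.{col, udf}

    // Splits a 32-char hex string into two unsigned 64-bit halves (a struct of two longs).
    val hexToLongs = udf { (hex: String) =>
      (parseUnsignedLong(hex.substring(0, 16), 16),
       parseUnsignedLong(hex.substring(16, 32), 16))
    }

    val compacted = df
      .withColumn("group_id_packed", hexToLongs(col("GROUP_ID")))
      .drop("GROUP_ID")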

OutOfMemoryError

2017-06-23 Thread Tw UxTLi51Nus
Hi, I have a dataset with ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If this is found, the entire column is set to missing (null), for that group.
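
[Editorial note] A hedged sketch of that check (the column names "groupId" and "col1" and the 50% threshold are assumptions): compute each group's null fraction for the column, then null the column out for rows in groups over the threshold.

    import org.apache.spark.sql.functions._

    val threshold = 0.5

    // Fraction of null values of "col1" per group.
    val nullFrac = df
      .groupBy("groupId")
      .agg((sum(when(col("col1").isNull, 1).otherwise(0)) / count(lit(1))).as("nullFrac"))

    // Null out "col1" for every row whose group exceeds the threshold.
    val result = df
      .join(nullFrac, "groupId")
      .withColumn("col1", when(col("nullFrac") > threshold, lit(null)).otherwise(col("col1")))
      .drop("nullFrac")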

How does HashPartitioner distribute data in Spark?

2017-06-23 Thread Vikash Pareek
I am trying to understand how Spark partitioning works. To understand this, I have the following piece of code on Spark 1.6: def countByPartition1(rdd: RDD[(String, Int)]) = { rdd.mapPartitions(iter => Iterator(iter.length)) } def countByPartition2(rdd: RDD[String]) = {
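
[Editorial note] A self-contained sketch of this kind of experiment (assuming an existing SparkContext sc): partition key/value pairs with a HashPartitioner and count how many records land in each partition.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Number of records in each partition of the RDD.
    def countByPartition(rdd: RDD[(String, Int)]): Array[Int] =
      rdd.mapPartitions(iter => Iterator(iter.length)).collect()

    val data = sc.parallelize(Seq(("aa", 1), ("aa", 2), ("bb", 1), ("cc", 1), ("dd", 1)), 2)
    val partitioned = data.partitionBy(new HashPartitioner(4))
    println(countByPartition(partitioned).mkString(", "))   // per-partition record counts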

Re: Number Of Partitions in RDD

2017-06-23 Thread Vikash Pareek
Local mode - __Vikash Pareek

HDP 2.5 - Python - Spark-On-Hbase

2017-06-23 Thread ayan guha
Hi, Is it possible to use SHC from Hortonworks with pyspark? If so, is any working code sample available? Also, I faced an issue while running the samples with Spark 2.0: "Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging" Any workaround? Thanks in advance -- Best