Re: HDP 2.5 - Python - Spark-On-Hbase

2017-06-23 Thread Weiqing Yang
Yes. Which SHC version are you using? If you hit any issues, you can post them in the SHC GitHub issues. There are some threads about this. On Fri, Jun 23, 2017 at 5:46 AM, ayan guha wrote: > Hi > > Is it possible to use SHC from Hortonworks with pyspark? If so, any > working

Re: How does HashPartitioner distribute data in Spark?

2017-06-23 Thread Vadim Semenov
This is the code that chooses the partition for a key: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L85-L88 It's basically `math.abs(key.hashCode % numberOfPartitions)`. On Fri, Jun 23, 2017 at 3:42 AM, Vikash Pareek <
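
[Editorial note] For illustration, a minimal Scala sketch (hypothetical keys and partition count) of how HashPartitioner assigns keys; the actual implementation uses Utils.nonNegativeMod(key.hashCode, numPartitions), which handles negative hash codes slightly differently from math.abs.

    import org.apache.spark.HashPartitioner

    // Hypothetical keys and partition count, purely for illustration.
    val partitioner = new HashPartitioner(4)
    val keys = Seq("apple", "banana", "cherry", "date", "elderberry")

    keys.foreach { k =>
      // getPartition applies nonNegativeMod(key.hashCode, numPartitions) internally.
      println(s"key=$k hashCode=${k.hashCode} partition=${partitioner.getPartition(k)}")
    }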

Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-23 Thread Anton Kravchenko
OK, this one is doing what I want: SparkConf conf = new SparkConf() .set("spark.sql.warehouse.dir", "hdfs://localhost:9000/user/hive/warehouse") .setMaster("local[*]") .setAppName("TestApp"); JavaSparkContext sc = new JavaSparkContext(conf); SparkSession session =

Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-23 Thread Keith Chapman
Hi, I have code that does the following using RDDs, val outputPartitionCount = 300 val part = new MyOwnPartitioner(outputPartitionCount) val finalRdd = myRdd.repartitionAndSortWithinPartitions(part) where myRdd is correctly formed as key, value pairs. I am looking to convert this to use
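
[Editorial note] An approximate Dataset/DataFrame sketch (assuming a DataFrame myDf with a "key" column and the same 300 output partitions; custom partitioners are not supported on Datasets, so hash partitioning on the column stands in for MyOwnPartitioner):

    import org.apache.spark.sql.functions.col

    val outputPartitionCount = 300
    val finalDf = myDf
      .repartition(outputPartitionCount, col("key"))   // shuffle rows by hash of "key"
      .sortWithinPartitions(col("key"))                // sort inside each partition, no extra shuffle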

Spark job profiler results showing high TCP cpu time

2017-06-23 Thread Reth RM
Running a Spark job on a local machine, and profiler results indicate that the highest time is spent in *sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.* A screenshot of the profiler result can be seen here: https://jpst.it/10i-V The Spark job (program) is performing IO (the sc.wholeTextFiles method of the Spark APIs),

Re: Spark job profiler results showing high TCP cpu time

2017-06-23 Thread Eduardo Mello
What program do you use to profile Spark? On Fri, Jun 23, 2017 at 3:07 PM, Marcelo Vanzin wrote: > That thread looks like the connection between the Spark process and > jvisualvm. It's expected to show high up when doing sampling if the > app is not doing much else. > > On

Re: Spark job profiler results showing high TCP cpu time

2017-06-23 Thread Marcelo Vanzin
That thread looks like the connection between the Spark process and jvisualvm. It's expected to show high up when doing sampling if the app is not doing much else. On Fri, Jun 23, 2017 at 10:46 AM, Reth RM wrote: > Running a spark job on local machine and profiler results

Re: gfortran runtime library for Spark

2017-06-23 Thread Yanbo Liang
The gfortran runtime library is still required for better performance with Spark 2.1. If it's not present on your nodes, you will see a warning message and a pure JVM implementation will be used instead, but you will not get the best performance. Thanks Yanbo On Wed, Jun 21, 2017 at 5:30 PM, Saroj C

Re: spark higher order functions

2017-06-23 Thread Yanbo Liang
See reply here: http://apache-spark-developers-list.1001551.n3.nabble.com/Will-higher-order-functions-in-spark-SQL-be-pushed-upstream-td21703.html On Tue, Jun 20, 2017 at 10:02 PM, AssafMendelson wrote: > Hi, > > I have seen that databricks have higher order functions

Re: RowMatrix: tallSkinnyQR

2017-06-23 Thread Yanbo Liang
Since this function is used to compute the QR decomposition of a RowMatrix with a tall and skinny shape, the output R is always a small (numCols x numCols) matrix. On Fri, Jun 9, 2017 at 10:33 PM, Arun wrote: > hi > > *def tallSkinnyQR(computeQ: Boolean = false):
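
[Editorial note] A minimal Scala sketch (assuming an existing SparkContext sc and spark-mllib on the classpath) showing that R comes back as a small numCols x numCols local matrix while Q stays distributed:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Tall-and-skinny matrix: many rows, few columns.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0),
      Vectors.dense(3.0, 4.0),
      Vectors.dense(5.0, 6.0),
      Vectors.dense(7.0, 8.0)
    ))
    val mat = new RowMatrix(rows)

    val qr = mat.tallSkinnyQR(computeQ = true)
    println(qr.R)                       // 2 x 2 upper-triangular local matrix
    qr.Q.rows.take(2).foreach(println)  // Q is a distributed RowMatrix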

Re: Help in Parsing 'Categorical' type of data

2017-06-23 Thread Yanbo Liang
Please consider using other classification models such as logistic regression or GBT. Naive Bayes usually treats features as counts, which makes it unsuitable for features generated by a one-hot encoder. Thanks Yanbo On Wed, May 31, 2017 at 3:58 PM, Amlan Jyoti wrote:
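
[Editorial note] A hedged sketch of that suggestion (the column names "category" and "label" and the DataFrame trainingDf are assumptions, not from the original message): index and one-hot encode the categorical feature, then fit logistic regression instead of Naive Bayes.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

    val indexer   = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
    val encoder   = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
    val assembler = new VectorAssembler().setInputCols(Array("categoryVec")).setOutputCol("features")
    val lr        = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

    val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
    val model    = pipeline.fit(trainingDf)   // trainingDf: DataFrame with "category" and "label" columns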

Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Mu Kong
Thanks for your prompt responses! @Steve I actually put my keytabs on all the nodes already, and I used them to kinit on each server. But how can I make Spark use my keytab and principal when I start the cluster or submit the job? Or is there a way to let Spark use the ticket cache on each node? I

Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Mu Kong
Hi, all! I was trying to read from a Kerberized Hadoop cluster from a standalone Spark cluster. Right now, I am encountering some authentication issues with Kerberos: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client

Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Saisai Shao
Spark running with the standalone cluster manager currently doesn't support accessing secure Hadoop. Basically the problem is that standalone mode Spark doesn't have the facility to distribute delegation tokens. Currently only Spark on YARN or local mode supports secure Hadoop. Thanks Jerry On

Re: Using YARN w/o HDFS

2017-06-23 Thread Steve Loughran
You'll need a filesystem with: consistency; accessibility everywhere; support for a binding through one of the Hadoop FS connectors. NFS-style distributed filesystems work with file://; things like GlusterFS need their own connectors. You can use Azure's wasb:// as a drop-in replacement for
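
[Editorial note] A hedged illustration of the wasb:// route (the account, container, and key below are placeholders, and hadoop-azure plus the Azure storage SDK must be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("wasb-example")
      // spark.hadoop.* entries are copied into the Hadoop Configuration.
      .config("spark.hadoop.fs.azure.account.key.myaccount.blob.core.windows.net", "<access-key>")
      .getOrCreate()

    // Read directly from Azure Blob Storage instead of HDFS.
    val lines = spark.read.textFile("wasb://mycontainer@myaccount.blob.core.windows.net/data/input.txt")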

Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Steve Loughran
On 23 Jun 2017, at 10:22, Saisai Shao wrote: Spark running with the standalone cluster manager currently doesn't support accessing secure Hadoop. Basically the problem is that standalone mode Spark doesn't have the facility to distribute

Container exited with a non-zero exit code 1

2017-06-23 Thread Link Qian
Hello, I submit a Spark job to a YARN cluster with the spark-submit command. The environment is CDH 5.4 with Spark 1.3.0, which has 6 compute nodes with 64G memory per node. YARN sets a maximum of 16G of memory for every container. The job requests 6 executors of 8G memory each, and 8G for the driver.

Spark Memory Optimization

2017-06-23 Thread Tw UxTLi51Nus
Hi, I have a Spark SQL DataFrame (read from Parquet) with some 20 columns. The data is divided into chunks of about 50 million rows each. Among the columns is a "GROUP_ID", which is basically a string of 32 hexadecimal characters. Following the guide [0] I thought to improve on
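
[Editorial note] One hedged direction for such a column (an assumption about where this message was going, not its actual content): pack the 32-character hexadecimal GROUP_ID into two 64-bit longs to avoid per-row String overhead.

    import java.lang.Long.parseUnsignedLong
    import org.apache.spark.sql.functions.{col, udf}

    // Splits a 32-char hex string into two unsigned 64-bit halves (a struct of two longs).
    val hexToLongs = udf { (hex: String) =>
      (parseUnsignedLong(hex.substring(0, 16), 16),
       parseUnsignedLong(hex.substring(16, 32), 16))
    }

    val compacted = df
      .withColumn("group_id_packed", hexToLongs(col("GROUP_ID")))
      .drop("GROUP_ID")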

OutOfMemoryError

2017-06-23 Thread Tw UxTLi51Nus
Hi, I have a dataset with ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If this is found, the entire column is set to missing (null), for that group.
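
[Editorial note] A hedged sketch of that check (the column names "groupId" and "col1" and the 50% threshold are assumptions): compute each group's null fraction for the column, then null the column out for rows in groups over the threshold.

    import org.apache.spark.sql.functions._

    val threshold = 0.5

    // Fraction of null values of "col1" per group.
    val nullFrac = df
      .groupBy("groupId")
      .agg((sum(when(col("col1").isNull, 1).otherwise(0)) / count(lit(1))).as("nullFrac"))

    // Null out "col1" for every row whose group exceeds the threshold.
    val result = df
      .join(nullFrac, "groupId")
      .withColumn("col1", when(col("nullFrac") > threshold, lit(null)).otherwise(col("col1")))
      .drop("nullFrac")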

How does HashPartitioner distribute data in Spark?

2017-06-23 Thread Vikash Pareek
I am trying to understand how Spark partitioning works. To understand this, I have the following piece of code on Spark 1.6: def countByPartition1(rdd: RDD[(String, Int)]) = { rdd.mapPartitions(iter => Iterator(iter.length)) } def countByPartition2(rdd: RDD[String]) = {
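
[Editorial note] A self-contained sketch of this kind of experiment (assuming an existing SparkContext sc): partition key/value pairs with a HashPartitioner and count how many records land in each partition.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Number of records in each partition of the RDD.
    def countByPartition(rdd: RDD[(String, Int)]): Array[Int] =
      rdd.mapPartitions(iter => Iterator(iter.length)).collect()

    val data = sc.parallelize(Seq(("aa", 1), ("aa", 2), ("bb", 1), ("cc", 1), ("dd", 1)), 2)
    val partitioned = data.partitionBy(new HashPartitioner(4))
    println(countByPartition(partitioned).mkString(", "))   // per-partition record counts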

Re: Number Of Partitions in RDD

2017-06-23 Thread Vikash Pareek
Local mode - __Vikash Pareek

HDP 2.5 - Python - Spark-On-Hbase

2017-06-23 Thread ayan guha
Hi, Is it possible to use SHC from Hortonworks with pyspark? If so, is any working code sample available? Also, I faced an issue while running the samples with Spark 2.0: "Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging" Any workaround? Thanks in advance -- Best