Yes.
What SHC version were you using?
If you hit any issues, you can post them in the SHC GitHub issues; there are
already some threads about this.
On Fri, Jun 23, 2017 at 5:46 AM, ayan guha wrote:
> Hi
>
> Is it possible to use SHC from Hortonworks with pyspark? If so, any
> working
This is the code that chooses the partition for a key:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L85-L88
it's basically `math.abs(key.hashCode % numberOfPartitions)`
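For illustration, a rough sketch of that logic (the key and partition count are
placeholder values; the actual code uses a non-negative modulo rather than
math.abs, so negative hashCodes still map to a valid partition):

val numberOfPartitions = 300          // placeholder partition count
val key = "some-key"                  // placeholder key
val rawMod = key.hashCode % numberOfPartitions
val partition = rawMod + (if (rawMod < 0) numberOfPartitions else 0)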
On Fri, Jun 23, 2017 at 3:42 AM, Vikash Pareek <
ok, this one is doing what I want
SparkConf conf = new SparkConf()
        .set("spark.sql.warehouse.dir", "hdfs://localhost:9000/user/hive/warehouse")
        .setMaster("local[*]")
        .setAppName("TestApp");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession session =
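For comparison, a rough Scala sketch of the same configuration done through
SparkSession.builder (the warehouse path is taken from the snippet above):

import org.apache.spark.sql.SparkSession

val session = SparkSession.builder()
  .master("local[*]")
  .appName("TestApp")
  .config("spark.sql.warehouse.dir", "hdfs://localhost:9000/user/hive/warehouse")
  .getOrCreate()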
Hi,
I have code that does the following using RDDs,
val outputPartitionCount = 300
val part = new MyOwnPartitioner(outputPartitionCount)
val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
where myRdd is correctly formed as key-value pairs. I am looking to convert
this to use
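For what it's worth, a rough Dataset-side sketch of the same pattern: Datasets
do not accept a custom Partitioner, so hash-partitioning on an assumed key
column is the closest built-in equivalent (myDf and the column name "key" are
placeholders):

import org.apache.spark.sql.functions.col

val outputPartitionCount = 300
val finalDf = myDf
  .repartition(outputPartitionCount, col("key"))   // hash-partition by the key column
  .sortWithinPartitions(col("key"))                // sort rows within each partition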
I am running a Spark job on a local machine, and the profiler results indicate
that the most time is spent in
*sun.rmi.transport.tcp.TCPTransport$ConnectionHandler*. A screenshot of the
profiler result can be seen here: https://jpst.it/10i-V
The Spark job (program) is performing IO (the sc.wholeTextFile method of the
Spark API),
What program do you use to profile Spark?
On Fri, Jun 23, 2017 at 3:07 PM, Marcelo Vanzin wrote:
> That thread looks like the connection between the Spark process and
> jvisualvm. It's expected to show high up when doing sampling if the
> app is not doing much else.
>
> On
That thread looks like the connection between the Spark process and
jvisualvm. It's expected to show high up when doing sampling if the
app is not doing much else.
On Fri, Jun 23, 2017 at 10:46 AM, Reth RM wrote:
> Running a spark job on local machine and profiler results
The gfortran runtime library is still required for Spark 2.1 to get the best
performance.
If it's not present on your nodes, you will see a warning message and a
pure JVM implementation will be used instead, but you will not get the best
performance.
Thanks
Yanbo
On Wed, Jun 21, 2017 at 5:30 PM, Saroj C
See reply here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Will-higher-order-functions-in-spark-SQL-be-pushed-upstream-td21703.html
On Tue, Jun 20, 2017 at 10:02 PM, AssafMendelson
wrote:
> Hi,
>
> I have seen that databricks have higher order functions
Since this function computes the QR decomposition of a RowMatrix with a tall
and skinny shape, the output R always has small rank (it is a local numCols x
numCols matrix).
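A minimal usage sketch (the input rows are made up, just to show the shapes
involved):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Tall and skinny: many rows, few columns.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0),
  Vectors.dense(5.0, 6.0)))
val mat = new RowMatrix(rows)

val qr = mat.tallSkinnyQR(computeQ = true)
val r = qr.R   // local numCols x numCols upper-triangular matrix
val q = qr.Q   // distributed RowMatrix, only materialized because computeQ = true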
On Fri, Jun 9, 2017 at 10:33 PM, Arun wrote:
> hi
>
> *def tallSkinnyQR(computeQ: Boolean = false):
Please consider using other classification models such as logistic
regression or GBT. Naive Bayes usually treats features as counts, which is
not suitable for features generated by a one-hot encoder.
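As a rough sketch, assuming a single categorical input column (all column
names here are placeholders, and trainingDf is assumed to exist):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

val indexer   = new StringIndexer().setInputCol("category").setOutputCol("categoryIdx")
val encoder   = new OneHotEncoder().setInputCol("categoryIdx").setOutputCol("categoryVec")
val assembler = new VectorAssembler().setInputCols(Array("categoryVec")).setOutputCol("features")
val lr        = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

// Index -> one-hot encode -> assemble features -> fit logistic regression.
val model = new Pipeline()
  .setStages(Array(indexer, encoder, assembler, lr))
  .fit(trainingDf)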
Thanks
Yanbo
On Wed, May 31, 2017 at 3:58 PM, Amlan Jyoti wrote:
Thanks for your prompt responses!
@Steve
I actually put my keytabs on all the nodes already, and I used them to
kinit on each server.
But how can I make Spark use my keytab and principal when I start the
cluster or submit the job? Or is there a way to let Spark use the ticket
cache on each node?
I
Hi, all!
I was trying to read from a Kerberized Hadoop cluster from a standalone
Spark cluster.
Right now, I am running into some authentication issues with Kerberos:
java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client
Spark running with the standalone cluster manager currently doesn't support
accessing secure Hadoop. Basically the problem is that standalone-mode
Spark doesn't have the facility to distribute delegation tokens.
Currently only Spark on YARN or local mode supports secure Hadoop.
Thanks
Jerry
On
You'll need a filesystem with
* consistency
* accessibility everywhere
* a binding through one of the Hadoop FS connectors
NFS-style distributed filesystems work with file://; things like GlusterFS
need their own connectors.
You can use Azure's wasb:// as a drop-in replacement for
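As a rough sketch of the wasb:// route (the storage account, container and key
are placeholders, and the hadoop-azure connector and its Azure storage SDK
dependency must be on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("WasbExample")
  // Pass the storage account key through to the Hadoop configuration.
  .config("spark.hadoop.fs.azure.account.key.<storage-account>.blob.core.windows.net",
    "<account-key>")
  .getOrCreate()

// Read directly from Azure Blob Storage instead of HDFS.
val df = spark.read.text("wasb://<container>@<storage-account>.blob.core.windows.net/data/input.txt")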
On 23 Jun 2017, at 10:22, Saisai Shao wrote:
Spark running with the standalone cluster manager currently doesn't support
accessing secure Hadoop. Basically the problem is that standalone-mode Spark
doesn't have the facility to distribute
Hello,
I submit a Spark job to a YARN cluster with the spark-submit command. The
environment is CDH 5.4 with Spark 1.3.0, and has 6 compute nodes with 64G of
memory per node. YARN sets a 16G maximum of memory for every container. The
job requests 6 executors with 8G of memory each, and 8G for the driver.
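For reference, a submission along those lines would look roughly like this
(the class and jar names are placeholders):

spark-submit \
  --master yarn-cluster \
  --num-executors 6 \
  --executor-memory 8g \
  --driver-memory 8g \
  --class com.example.MyJob \
  my-job.jar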
Hi,
I have a Spark-SQL Dataframe (reading from parquet) with some 20
columns. The data is divided into chunks of about 50 million rows each.
Among the columns is a "GROUP_ID", which is basically a string of 32
hexadecimal characters.
Following the guide [0] I thought to improve on
Hi,
I have a dataset with ~5M rows x 20 columns, containing a groupID and a
rowID. My goal is to check whether (some) columns contain more than a
fixed fraction (say, 50%) of missing (null) values within a group. If
this is found, the entire column is set to missing (null), for that
group.
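One possible sketch using window functions (the column names "groupID" and
"someCol", the df variable, and the threshold are assumptions; the same
expression would be repeated for each column to check):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val threshold = 0.5
val w = Window.partitionBy("groupID")

// Fraction of nulls in someCol within each group.
val nullFraction =
  sum(when(col("someCol").isNull, 1).otherwise(0)).over(w) / count(lit(1)).over(w)

// Null out the whole column for groups exceeding the threshold.
val cleaned = df.withColumn(
  "someCol",
  when(nullFraction > threshold, lit(null)).otherwise(col("someCol")))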
I am trying to understand how Spark partitioning works.
To understand this I have the following piece of code on Spark 1.6:
def countByPartition1(rdd: RDD[(String, Int)]) = {
rdd.mapPartitions(iter => Iterator(iter.length))
}
def countByPartition2(rdd: RDD[String]) = {
Local mode
--
Vikash Pareek
Hi
Is it possible to use SHC from Hortonworks with pyspark? If so, any working
code sample available?
Also, I faced an issue while running the samples with Spark 2.0
"Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging"
Any workaround?
Thanks in advance
--
Best