Re: Help in Parsing 'Categorical' type of data

2017-06-23 Thread Yanbo Liang
Please consider using other classification models such as logistic
regression or GBT. Naive Bayes usually treats features as counts, which
makes it a poor fit for features generated by a one-hot encoder.
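
For what it's worth, a minimal Scala sketch of the kind of pipeline this
suggests, with LogisticRegression in place of NaiveBayes. Only the Gender
field comes from the original post; the other column names ("age", "label")
and the trainingDf DataFrame are placeholders:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Index the categorical Gender column, then one-hot encode it.
val genderIndexer = new StringIndexer()
  .setInputCol("gender")
  .setOutputCol("genderIndex")
val genderEncoder = new OneHotEncoder()
  .setInputCol("genderIndex")
  .setOutputCol("genderVec")

// Assemble the encoded gender with any numeric columns into one feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("genderVec", "age"))
  .setOutputCol("features")

// Logistic regression copes fine with one-hot encoded (0/1) features.
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val pipeline = new Pipeline()
  .setStages(Array(genderIndexer, genderEncoder, assembler, lr))
val model = pipeline.fit(trainingDf) // trainingDf: a DataFrame with the raw columns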

Thanks
Yanbo

On Wed, May 31, 2017 at 3:58 PM, Amlan Jyoti  wrote:

> Hi,
>
> I am trying to run Naive Bayes Model using Spark ML libraries, in Java.
> The sample snippet of dataset is given below:
>
> *Raw Data* -
>
>
> But, as the input data needs to be numeric, I am using a
> *one-hot-encoder* on the Gender field [m->0,1][f->1,0]; finally,
> the 'features' vector is fed to the model, and I could get the output.
>
> *Transformed Data* -
>
>
> But the model *results are not correct* as the 'Gender' field [originally
> categorical] is now treated as a continuous field after the one-hot encoding
> transformation.
>
> *Expectation* is that for 'continuous data', mean and variance are
> calculated; and for 'categorical data', the number of occurrences of the
> different categories. [In my case, mean and variance are calculated even
> for the Gender field].
>
> So, is there any way by which I can indicate to the model that a
> particular data field is 'categorical' by nature?
>
> Thanks
>
> Best Regards
> Amlan Jyoti
>
>
>


Re: RowMatrix: tallSkinnyQR

2017-06-23 Thread Yanbo Liang
Since this function computes the QR decomposition of a RowMatrix with a
tall-and-skinny shape, the output R is an n x n matrix, where n is the
(small) number of columns, so it easily fits on the driver as a local Matrix.
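
For illustration, a small Scala sketch (assuming an existing SparkContext sc)
of why R stays local: it is an n x n matrix where n is the column count of
the tall-and-skinny input, so it is tiny compared to Q:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A tall-and-skinny matrix: many rows, only two columns.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0),
  Vectors.dense(5.0, 6.0)
))
val mat = new RowMatrix(rows)

val qr = mat.tallSkinnyQR(computeQ = true)
println(qr.R)                        // 2 x 2 upper-triangular local Matrix
qr.Q.rows.take(3).foreach(println)   // Q remains a distributed RowMatrix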

On Fri, Jun 9, 2017 at 10:33 PM, Arun  wrote:

> hi
>
> *def  tallSkinnyQR(computeQ: Boolean = false): QRDecomposition[RowMatrix,
> Matrix]*
>
> *In the output of this method, Q is a distributed matrix*
> *and R is a local Matrix.*
>
> *What's the reason R is a local Matrix?*
>
>
> -Arun
>


Re: spark higher order functions

2017-06-23 Thread Yanbo Liang
See reply here:

http://apache-spark-developers-list.1001551.n3.nabble.com/Will-higher-order-functions-in-spark-SQL-be-pushed-upstream-td21703.html
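
Not part of the linked reply, but for context: in vanilla Spark of this era,
array columns are usually handled either with a UDF over Seq or by exploding
and re-aggregating. A hedged Scala sketch, assuming a SparkSession named
spark and a hypothetical array column "xs":

import org.apache.spark.sql.functions.{col, collect_list, explode, udf}

val df = spark.createDataFrame(Seq(
  (1, Seq(1, 2, 3)),
  (2, Seq(10, 20))
)).toDF("id", "xs")

// Rough equivalent of a higher-order transform(xs, x -> x + 1), via a plain UDF.
val addOne = udf((xs: Seq[Int]) => xs.map(_ + 1))
df.withColumn("xsPlusOne", addOne(col("xs"))).show()

// Alternative: flatten with explode, operate row-wise, then re-aggregate.
df.select(col("id"), explode(col("xs")).as("x"))
  .groupBy("id")
  .agg(collect_list(col("x") + 1).as("xsPlusOne"))
  .show()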

On Tue, Jun 20, 2017 at 10:02 PM, AssafMendelson 
wrote:

> Hi,
>
> I have seen that databricks have higher order functions (
> https://docs.databricks.com/_static/notebooks/higher-order-functions.html,
> https://databricks.com/blog/2017/05/24/working-with-
> nested-data-using-higher-order-functions-in-sql-on-databricks.html) which
> basically allow generic operations on arrays (and I imagine on maps
> too).
>
> I was wondering if there is an equivalent on vanilla spark.
>
>
>
>
>
> Thanks,
>
>   Assaf.
>
>
>
>


Re: gfortran runtime library for Spark

2017-06-23 Thread Yanbo Liang
The gfortran runtime library is still required for Spark 2.1 to get the best
performance. If it is not present on your nodes, you will see a warning
message and a pure-JVM implementation will be used instead, but you will not
get the best performance.
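
A quick way (not from the thread, and assuming the netlib-java classes that
ship with Spark's MLlib are on the classpath) to check which BLAS backend was
actually loaded, e.g. from spark-shell on a worker node:

import com.github.fommil.netlib.BLAS

// Prints the pure-JVM fallback (F2jBLAS) if the native, gfortran-backed
// libraries could not be loaded, or a Native*BLAS class if they could.
println(BLAS.getInstance().getClass.getName)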

Thanks
Yanbo

On Wed, Jun 21, 2017 at 5:30 PM, Saroj C  wrote:

> Dear All,
>  Can you please let me know if the gfortran runtime library is still
> required for Spark 2.1 for better performance? Note that I am using the
> Java APIs for Spark.
>
> Thanks & Regards
> Saroj
>
>
>


Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-23 Thread Keith Chapman
Hi,

I have code that does the following using RDDs,

val outputPartitionCount = 300
val part = new MyOwnPartitioner(outputPartitionCount)
val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)

where myRdd is correctly formed as key-value pairs. I am looking to convert
this to use Dataset/Dataframe instead of RDDs, so my questions are:

Is there custom partitioning of Dataset/Dataframe implemented in Spark?
Can I accomplish the partial sort using mapPartitions on the resulting
partitioned Dataset/Dataframe?

Any thoughts?

Regards,
Keith.

http://keith-chapman.com
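
One commonly suggested approximation, offered here only as a hedged sketch:
Dataset.repartition by an expression followed by sortWithinPartitions. It
relies on Spark's built-in hash partitioning rather than a custom
Partitioner (which the Dataset API does not expose), so it is at best a
partial answer. The Dataset ds and the "key" column are placeholders:

import org.apache.spark.sql.functions.col

val outputPartitionCount = 300

// Hash-partition by "key" into a fixed number of partitions, then sort each
// partition locally; per-partition processing can follow via mapPartitions.
val partitionedSorted = ds
  .repartition(outputPartitionCount, col("key"))
  .sortWithinPartitions(col("key"))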


Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-23 Thread Anton Kravchenko
ok, this one is doing what I want

SparkConf conf = new SparkConf()
.set("spark.sql.warehouse.dir",
"hdfs://localhost:9000/user/hive/warehouse")
.setMaster("local[*]")
.setAppName("TestApp");

JavaSparkContext sc = new JavaSparkContext(conf);

SparkSession session = SparkSession
.builder()
.appName("TestApp").master("local[*]")
.getOrCreate();

Integer _bcv = 123;
Broadcast<Integer> bcv = sc.broadcast(_bcv);

WrapBCV.setBCV(bcv); // implemented in WrapBCV.java

df_sql.foreachPartition(new ProcessSinglePartition()); //implemented
in ProcessSinglePartition.java

Where:
ProcessSinglePartition.java

public class ProcessSinglePartition implements ForeachPartitionFunction<Row> {
    public void call(Iterator<Row> it) throws Exception {
        System.out.println(WrapBCV.getBCV());
    }
}

WrapBCV.java

public class WrapBCV {
    private static Broadcast<Integer> bcv;
    public static void setBCV(Broadcast<Integer> setbcv) { bcv = setbcv; }
    public static Integer getBCV() {
        return bcv.value();
    }
}


On Fri, Jun 16, 2017 at 3:35 AM, Ryan  wrote:

> I don't think Broadcast itself can be serialized. You can get the value
> out on the driver side and refer to it in foreach; then the value would be
> serialized with the lambda expression and sent to the workers.
>
> On Fri, Jun 16, 2017 at 2:29 AM, Anton Kravchenko <
> kravchenko.anto...@gmail.com> wrote:
>
>> How would one access a broadcast variable from within a
>> ForeachPartitionFunction in the Spark (2.0.1) Java API?
>>
>> Integer _bcv = 123;
>> Broadcast<Integer> bcv = spark.sparkContext().broadcast(_bcv);
>> Dataset<Row> df_sql = spark.sql("select * from atable");
>>
>> df_sql.foreachPartition(new ForeachPartitionFunction<Row>() {
>> public void call(Iterator<Row> t) throws Exception {
>> System.out.println(bcv.value());
>> }}
>> );
>>
>>
>


Re: Spark job profiler results showing high TCP cpu time

2017-06-23 Thread Eduardo Mello
What program do you use to profile Spark?

On Fri, Jun 23, 2017 at 3:07 PM, Marcelo Vanzin  wrote:

> That thread looks like the connection between the Spark process and
> jvisualvm. It's expected to show high up when doing sampling if the
> app is not doing much else.
>
> On Fri, Jun 23, 2017 at 10:46 AM, Reth RM  wrote:
> > Running a spark job on local machine and profiler results indicate that
> > highest time spent in sun.rmi.transport.tcp.TCPTransport$
> ConnectionHandler.
> > Screenshot of profiler result can be seen here : https://jpst.it/10i-V
> >
> > Spark job(program) is performing IO (sc.wholeTextFile method of spark
> apis),
> > Reads files from local file system and analyses the text to obtain
> tokens.
> >
> > Any thoughts and suggestions?
> >
> > Thanks.
> >
>
>
>
> --
> Marcelo
>
>
>


Re: Spark job profiler results showing high TCP cpu time

2017-06-23 Thread Marcelo Vanzin
That thread looks like the connection between the Spark process and
jvisualvm. It's expected to show high up when doing sampling if the
app is not doing much else.

On Fri, Jun 23, 2017 at 10:46 AM, Reth RM  wrote:
> Running a spark job on local machine and profiler results indicate that
> highest time spent in sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.
> Screenshot of profiler result can be seen here : https://jpst.it/10i-V
>
> Spark job(program) is performing IO (sc.wholeTextFile method of spark apis),
> Reads files from local file system and analyses the text to obtain tokens.
>
> Any thoughts and suggestions?
>
> Thanks.
>



-- 
Marcelo




Spark job profiler results showing high TCP cpu time

2017-06-23 Thread Reth RM
Running a Spark job on a local machine, the profiler results indicate that
the highest time is spent in
*sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.* A screenshot of the
profiler result can be seen here: https://jpst.it/10i-V

The Spark job (program) is performing IO (the sc.wholeTextFiles method of the
Spark APIs): it reads files from the local file system and analyses the text
to obtain tokens.

Any thoughts and suggestions?

Thanks.


Re: How does HashPartitioner distribute data in Spark?

2017-06-23 Thread Vadim Semenov
This is the code that chooses the partition for a key:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L85-L88

it's basically a non-negative `key.hashCode % numberOfPartitions`
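
Paraphrased as a sketch, the linked lines amount to roughly this
(HashPartitioner delegates to a non-negative modulo, so negative hashCodes
still map into [0, numPartitions)):

def getPartition(key: Any, numPartitions: Int): Int = key match {
  case null => 0
  case _ =>
    val rawMod = key.hashCode % numPartitions
    rawMod + (if (rawMod < 0) numPartitions else 0) // keep the result non-negative
}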

On Fri, Jun 23, 2017 at 3:42 AM, Vikash Pareek <
vikash.par...@infoobjects.com> wrote:

> I am trying to understand how Spark partitioning works.
>
> To understand this I have following piece of code on spark 1.6
>
> def countByPartition1(rdd: RDD[(String, Int)]) = {
> rdd.mapPartitions(iter => Iterator(iter.length))
> }
> def countByPartition2(rdd: RDD[String]) = {
> rdd.mapPartitions(iter => Iterator(iter.length))
> }
>
> //RDDs Creation
> val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa",
> 1)), 8)
> countByPartition1(rdd1).collect()
> >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
>
> val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
> countByPartition2(rdd2).collect()
> >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
>
> In both cases the data is distributed uniformly.
> I have the following questions on the basis of the above observation:
>
>  1. In the case of rdd1, hash partitioning should calculate the hashcode of
> the key (i.e. "aa" in this case), so shouldn't all records go to a single
> partition instead of being uniformly distributed?
>  2. In the case of rdd2, there is no key-value pair, so how is hash
> partitioning going to work, i.e. what is the key used to calculate the hashcode?
>
> I have followed @zero323's answer but am not getting the answer to these.
> https://stackoverflow.com/questions/31424396/how-does-hashpartitioner-work
>
>
>
>
> -
>
> __Vikash Pareek
>
>
>


Re: HDP 2.5 - Python - Spark-On-Hbase

2017-06-23 Thread Weiqing Yang
Yes.
What SHC version were you using?
If you hit any issues, you can post them in the SHC GitHub issues; there are
already some threads about this.

On Fri, Jun 23, 2017 at 5:46 AM, ayan guha  wrote:

> Hi
>
> Is it possible to use SHC from Hortonworks with pyspark? If so, any
> working code sample available?
>
> Also, I faced an issue while running the samples with Spark 2.0
>
> "Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging"
>
> Any workaround?
>
> Thanks in advance
>
> --
> Best Regards,
> Ayan Guha
>


HDP 2.5 - Python - Spark-On-Hbase

2017-06-23 Thread ayan guha
Hi

Is it possible to use SHC from Hortonworks with pyspark? If so, any working
code sample available?

Also, I faced an issue while running the samples with Spark 2.0

"Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging"

Any workaround?

Thanks in advance

-- 
Best Regards,
Ayan Guha


Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Mu Kong
Thanks for your prompt responses!

@Steve

I actually put my keytabs on all the nodes already, and I used them to
kinit on each server.

But how can I make Spark use my keytab and principal when I start the
cluster or submit the job? Or is there a way to let Spark use the ticket
cache on each node?

I tried --keytab and --principal when I submit the job and still get the
same error. I guess those options are for YARN only.

On Fri, Jun 23, 2017 at 18:50 Steve Loughran  wrote:

> On 23 Jun 2017, at 10:22, Saisai Shao  wrote:
>
> Spark running with standalone cluster manager currently doesn't support
> accessing security Hadoop. Basically the problem is that standalone mode
> Spark doesn't have the facility to distribute delegation tokens.
>
> Currently only Spark on YARN or local mode supports security Hadoop.
>
> Thanks
> Jerry
>
>
> There's possibly an ugly workaround where you ssh in to every node and log
> in directly to your KDC using a keytab you pushed out...that would eliminate
> the need for anything related to Hadoop tokens. After all, that's
> essentially what spark-on-yarn does when you give it a keytab.
>
>
> see also:
> https://www.gitbook.com/book/steveloughran/kerberos_and_hadoop/details
>
> On Fri, Jun 23, 2017 at 5:10 PM, Mu Kong  wrote:
>
>> Hi, all!
>>
>> I was trying to read from a Kerberosed hadoop cluster from a standalone
>> spark cluster.
>> Right now, I encountered some authentication issues with Kerberos:
>>
>>
>> java.io.IOException: Failed on local exception: java.io.IOException: 
>> org.apache.hadoop.security.AccessControlException: Client cannot 
>> authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: 
>> ""; destination host is: XXX;
>>
>>
>>
>> I checked with klist, and principle/realm is correct.
>> I also used hdfs command line to poke HDFS from all the nodes, and it
>> worked.
>> And if I submit job using local(client) mode, the job worked fine.
>>
>> I tried to put everything from hadoop/conf to spark/conf and hive/conf to
>> spark/conf.
>> Also tried edit spark/conf/spark-env.sh to add
>> SPARK_SUBMIT_OPTS/SPARK_MASTER_OPTS/SPARK_SLAVE_OPTS/HADOOP_CONF_DIR/HIVE_CONF_DIR,
>> and tried to export them in .bashrc as well.
>>
>> However, I'm still experiencing the same exception.
>>
>> Then I read some concerning posts about problems with
>> kerberosed hadoop, some post like the following one:
>> http://blog.stratio.com/spark-kerberos-safe-story/
>> , which indicates that we can not access to kerberosed hdfs using
>> standalone spark cluster.
>>
>> I'm using spark 2.1.1, is it still the case that we can't access
>> kerberosed hdfs with 2.1.1?
>>
>> Thanks!
>>
>>
>> Best regards,
>> Mu
>>
>>
>


Container exited with a non-zero exit code 1

2017-06-23 Thread Link Qian
Hello,


I submit a Spark job to a YARN cluster with the spark-submit command. The
environment is CDH 5.4 with Spark 1.3.0, and the cluster has 6 compute nodes
with 64G memory per node. YARN sets a 16G maximum of memory for every
container. The job requests 6 executors of 8G memory each, and 8G for the
driver. However, I always get the errors below after trying to submit the job
several times. Any help?


--- here are the error logs of Application Master for the job 
--


17/06/22 15:18:44 INFO yarn.YarnAllocator: Completed container 
container_1498115278902_0001_02_13 (state: COMPLETE, exit status: 1)
17/06/22 15:18:44 INFO yarn.YarnAllocator: Container marked as failed: 
container_1498115278902_0001_02_13. Exit status: 1. Diagnostics: Exception 
from container-launch.
Container id: container_1498115278902_0001_02_13
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1



  Here are the YARN application logs of the job.

LogLength:2611
Log Contents:
17/06/22 15:18:09 INFO executor.CoarseGrainedExecutorBackend: Registered signal 
handlers for [TERM, HUP, INT]
17/06/22 15:18:10 INFO spark.SecurityManager: Changing view acls to: yarn,root
17/06/22 15:18:10 INFO spark.SecurityManager: Changing modify acls to: yarn,root
17/06/22 15:18:10 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(yarn, root); users 
with modify permissions: Set(yarn, root)
17/06/22 15:18:10 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/06/22 15:18:10 INFO Remoting: Starting remoting
17/06/22 15:18:10 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://driverPropsFetcher@dn006:45701]
17/06/22 15:18:10 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://driverPropsFetcher@dn006:45701]
17/06/22 15:18:10 INFO util.Utils: Successfully started service 
'driverPropsFetcher' on port 45701.
17/06/22 15:18:40 WARN security.UserGroupInformation: 
PriviledgedActionException as:root (auth:SIMPLE) 
cause:java.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1684)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:139)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:235)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:155)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
... 4 more


 a snippet of RM log for the job -

2017-06-22 15:18:41,586 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released 
container container_1498115278902_0001_02_14 of capacity
  on host dn006:8041, which currently has 0 containers, 
 used and  available, release 
resources=true
2017-06-22 15:18:41,586 INFO 

Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Steve Loughran

On 23 Jun 2017, at 10:22, Saisai Shao 
> wrote:

Spark running with standalone cluster manager currently doesn't support 
accessing security Hadoop. Basically the problem is that standalone mode Spark 
doesn't have the facility to distribute delegation tokens.

Currently only Spark on YARN or local mode supports security Hadoop.

Thanks
Jerry


There's possibly an ugly workaround where you ssh in to every node and log in
directly to your KDC using a keytab you pushed out...that would eliminate the
need for anything related to Hadoop tokens. After all, that's essentially what
spark-on-yarn does when you give it a keytab.


see also:  
https://www.gitbook.com/book/steveloughran/kerberos_and_hadoop/details
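
For reference, the Hadoop-API form of that per-node login, as a minimal
sketch; the principal and keytab path are placeholders, and this would have
to run in each worker's JVM before HDFS is touched:

import org.apache.hadoop.security.UserGroupInformation

// Log in from a keytab that was copied to this node beforehand.
UserGroupInformation.loginUserFromKeytab(
  "spark/host.example.com@EXAMPLE.COM",
  "/etc/security/keytabs/spark.keytab")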

On Fri, Jun 23, 2017 at 5:10 PM, Mu Kong 
> wrote:
Hi, all!

I was trying to read from a Kerberosed hadoop cluster from a standalone spark 
cluster.
Right now, I encountered some authentication issues with Kerberos:



java.io.IOException: Failed on local exception: java.io.IOException: 
org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
via:[TOKEN, KERBEROS]; Host Details : local host is: ""; 
destination host is: XXX;


I checked with klist, and principle/realm is correct.
I also used hdfs command line to poke HDFS from all the nodes, and it worked.
And if I submit job using local(client) mode, the job worked fine.

I tried to put everything from hadoop/conf to spark/conf and hive/conf to 
spark/conf.
Also tried edit spark/conf/spark-env.sh to add 
SPARK_SUBMIT_OPTS/SPARK_MASTER_OPTS/SPARK_SLAVE_OPTS/HADOOP_CONF_DIR/HIVE_CONF_DIR,
 and tried to export them in .bashrc as well.

However, I'm still experiencing the same exception.

Then I read some concerning posts about problems with kerberosed hadoop, some 
post like the following one:
http://blog.stratio.com/spark-kerberos-safe-story/
, which indicates that we can not access to kerberosed hdfs using standalone 
spark cluster.

I'm using spark 2.1.1, is it still the case that we can't access kerberosed 
hdfs with 2.1.1?

Thanks!


Best regards,
Mu





Re: Using YARN w/o HDFS

2017-06-23 Thread Steve Loughran
you'll need a filesystem with

* consistency
* accessibility everywhere
* supports a binding through one of the hadoop fs connectors

NFS-style distributed filesystems work with file:// ; things like glusterfs 
need their own connectors.

You can use Azure's wasb:// as a drop-in replacement for HDFS in Azure. I think
Google Cloud Storage is similar, but I haven't played with it. Ask Google.

You cannot do the same for S3 except on EMR and Amazon's premium emrfs:// 
offering, which adds the consistency layer.
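
Illustrative only (account, container and key are placeholders, and the
hadoop-azure + azure-storage jars must be on the classpath): pointing a job's
Hadoop configuration at a wasb:// container instead of HDFS might look like
this:

sc.hadoopConfiguration.set(
  "fs.defaultFS",
  "wasb://mycontainer@myaccount.blob.core.windows.net")
sc.hadoopConfiguration.set(
  "fs.azure.account.key.myaccount.blob.core.windows.net",
  "<storage-account-key>")

// Read and write through the wasb:// connector as if it were HDFS.
val logs = sc.textFile("wasb://mycontainer@myaccount.blob.core.windows.net/data/logs")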



On 22 Jun 2017, at 00:50, Alaa Zubaidi (PDF) 
> wrote:

Hi,

Can we run Spark on YARN with out installing HDFS?
If yes, where would HADOOP_CONF_DIR point to?

Regards,




Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Saisai Shao
Spark running with standalone cluster manager currently doesn't support
accessing security Hadoop. Basically the problem is that standalone mode
Spark doesn't have the facility to distribute delegation tokens.

Currently only Spark on YARN or local mode supports security Hadoop.

Thanks
Jerry

On Fri, Jun 23, 2017 at 5:10 PM, Mu Kong  wrote:

> Hi, all!
>
> I was trying to read from a Kerberosed hadoop cluster from a standalone
> spark cluster.
> Right now, I encountered some authentication issues with Kerberos:
>
>
> java.io.IOException: Failed on local exception: java.io.IOException: 
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN, KERBEROS]; Host Details : local host is: ""; 
> destination host is: XXX;
>
>
>
> I checked with klist, and principle/realm is correct.
> I also used hdfs command line to poke HDFS from all the nodes, and it
> worked.
> And if I submit job using local(client) mode, the job worked fine.
>
> I tried to put everything from hadoop/conf to spark/conf and hive/conf to
> spark/conf.
> Also tried edit spark/conf/spark-env.sh to add SPARK_SUBMIT_OPTS/SPARK_
> MASTER_OPTS/SPARK_SLAVE_OPTS/HADOOP_CONF_DIR/HIVE_CONF_DIR, and tried to
> export them in .bashrc as well.
>
> However, I'm still experiencing the same exception.
>
> Then I read some concerning posts about problems with
> kerberosed hadoop, some post like the following one:
> http://blog.stratio.com/spark-kerberos-safe-story/
> , which indicates that we can not access to kerberosed hdfs using
> standalone spark cluster.
>
> I'm using spark 2.1.1, is it still the case that we can't access
> kerberosed hdfs with 2.1.1?
>
> Thanks!
>
>
> Best regards,
> Mu
>
>


Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Mu Kong
Hi, all!

I was trying to read from a Kerberosed hadoop cluster from a standalone
spark cluster.
Right now, I encountered some authentication issues with Kerberos:


java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN, KERBEROS]; Host Details : local host is:
""; destination host is: XXX;



I checked with klist, and the principal/realm is correct.
I also used hdfs command line to poke HDFS from all the nodes, and it
worked.
And if I submit job using local(client) mode, the job worked fine.

I tried to put everything from hadoop/conf to spark/conf and hive/conf to
spark/conf.
Also tried edit spark/conf/spark-env.sh to add
SPARK_SUBMIT_OPTS/SPARK_MASTER_OPTS/SPARK_SLAVE_OPTS/HADOOP_CONF_DIR/HIVE_CONF_DIR,
and tried to export them in .bashrc as well.

However, I'm still experiencing the same exception.

Then I read some concerning posts about problems with
kerberosed Hadoop, such as the following one:
http://blog.stratio.com/spark-kerberos-safe-story/,
which indicates that we cannot access kerberosed HDFS using a
standalone Spark cluster.

I'm using Spark 2.1.1; is it still the case that we can't access kerberosed
HDFS with 2.1.1?

Thanks!


Best regards,
Mu


Re: Number Of Partitions in RDD

2017-06-23 Thread Vikash Pareek
Local mode



-

__Vikash Pareek



How does HashPartitioner distribute data in Spark?

2017-06-23 Thread Vikash Pareek
I am trying to understand how Spark partitioning works.

To understand this I have following piece of code on spark 1.6

def countByPartition1(rdd: RDD[(String, Int)]) = {
rdd.mapPartitions(iter => Iterator(iter.length))
}
def countByPartition2(rdd: RDD[String]) = {
rdd.mapPartitions(iter => Iterator(iter.length))
}

//RDDs Creation
val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa",
1)), 8)
countByPartition1(rdd1).collect()
>> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)

val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
countByPartition2(rdd2).collect()
>> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
 
In both cases the data is distributed uniformly.
I have the following questions on the basis of the above observation:

 1. In the case of rdd1, hash partitioning should calculate the hashcode of
the key (i.e. "aa" in this case), so shouldn't all records go to a single
partition instead of being uniformly distributed?
 2. In the case of rdd2, there is no key-value pair, so how is hash
partitioning going to work, i.e. what is the key used to calculate the hashcode?

I have followed @zero323's answer but am not getting the answer to these.
https://stackoverflow.com/questions/31424396/how-does-hashpartitioner-work




-

__Vikash Pareek



Spark Memory Optimization

2017-06-23 Thread Tw UxTLi51Nus

Hi,

I have a Spark-SQL Dataframe (reading from parquet) with some 20 
columns. The data is divided into chunks of about 50 million rows each. 
Among the columns is a "GROUP_ID", which is basically a string of 32 
hexadecimal characters.


Following the guide [0] I thought I could improve performance and memory
consumption by converting this to some other format. I came up with two
possibilities:
- cut it in half and convert to "long" (losing some information, but that
is fine for "within-chunk-processing"); a sketch of this option follows
below

- convert to byte array
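
A minimal sketch of the first option (illustrative only, not the code from
the gist linked at [1]; the DataFrame df is a placeholder, and
parseUnsignedLong assumes Java 8):

import org.apache.spark.sql.functions.{col, udf}

// Keep the upper 64 bits (first 16 hex characters) of the 32-hex-char GROUP_ID.
val hexToLong = udf((id: String) =>
  java.lang.Long.parseUnsignedLong(id.substring(0, 16), 16))

val converted = df.withColumn("GROUP_ID_LONG", hexToLong(col("GROUP_ID")))
  .drop("GROUP_ID")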

Doing some math, without considering the "object overhead", this should
cut the memory usage of the "GROUP_ID" column in half (for the byte array)
or to a quarter (for the "long" version).
However, neither the Spark UI (Storage tab) nor "SizeEstimator.estimate()"
showed such an effect - memory consumption essentially stayed the same.


In case lineage (the dataframe's "history") mattered, I also tried to
write the new dataframe to parquet and read it back again - the result
was still the same.


Usually I use Python, but in order to use SizeEstimator I did this one 
in Java. My code - for converting to "long" - is on [1].


Any idea about what I am missing?

Thanks!



[0] 
http://spark.apache.org/docs/latest/tuning.html#tuning-data-structures
[1] 
https://gist.github.com/TwUxTLi51Nus/2aba16163bf01a4d6417bb65f6fe5b38


--
Tw UxTLi51Nus
Email: twuxtli51...@posteo.co




OutOfMemoryError

2017-06-23 Thread Tw UxTLi51Nus

Hi,

I have a dataset with ~5M rows x 20 columns, containing a groupID and a 
rowID. My goal is to check whether (some) columns contain more than a 
fixed fraction (say, 50%) of missing (null) values within a group. If 
this is found, the entire column is set to missing (null), for that 
group.
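
For reference, the per-group check can be expressed roughly like this (a
sketch only, not the actual code from the gist linked below; the DataFrame
df and the single column "c" are placeholders):

import org.apache.spark.sql.functions.{avg, col, when}

// Average of a 0/1 null indicator per group = fraction of missing values.
val nullFraction = df
  .groupBy("groupID")
  .agg(avg(when(col("c").isNull, 1.0).otherwise(0.0)).as("c_null_frac"))

// Groups in which column "c" should be blanked out entirely (threshold 50%).
val groupsToNull = nullFraction
  .filter(col("c_null_frac") > 0.5)
  .select("groupID")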


The Problem:
The loop runs like a charm during the first iterations, but towards the
end, around the 6th or 7th iteration, I see my CPU utilization dropping
(using 1 core instead of 6). Along with that, the execution time for one
iteration increases significantly. At some point, I get an
OutOfMemoryError:


* spark.driver.memory < 4G: at collect() (FAIL 1)
* 4G <= spark.driver.memory < 10G: at the count() step (FAIL 2)

Enabling a HeapDump on OOM (and analyzing it with Eclipse MAT) showed 
two classes taking up lots of memory:


* java.lang.Thread
  - char (2G)
  - scala.collection.IndexedSeqLike
  - scala.collection.mutable.WrappedArray (1G)
  - java.lang.String (1G)

* org.apache.spark.sql.execution.ui.SQLListener
  - org.apache.spark.sql.execution.ui.SQLExecutionUIData
(various of up to 1G in size)
  - java.lang.String
  - ...

Turning off the SparkUI and/or setting spark.ui.retainedXXX to something 
low (e.g. 1) did not solve the issue.


Any idea what I am doing wrong? Or is this a bug?

My code can be found as a GitHub Gist [0]. More details can be found in
the StackOverflow question [1] I posted, but it has not received any
answers so far.


Thanks!

[0] 
https://gist.github.com/TwUxTLi51Nus/4accdb291494be9201abfad72541ce74
[1] 
http://stackoverflow.com/questions/43637913/apache-spark-outofmemoryerror-heapspace


PS: As a workaround, I have been using "checkpoint" after every few 
iterations.



--
Tw UxTLi51Nus
Email: twuxtli51...@posteo.co

