Re: spark kafka consumer with kerberos

2017-03-31 Thread Saisai Shao
Hi Bill,

Normally a Kerberos principal and keytab should be enough, because the
keytab effectively takes the place of the password. Did you configure
SASL/GSSAPI or SASL/PLAIN for the KafkaClient entry?
http://kafka.apache.org/documentation.html#security_sasl
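For reference, a minimal sketch of a keytab-based KafkaClient JAAS entry (the
principal, realm, and keytab path below are placeholders, not taken from your
setup):

```
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="./v.keytab"
  principal="your-principal@YOUR.REALM"
  serviceName="kafka";
};
```

The executors also need to be able to resolve that JAAS file and keytab
locally, which is what the --files and extraJavaOptions settings further down
this thread take care of.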

This is really more of a Kafka question, and it is most likely a
configuration issue, so I would suggest asking it on the Kafka mailing
list as well.

Thanks
Saisai


On Fri, Mar 31, 2017 at 10:28 PM, Bill Schwanitz  wrote:

> Saisai,
>
> Yeah, that seems to have helped. It looks like the Kerberos ticket from
> when I submit does not get passed to the executors?
>
> ... 3 more
> Caused by: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: Unable to obtain password from user
> at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:86)
> at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:70)
> at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:83)
> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:623)
> ... 14 more
> Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
>
>
> On Fri, Mar 31, 2017 at 9:08 AM, Saisai Shao 
> wrote:
>
>> Hi Bill,
>>
>> The exception is from the executor side. From the gist you provided, it
>> looks like the issue is that you only configured the Java options on the
>> driver side; I think you should also configure them on the executor side.
>> You could refer to
>> https://github.com/hortonworks-spark/skc#running-on-a-kerberos-enabled-cluster
>>
>> --files key.conf#key.conf,v.keytab#v.keytab
>> --driver-java-options "-Djava.security.auth.login.config=./key.conf"
>> --conf 
>> "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./key.conf"
>>
>>
>> On Fri, Mar 31, 2017 at 1:58 AM, Bill Schwanitz 
>> wrote:
>>
>>> I'm working on a proof-of-concept Spark job to pull data from a Kafka
>>> topic whose brokers require Kerberos.
>>>
>>> The code seems to connect to Kafka and enter a polling mode. When I toss
>>> something onto the topic I get an exception which I just can't seem to
>>> figure out. Any ideas?
>>>
>>> I have a full gist up at
>>> https://gist.github.com/bilsch/17f4a4c4303ed3e004e2234a5904f0de with a
>>> lot of details. If I use the hdfs/spark client code for normal operations
>>> everything works fine, but for some reason the streaming code is having
>>> issues. I have verified the KafkaClient entry is in the JAAS config, the
>>> keytab is good, etc.
>>>
>>> I'm guessing I'm doing something wrong; I just have not figured out what
>>> yet. Any thoughts?
>>>
>>> The exception:
>>>
>>> 17/03/30 12:54:00 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, host5.some.org.net): org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
>>> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:702)
>>> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:557)
>>> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:540)
>>> at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.<init>(CachedKafkaConsumer.scala:47)
>>> at org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.get(CachedKafkaConsumer.scala:157)
>>> at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.<init>(KafkaRDD.scala:210)
>>> at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:185)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:86)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Jaas configuration not found
>>> at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:86)
>>> at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:70)
>>> at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:83)
>>> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:623)
>>> ... 14 more
>>> Caused by: org.apache.kafka.common.KafkaException: Jaas configuration not found
>>> at org.apache.kafka.common.security.kerberos.KerberosLogin.getServiceName(KerberosLogin.java:299)
>>> at org.apache.kafka.common.security.kerberos.KerberosLogin.configure(KerberosLogin.java:103)
>>> at 

Partitioning in spark while reading from RDBMS via JDBC

2017-03-31 Thread Devender Yadav
Hi All,


I am running Spark in cluster mode and reading data from an RDBMS via JDBC.

As per the Spark docs, these partitioning parameters describe how to
partition the table when reading in parallel from multiple workers:

partitionColumn,
lowerBound,
upperBound,
numPartitions


These are optional parameters.
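For reference, a minimal sketch of how these options are passed when they are
specified (the connection URL, table, credentials, and column names below are
placeholders for the real source):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitioned-read").getOrCreate()

// All connection details below are placeholders for the real RDBMS source.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "schema.my_table")
  .option("user", "dbuser")
  .option("password", "dbpass")
  .option("partitionColumn", "id")  // a numeric column in Spark 2.x
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "10")    // 10 concurrent JDBC reads, one per partition
  .load()
```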

What would happen if I don't specify these?

  *   Does only one worker read the whole table?
  *   If it still reads in parallel, how does it partition the data?



Regards,
Devender










Re: Looking at EMR Logs

2017-03-31 Thread Neil Jonkers
If you modify spark.eventLog.dir to point to an S3 path, you will encounter
the following exception in the Spark history server log at
/var/log/spark/spark-history-server.out:


Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2702)

To move past this issue, we can do the following (this is for EMR release
emr-5.4.0):

cd /usr/lib/spark/jars
sudo ln -s /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.15.0.jar emrfs.jar

Now the Spark history server will start up correctly and you can review the
Spark event logs on S3.
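For completeness, the applications and the history server both need to point
at the same location; on EMR that is typically done in
/etc/spark/conf/spark-defaults.conf. A sketch, with a placeholder bucket and
prefix:

```
spark.eventLog.enabled           true
spark.eventLog.dir               s3://my-bucket/spark-event-logs
spark.history.fs.logDirectory    s3://my-bucket/spark-event-logs
```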


On Fri, Mar 31, 2017 at 4:46 PM, Vadim Semenov 
wrote:

> You can provide your own log directory where the Spark event logs will be
> saved, and you can replay them afterwards.
>
> Set this in your job: `spark.eventLog.dir=s3://bucket/some/directory` and
> run it.
> Note! The path `s3://bucket/some/directory` must exist before you run your
> job; it will not be created automatically.
>
> The Spark HistoryServer on EMR won't show you anything because it's
> looking for logs in `hdfs:///var/log/spark/apps` by default.
>
> After that you can either copy the log files from s3 to the hdfs path
> above, or you can copy them locally to `/tmp/spark-events` (the default
> directory for spark logs) and run the history server like:
> ```
> cd /usr/local/src/spark-1.6.1-bin-hadoop2.6
> sbin/start-history-server.sh
> ```
> and then open http://localhost:18080
>
>
>
>
> On Thu, Mar 30, 2017 at 8:45 PM, Paul Tremblay 
> wrote:
>
>> I am looking for tips on evaluating my Spark job after it has run.
>>
>> I know that right now I can look at the history of jobs through the web
>> ui. I also know how to look at the current resources being used by a
>> similar web ui.
>>
>> However, I would like to look at the logs after the job is finished to
>> evaluate such things as how many tasks were completed, how many executors
>> were used, etc. I currently save my logs to S3.
>>
>> Thanks!
>>
>> Henry
>>
>> --
>> Paul Henry Tremblay
>> Robert Half Technology
>>
>
>


Re: Looking at EMR Logs

2017-03-31 Thread Vadim Semenov
You can provide your own log directory where the Spark event logs will be
saved, and you can replay them afterwards.

Set this in your job: `spark.eventLog.dir=s3://bucket/some/directory` and
run it.
Note! The path `s3://bucket/some/directory` must exist before you run your
job; it will not be created automatically.

The Spark HistoryServer on EMR won't show you anything because it's looking
for logs in `hdfs:///var/log/spark/apps` by default.

After that you can either copy the log files from s3 to the hdfs path
above, or you can copy them locally to `/tmp/spark-events` (the default
directory for spark logs) and run the history server like:
```
cd /usr/local/src/spark-1.6.1-bin-hadoop2.6
sbin/start-history-server.sh
```
and then open http://localhost:18080
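For example, assuming the AWS CLI and Hadoop tools are available on the node,
the copy step could look like this (the bucket and prefix are placeholders):

```
# Copy the event logs to the history server's default local directory ...
aws s3 sync s3://bucket/some/directory /tmp/spark-events

# ... or put them where the EMR history server already looks.
hadoop distcp s3://bucket/some/directory hdfs:///var/log/spark/apps
```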




On Thu, Mar 30, 2017 at 8:45 PM, Paul Tremblay 
wrote:

> I am looking for tips on evaluating my Spark job after it has run.
>
> I know that right now I can look at the history of jobs through the web
> ui. I also know how to look at the current resources being used by a
> similar web ui.
>
> However, I would like to look at the logs after the job is finished to
> evaluate such things as how many tasks were completed, how many executors
> were used, etc. I currently save my logs to S3.
>
> Thanks!
>
> Henry
>
> --
> Paul Henry Tremblay
> Robert Half Technology
>


Re: spark kafka consumer with kerberos

2017-03-31 Thread Bill Schwanitz
Saisai,

Yeah, that seems to have helped. It looks like the Kerberos ticket from when
I submit does not get passed to the executors?

... 3 more
Caused by: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: Unable to obtain password from user
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:86)
at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:70)
at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:83)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:623)
... 14 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user


On Fri, Mar 31, 2017 at 9:08 AM, Saisai Shao  wrote:

> Hi Bill,
>
> The exception is from the executor side. From the gist you provided, it
> looks like the issue is that you only configured the Java options on the
> driver side; I think you should also configure them on the executor side.
> You could refer to
> https://github.com/hortonworks-spark/skc#running-on-a-kerberos-enabled-cluster
>
> --files key.conf#key.conf,v.keytab#v.keytab
> --driver-java-options "-Djava.security.auth.login.config=./key.conf"
> --conf 
> "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./key.conf"
>
>
> On Fri, Mar 31, 2017 at 1:58 AM, Bill Schwanitz  wrote:
>
>> I'm working on a proof-of-concept Spark job to pull data from a Kafka
>> topic whose brokers require Kerberos.
>>
>> The code seems to connect to Kafka and enter a polling mode. When I toss
>> something onto the topic I get an exception which I just can't seem to
>> figure out. Any ideas?
>>
>> I have a full gist up at
>> https://gist.github.com/bilsch/17f4a4c4303ed3e004e2234a5904f0de with a
>> lot of details. If I use the hdfs/spark client code for normal operations
>> everything works fine, but for some reason the streaming code is having
>> issues. I have verified the KafkaClient entry is in the JAAS config, the
>> keytab is good, etc.
>>
>> I'm guessing I'm doing something wrong; I just have not figured out what
>> yet. Any thoughts?
>>
>> The exception:
>>
>> 17/03/30 12:54:00 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, host5.some.org.net): org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
>> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:702)
>> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:557)
>> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:540)
>> at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.<init>(CachedKafkaConsumer.scala:47)
>> at org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.get(CachedKafkaConsumer.scala:157)
>> at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.<init>(KafkaRDD.scala:210)
>> at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:185)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>> at org.apache.spark.scheduler.Task.run(Task.scala:86)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Jaas configuration not found
>> at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:86)
>> at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:70)
>> at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:83)
>> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:623)
>> ... 14 more
>> Caused by: org.apache.kafka.common.KafkaException: Jaas configuration not found
>> at org.apache.kafka.common.security.kerberos.KerberosLogin.getServiceName(KerberosLogin.java:299)
>> at org.apache.kafka.common.security.kerberos.KerberosLogin.configure(KerberosLogin.java:103)
>> at org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:45)
>> at org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:68)
>> at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:78)
>> ... 17 more
>> Caused by: java.io.IOException: Could not find a 'KafkaClient' entry in this configuration.
>> at org.apache.kafka.common.security.JaasUtils.jaasConfig(JaasUtils.java:50)
>> at org.apache.kafka.common.security.kerberos.KerberosLogin.getServiceName(KerberosLogin.java:297)
>> ... 21 more
>>
>
>


Re: spark kafka consumer with kerberos

2017-03-31 Thread Saisai Shao
Hi Bill,

The exception is from the executor side. From the gist you provided, it
looks like the issue is that you only configured the Java options on the
driver side; I think you should also configure them on the executor side.
You could refer to
https://github.com/hortonworks-spark/skc#running-on-a-kerberos-enabled-cluster

--files key.conf#key.conf,v.keytab#v.keytab
--driver-java-options "-Djava.security.auth.login.config=./key.conf"
--conf 
"spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./key.conf"

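Put together, a full submit command following that pattern could look like the
sketch below (the master, deploy mode, class name, and jar are placeholders
for whatever the job actually uses):

```
spark-submit \
  --master yarn \
  --deploy-mode client \
  --files key.conf#key.conf,v.keytab#v.keytab \
  --driver-java-options "-Djava.security.auth.login.config=./key.conf" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./key.conf" \
  --class com.example.KafkaStreamingJob \
  my-streaming-job.jar
```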

On Fri, Mar 31, 2017 at 1:58 AM, Bill Schwanitz  wrote:

> I'm working on a proof-of-concept Spark job to pull data from a Kafka
> topic whose brokers require Kerberos.
>
> The code seems to connect to Kafka and enter a polling mode. When I toss
> something onto the topic I get an exception which I just can't seem to
> figure out. Any ideas?
>
> I have a full gist up at
> https://gist.github.com/bilsch/17f4a4c4303ed3e004e2234a5904f0de with a
> lot of details. If I use the hdfs/spark client code for normal operations
> everything works fine, but for some reason the streaming code is having
> issues. I have verified the KafkaClient entry is in the JAAS config, the
> keytab is good, etc.
>
> I'm guessing I'm doing something wrong; I just have not figured out what
> yet. Any thoughts?
>
> The exception:
>
> 17/03/30 12:54:00 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, host5.some.org.net): org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:702)
> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:557)
> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:540)
> at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.<init>(CachedKafkaConsumer.scala:47)
> at org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.get(CachedKafkaConsumer.scala:157)
> at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.<init>(KafkaRDD.scala:210)
> at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:185)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Jaas configuration not found
> at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:86)
> at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:70)
> at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:83)
> at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:623)
> ... 14 more
> Caused by: org.apache.kafka.common.KafkaException: Jaas configuration not found
> at org.apache.kafka.common.security.kerberos.KerberosLogin.getServiceName(KerberosLogin.java:299)
> at org.apache.kafka.common.security.kerberos.KerberosLogin.configure(KerberosLogin.java:103)
> at org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:45)
> at org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:68)
> at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:78)
> ... 17 more
> Caused by: java.io.IOException: Could not find a 'KafkaClient' entry in this configuration.
> at org.apache.kafka.common.security.JaasUtils.jaasConfig(JaasUtils.java:50)
> at org.apache.kafka.common.security.kerberos.KerberosLogin.getServiceName(KerberosLogin.java:297)
> ... 21 more
>


Research paper used in GraphX

2017-03-31 Thread Md. Rezaul Karim
Hi All,

Could anyone please tell me which research paper(s) were used to implement
metrics like strongly connected components, PageRank, triangle count,
closeness centrality, clustering coefficient, etc. in Spark GraphX?
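For reference, the operators GraphX does ship for some of these metrics can be
invoked as in the sketch below (the edge-list path is a placeholder; closeness
centrality and clustering coefficient have no built-in GraphX operator):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

val sc = new SparkContext(new SparkConf().setAppName("graphx-metrics"))

// Placeholder edge list of "srcId dstId" pairs.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/followers.txt")

val ranks      = graph.pageRank(0.0001).vertices                 // PageRank
val components = graph.connectedComponents().vertices            // connected components
val scc        = graph.stronglyConnectedComponents(numIter = 5).vertices
val triangles  = graph.triangleCount().vertices                  // per-vertex triangle counts
```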




Regards,
_
*Md. Rezaul Karim*, BSc, MSc
Ph.D. Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html



Re: How to PushDown ParquetFilter Spark 2.0.1 dataframe

2017-03-31 Thread Hanumath Rao Maduri
Hello Rahul,

Please try to use df.filter(df("id").isin(1,2))
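In case it is useful, a sketch of the same idea applied to the original array
of IDs (the S3 path is a placeholder, and whether the resulting In predicate
is actually pushed into the Parquet reader depends on the Spark version):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("parquet-isin").getOrCreate()

val requiredDataId = Array(1, 2)  // may grow to hundreds of ids

// Placeholder path; s3a:// or s3:// depending on the environment.
val df = spark.read.parquet("s3a://my-bucket/path/to/data")

// Expand the array into isin's varargs so the predicate is expressed on the
// column rather than on the Scala array.
val filtered = df.filter(col("id").isin(requiredDataId: _*))
filtered.show()
```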

Thanks,

On Thu, Mar 30, 2017 at 10:45 PM, Rahul Nandi 
wrote:

> Hi,
> I have around 2 million records as a Parquet file in S3. The file
> structure is somewhat like:
>
> id  data
> 1   abc
> 2   cdf
> 3   fas
>
> Now I want to filter and take the records where the id matches one of my
> required IDs.
>
> val requiredDataId = Array(1,2) //Might go upto 100s of records.
>
> df.filter(requiredDataId.contains("id"))
>
> This is my use case.
>
> What will be best way to do this in spark 2.0.1 where I can also pushDown
> the filter to parquet?
>
>
>
> Thanks and Regards,
> Rahul
>
>


Predicate not getting pushed down to PrunedFilteredScan

2017-03-31 Thread Hanumath Rao Maduri
Hello All,

I am working on creating a new PrunedFilteredScan operator which has the
ability to execute the predicates pushed down to it.

However, what I observed is that if a column deep in the hierarchy is used,
the predicate is not getting pushed down.

SELECT tom._id, tom.address.city from tom where tom.address.city = "Peter"

Here predicate tom.address.city = "Peter" is not getting pushed down.

However, if a first-level column name is used, the predicate is getting
pushed down.

SELECT tom._id, tom.address.city from tom where tom.first_name = "Peter"


Please let me know what is the issue in this case.
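For anyone reproducing this, a minimal sketch of the relation side (the class
name and schema below are hypothetical, shaped like the "tom" example) that
just reports what Spark pushes into buildScan:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types._

// Hypothetical relation used only to inspect which predicates are pushed down.
class InspectingRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  // Nested schema shaped like the "tom" table in the question.
  override def schema: StructType = StructType(Seq(
    StructField("_id", StringType),
    StructField("first_name", StringType),
    StructField("address", StructType(Seq(StructField("city", StringType))))))

  // Spark hands over the pruned columns and the filters it chose to push.
  // Filters on top-level columns (e.g. EqualTo("first_name", "Peter")) show up
  // here; predicates on nested fields such as address.city generally are not
  // translated to source Filters in Spark 2.x and are evaluated after the scan.
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    println(s"columns: ${requiredColumns.mkString(", ")}; pushed filters: ${filters.mkString(", ")}")
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
```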

Thanks,


Re: dataframe filter, unable to bind variable

2017-03-31 Thread shyla deshpande
Works. Thanks Hosur.
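For the archive, a minimal sketch of the pattern Hosur suggested (the
DataFrame and date values below are placeholders; fromDate and toDate are
assumed to be String variables in the same format as the column):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("between-example").getOrCreate()
import spark.implicits._

// Hypothetical DataFrame standing in for the real one.
val df = Seq(("2017-03-19", 1), ("2017-03-21", 2)).toDF("createdate", "value")

val fromDate = "2017-03-20"   // variables instead of constants
val toDate   = "2017-03-22"

// lit() wraps the Scala values in Column literals for use in the expression.
val filtered = df.filter($"createdate".between(lit(fromDate), lit(toDate)))
filtered.show()
```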

On Thu, Mar 30, 2017 at 8:37 PM, hosur narahari  wrote:

> Try lit(fromDate) and lit(toDate). You have to import
> org.apache.spark.sql.functions.lit to use it.
>
> On 31 Mar 2017 7:45 a.m., "shyla deshpande" 
> wrote:
>
> The following works
>
> df.filter($"createdate".between("2017-03-20", "2017-03-22"))
>
>
> I would like to pass variables fromdate and todate to the filter instead
> of constants. Unable to get the syntax right. Please help.
>
>
> Thanks
>
>
>