Pyspark - Kerberos authentication error

2019-08-07 Thread Pravinkumar vp
Hello, I’m new to PySpark programming and I am facing an issue when trying to
connect to HDFS with a Kerberos auth principal and keytab.

My environment is a Docker container, so I installed the necessary
libraries, including pyspark and the krb5-* client packages.

In my scenario, before creating the Spark context I run kinit to obtain a
ticket. After that I create the Spark context and try to connect, but it
fails with the following error:

org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not
enabled. Available: [TOKEN, KERBEROS]
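
For reference, this is the kind of thing I am attempting (a rough sketch with
example names and paths; the config keys are my own guesses, and
spark.yarn.principal/spark.yarn.keytab only apply when running on YARN):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kerberos-hdfs-test")
    # Tell the Hadoop client to use Kerberos rather than SIMPLE auth;
    # normally this comes from core-site.xml under HADOOP_CONF_DIR.
    .config("spark.hadoop.hadoop.security.authentication", "kerberos")
    # Principal/keytab, as would otherwise be passed via --principal/--keytab
    .config("spark.yarn.principal", "myuser@EXAMPLE.COM")                # example value
    .config("spark.yarn.keytab", "/etc/security/keytabs/myuser.keytab")  # example path
    .getOrCreate()
)

# Simple read to test HDFS connectivity (example path)
spark.read.text("hdfs://namenode:8020/tmp/test.txt").show(5)
```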

Please let me know what the issue is here, and also please share some sample
code for creating a Spark session that connects to a Kerberos-enabled cluster
with a principal and keytab.

Thanks!!



Sharing ideas on using Databricks Delta Lake

2019-08-07 Thread Mich Talebzadeh
I upgraded my Spark to 2.4.3, which allows using the Delta Lake storage
layer. Actually, I wish Databricks had chosen a different name for it :)

Anyhow, although most of the storage examples use a normal file system
(/tmp/), I managed to put the data on HDFS itself. I assume this should
work on any Hadoop Compatible File System (HCFS), like GCP buckets etc.?
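
For reference, roughly what I did (a sketch with example paths; the session
was started with the delta-core package on the classpath):

```
from pyspark.sql import SparkSession

# Assumes pyspark/spark-shell was started with something like
#   --packages io.delta:delta-core_2.11:<version>
spark = SparkSession.builder.appName("delta-on-hdfs").getOrCreate()

df = spark.range(0, 1000).withColumnRenamed("id", "event_id")  # dummy data

# Write a Delta table backed by an HDFS directory rather than /tmp
df.write.format("delta").mode("overwrite").save("hdfs:///user/hduser/delta/events")

# Read it back
spark.read.format("delta").load("hdfs:///user/hduser/delta/events").count()
```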

According to the link above:

Delta Lake is an open source storage layer that brings reliability to data
lakes. Delta Lake provides ACID transactions, scalable metadata handling, and
unifies streaming and batch data processing. Delta Lake runs on top of your
existing data lake and is fully compatible with Apache Spark APIs.

So, in a nutshell, with ACID compliance we have an Oracle-type data warehouse
on HDFS with snapshots. So I am thinking out loud: besides its compatibility
with Spark (which is great), where can I use this product to gain a strategic
advantage?

Also, how much functional programming will this support? I gather that once
you have created a DataFrame on top of the storage, windowing analytics etc.
can be used as BAU.
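
For example, something along these lines (a sketch; the table path and column
names are made up, and spark is an existing session):

```
from pyspark.sql import functions as F
from pyspark.sql.window import Window

tx = spark.read.format("delta").load("hdfs:///user/hduser/delta/transactions")

w = Window.partitionBy("account_id").orderBy("event_ts")

(tx
 .withColumn("running_total", F.sum("amount").over(w))
 .withColumn("prev_amount", F.lag("amount", 1).over(w))
 .show(10))
```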

I am sure someone can explain this.

Regards,

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Mich Talebzadeh
Have you updated partition statistics by any chance?
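
Something along these lines, for example (a sketch; the table name is taken
from the thread, and the same statements can be run from the Hive CLI instead):

```
spark.sql("ANALYZE TABLE ExtTable COMPUTE STATISTICS")
spark.sql("REFRESH TABLE ExtTable")  # also clears any cached file listing in Spark
```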

I assume you can access the table and data through Hive itself?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 7 Aug 2019 at 21:07, Patrick McCarthy 
wrote:

> Do the permissions on the hive table files on HDFS correspond with what
> the spark user is able to read? This might arise from spark being run as
> different users.
>
> On Wed, Aug 7, 2019 at 3:15 PM Rishikesh Gawade 
> wrote:
>
>> Hi,
>> I did not explicitly create a Hive Context. I have been using the
>> spark.sqlContext that gets created upon launching the spark-shell.
>> Isn't this sqlContext same as the hiveContext?
>> Thanks,
>> Rishikesh
>>
>> On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke  wrote:
>>
>>> Do you use the HiveContext in Spark? Do you configure the same options
>>> there? Can you share some code?
>>>
>>> On 07.08.2019 at 08:50, Rishikesh Gawade <rishikeshg1...@gmail.com> wrote:
>>>
>>> Hi.
>>> I am using Spark 2.3.2 and Hive 3.1.0.
>>> Even if i use parquet files the result would be same, because after all
>>> sparkSQL isn't able to descend into the subdirectories over which the table
>>> is created. Could there be any other way?
>>> Thanks,
>>> Rishikesh
>>>
>>> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh 
>>> wrote:
>>>
 which versions of Spark and Hive are you using.

 what will happen if you use parquet tables instead?

 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
 wrote:

> Hi.
> I have built a Hive external table on top of a directory 'A' which has
> data stored in ORC format. This directory has several subdirectories 
> inside
> it, each of which contains the actual ORC files.
> These subdirectories are actually created by spark jobs which ingest
> data from other sources and write it into this directory.
> I tried creating a table and setting the table properties of the same
> as *hive.mapred.supports.subdirectories=TRUE* and
> *mapred.input.dir.recursive**=TRUE*.
> As a result of this, when i fire the simplest query of *select
> count(*) from ExtTable* via the Hive CLI, it successfully gives me
> the expected count of records in the table.
> However, when i fire the same query via sparkSQL, i get count = 0.
>
> I think the sparkSQL isn't able to descend into the subdirectories for
> getting the data while hive is able to do so.
> Are there any configurations needed to be set on the spark side so
> that this works as it does via hive cli?
> I am using Spark on YARN.
>
> Thanks,
> Rishikesh
>
> Tags: subdirectories, subdirectory, recursive, recursion, hive
> external table, orc, sparksql, yarn
>

>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Patrick McCarthy
Do the permissions on the Hive table files on HDFS correspond with what the
Spark user is able to read? This might arise from Spark being run as a
different user.
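
A quick way to check which user the Spark job actually runs as (a sketch;
compare the result with the owner and permissions shown by `hdfs dfs -ls` on
the table directory):

```
# Effective user of the running Spark application
print(spark.sparkContext.sparkUser())
```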

On Wed, Aug 7, 2019 at 3:15 PM Rishikesh Gawade 
wrote:

> Hi,
> I did not explicitly create a Hive Context. I have been using the
> spark.sqlContext that gets created upon launching the spark-shell.
> Isn't this sqlContext same as the hiveContext?
> Thanks,
> Rishikesh
>
> On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke  wrote:
>
>> Do you use the HiveContext in Spark? Do you configure the same options
>> there? Can you share some code?
>>
>> On 07.08.2019 at 08:50, Rishikesh Gawade wrote:
>>
>> Hi.
>> I am using Spark 2.3.2 and Hive 3.1.0.
>> Even if i use parquet files the result would be same, because after all
>> sparkSQL isn't able to descend into the subdirectories over which the table
>> is created. Could there be any other way?
>> Thanks,
>> Rishikesh
>>
>> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh 
>> wrote:
>>
>>> which versions of Spark and Hive are you using.
>>>
>>> what will happen if you use parquet tables instead?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
>>> wrote:
>>>
 Hi.
 I have built a Hive external table on top of a directory 'A' which has
 data stored in ORC format. This directory has several subdirectories inside
 it, each of which contains the actual ORC files.
 These subdirectories are actually created by spark jobs which ingest
 data from other sources and write it into this directory.
 I tried creating a table and setting the table properties of the same
 as *hive.mapred.supports.subdirectories=TRUE* and
 *mapred.input.dir.recursive**=TRUE*.
 As a result of this, when i fire the simplest query of *select
 count(*) from ExtTable* via the Hive CLI, it successfully gives me the
 expected count of records in the table.
 However, when i fire the same query via sparkSQL, i get count = 0.

 I think the sparkSQL isn't able to descend into the subdirectories for
 getting the data while hive is able to do so.
 Are there any configurations needed to be set on the spark side so that
 this works as it does via hive cli?
 I am using Spark on YARN.

 Thanks,
 Rishikesh

 Tags: subdirectories, subdirectory, recursive, recursion, hive external
 table, orc, sparksql, yarn

>>>

-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


Re: Spark scala/Hive scenario

2019-08-07 Thread Jörn Franke
You can use the map datatype on the Hive table for the columns that are 
uncertain:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes

However, maybe you can share more concrete details, because there could also
be other solutions.
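
A rough sketch of the idea in PySpark (the same idea applies in Scala; the
table, column and file names below are made up): keep the stable columns
typed as usual and pack anything extra into the map.

```
from pyspark.sql import functions as F

# Hypothetical target table: fixed columns plus a catch-all map
spark.sql("""
  CREATE TABLE IF NOT EXISTS mydb.events (
    id STRING,
    event_ts TIMESTAMP,
    extra_attrs MAP<STRING, STRING>
  )
  STORED AS PARQUET
""")

# Day-2 feed with additional fields: everything beyond the stable columns goes into the map
df = spark.read.option("header", True).csv("/data/day2/feed.csv")  # example input
stable = ["id", "event_ts"]
extras = [c for c in df.columns if c not in stable]

packed = df.select(
    F.col("id"),
    F.col("event_ts").cast("timestamp"),
    F.create_map(*[p for c in extras for p in (F.lit(c), F.col(c).cast("string"))]).alias("extra_attrs"),
)
packed.write.insertInto("mydb.events")
```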

> On 07.08.2019 at 20:40, anbutech wrote:
> 
> Hi All,
> 
> I have a scenario in (Spark scala/Hive):
> 
> Day 1:
> 
> i have a file with 5 columns which needs to be processed and loaded into
> hive tables.
> day2:
> 
> Next day the same feeds(file) has 8 columns(additional fields) which needs
> to be processed and loaded into hive tables
> 
> How do we approach this problem without changing the target table schema.Is
> there any way we can achieve this.
> 
> Thanks
> Anbu
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Rishikesh Gawade
Hi,
I did not explicitly create a Hive Context. I have been using the
spark.sqlContext that gets created upon launching the spark-shell.
Isn't this sqlContext the same as the hiveContext?
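
A quick way to check whether the session is actually talking to the Hive
metastore (a sketch):

```
# If Hive support is active and hive-site.xml is being picked up, the Hive
# databases and tables should be visible through the catalog API
print(spark.catalog.listDatabases())
spark.sql("SHOW TABLES IN default").show()
```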
Thanks,
Rishikesh

On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke  wrote:

> Do you use the HiveContext in Spark? Do you configure the same options
> there? Can you share some code?
>
> On 07.08.2019 at 08:50, Rishikesh Gawade wrote:
>
> Hi.
> I am using Spark 2.3.2 and Hive 3.1.0.
> Even if i use parquet files the result would be same, because after all
> sparkSQL isn't able to descend into the subdirectories over which the table
> is created. Could there be any other way?
> Thanks,
> Rishikesh
>
> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh 
> wrote:
>
>> which versions of Spark and Hive are you using.
>>
>> what will happen if you use parquet tables instead?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
>> wrote:
>>
>>> Hi.
>>> I have built a Hive external table on top of a directory 'A' which has
>>> data stored in ORC format. This directory has several subdirectories inside
>>> it, each of which contains the actual ORC files.
>>> These subdirectories are actually created by spark jobs which ingest
>>> data from other sources and write it into this directory.
>>> I tried creating a table and setting the table properties of the same as
>>> *hive.mapred.supports.subdirectories=TRUE* and
>>> *mapred.input.dir.recursive**=TRUE*.
>>> As a result of this, when i fire the simplest query of *select count(*)
>>> from ExtTable* via the Hive CLI, it successfully gives me the expected
>>> count of records in the table.
>>> However, when i fire the same query via sparkSQL, i get count = 0.
>>>
>>> I think the sparkSQL isn't able to descend into the subdirectories for
>>> getting the data while hive is able to do so.
>>> Are there any configurations needed to be set on the spark side so that
>>> this works as it does via hive cli?
>>> I am using Spark on YARN.
>>>
>>> Thanks,
>>> Rishikesh
>>>
>>> Tags: subdirectories, subdirectory, recursive, recursion, hive external
>>> table, orc, sparksql, yarn
>>>
>>


Spark scala/Hive scenario

2019-08-07 Thread anbutech
Hi All,

I have a scenario in (Spark scala/Hive):

Day 1:

I have a file with 5 columns that needs to be processed and loaded into
Hive tables.

Day 2:

The next day, the same feed (file) has 8 columns (additional fields) that
need to be processed and loaded into the Hive tables.

How do we approach this problem without changing the target table schema?
Is there any way we can achieve this?

Thanks
Anbu



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark SQL reads all leaf directories on a partitioned Hive table

2019-08-07 Thread Hao Ren
Hi,
I am using Spark SQL 2.3.3 to read a Hive table which is partitioned by
day, hour, platform, request_status and is_sampled. The underlying data is
in Parquet format on HDFS.
Here is the SQL query to read just *one partition*.

```
spark.sql("""
SELECT rtb_platform_id, SUM(e_cpm)
FROM raw_logs.fact_request
WHERE day = '2019-08-01'
AND hour = '00'
AND platform = 'US'
AND request_status = '3'
AND is_sampled = 1
GROUP BY rtb_platform_id
""").show
```

However, from the Spark web UI, the stage description shows:

```
Listing leaf files and directories for 201616 paths:
viewfs://root/user/bilogs/logs/fact_request/day=2018-08-01/hour=11/platform=AS/request_status=0/is_sampled=0,
...
```

It seems the job is reading all of the partitions of the table, and it
takes too long for just one partition. One workaround is using the
`spark.read.parquet` API to read the Parquet files directly, since Spark has
partition awareness for partitioned directories.
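
That is, something along these lines (a sketch; the base path is taken from
the listing above, and the basePath option keeps the partition columns in the
result):

```
from pyspark.sql import functions as F

base = "viewfs://root/user/bilogs/logs/fact_request"

df = (spark.read
      .option("basePath", base)  # keep day/hour/... as columns
      .parquet(base + "/day=2019-08-01/hour=00/platform=US/request_status=3/is_sampled=1"))

df.groupBy("rtb_platform_id").agg(F.sum("e_cpm")).show()
```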

But still, I would like to know if there is a way to leverage partition
awareness via Hive by using the `spark.sql` API?

Any help is highly appreciated!

Thank you.

-- 
Hao Ren


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Jörn Franke
Do you use the HiveContext in Spark? Do you configure the same options there? 
Can you share some code?
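
For comparison, a minimal sketch of what I mean, explicitly enabling Hive
support when building the session (table name taken from the thread):

```
from pyspark.sql import SparkSession

# spark-shell/pyspark usually enable this automatically when built with Hive,
# but a standalone application has to request it explicitly
spark = (SparkSession.builder
         .appName("hive-table-check")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SELECT COUNT(*) FROM ExtTable").show()
```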

> On 07.08.2019 at 08:50, Rishikesh Gawade wrote:
> 
> Hi.
> I am using Spark 2.3.2 and Hive 3.1.0. 
> Even if i use parquet files the result would be same, because after all 
> sparkSQL isn't able to descend into the subdirectories over which the table 
> is created. Could there be any other way?
> Thanks,
> Rishikesh
> 
>> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh  
>> wrote:
>> which versions of Spark and Hive are you using.
>> 
>> what will happen if you use parquet tables instead?
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> 
>>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade  
>>> wrote:
>>> Hi.
>>> I have built a Hive external table on top of a directory 'A' which has data 
>>> stored in ORC format. This directory has several subdirectories inside it, 
>>> each of which contains the actual ORC files.
>>> These subdirectories are actually created by spark jobs which ingest data 
>>> from other sources and write it into this directory.
>>> I tried creating a table and setting the table properties of the same as 
>>> hive.mapred.supports.subdirectories=TRUE and 
>>> mapred.input.dir.recursive=TRUE.
>>> As a result of this, when i fire the simplest query of select count(*) from 
>>> ExtTable via the Hive CLI, it successfully gives me the expected count of 
>>> records in the table.
>>> However, when i fire the same query via sparkSQL, i get count = 0.
>>> 
>>> I think the sparkSQL isn't able to descend into the subdirectories for 
>>> getting the data while hive is able to do so.
>>> Are there any configurations needed to be set on the spark side so that 
>>> this works as it does via hive cli? 
>>> I am using Spark on YARN.
>>> 
>>> Thanks,
>>> Rishikesh
>>> 
>>> Tags: subdirectories, subdirectory, recursive, recursion, hive external 
>>> table, orc, sparksql, yarn


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Rishikesh Gawade
Hi.
I am using Spark 2.3.2 and Hive 3.1.0.
Even if I use Parquet files the result would be the same, because after all
sparkSQL isn't able to descend into the subdirectories over which the table
is created. Could there be any other way?
Thanks,
Rishikesh

On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh 
wrote:

> which versions of Spark and Hive are you using.
>
> what will happen if you use parquet tables instead?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
> wrote:
>
>> Hi.
>> I have built a Hive external table on top of a directory 'A' which has
>> data stored in ORC format. This directory has several subdirectories inside
>> it, each of which contains the actual ORC files.
>> These subdirectories are actually created by spark jobs which ingest data
>> from other sources and write it into this directory.
>> I tried creating a table and setting the table properties of the same as
>> *hive.mapred.supports.subdirectories=TRUE* and
>> *mapred.input.dir.recursive**=TRUE*.
>> As a result of this, when i fire the simplest query of *select count(*)
>> from ExtTable* via the Hive CLI, it successfully gives me the expected
>> count of records in the table.
>> However, when i fire the same query via sparkSQL, i get count = 0.
>>
>> I think the sparkSQL isn't able to descend into the subdirectories for
>> getting the data while hive is able to do so.
>> Are there any configurations needed to be set on the spark side so that
>> this works as it does via hive cli?
>> I am using Spark on YARN.
>>
>> Thanks,
>> Rishikesh
>>
>> Tags: subdirectories, subdirectory, recursive, recursion, hive external
>> table, orc, sparksql, yarn
>>
>