Re: [EXTERNAL] Re: Unable to access Google buckets using spark-submit

2022-02-14 Thread Saurabh Gulati
Hey Karan,
you can get the jar from here:
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#non-dataproc_clusters
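For a non-Dataproc setup, wiring that connector jar into spark-submit might look roughly like the sketch below. The download URL, jar name, and auth properties are assumptions recalled from that page; verify them there before use.

```shell
# Sketch only: fetch the shaded GCS connector and point Spark's Hadoop layer at it.
# URL, jar name, and auth config keys are assumptions -- check the page linked above.
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar

spark-submit \
  --jars gcs-connector-hadoop3-latest.jar \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/key.json \
  my_job.py
```

With the connector on the classpath this way, gs:// paths can be read directly, with no mounting required.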

From: karan alang 
Sent: 13 February 2022 20:08
To: Gourav Sengupta 
Cc: Holden Karau ; Mich Talebzadeh 
; user @spark 
Subject: [EXTERNAL] Re: Unable to access Google buckets using spark-submit

Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread karan alang
Hi Gaurav, All,
I'm doing a spark-submit from my local system to a GCP Dataproc cluster; this
is more for dev/testing.
I can run a 'gcloud dataproc jobs submit' command as well, which is what
will be done in Production.

Hope that clarifies.

regds,
Karan Alang




Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread karan alang
Hi Holden,

when you mention the GS access jar, which jar is this?
Can you please clarify?

thanks,
Karan Alang



Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread karan alang
Thanks, Mich - will check this and update.

regds,
Karan Alang



Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread Mich Talebzadeh
Putting the GS access jar alongside the Spark jars may technically get
spark-submit working, but creating local copies of jar files is not a
recommended practice.

The approach the thread owner adopted, putting the files in a Google Cloud
bucket, is correct. Indeed this is what he states, and I quote from Stack
Overflow: "I'm trying to access google buckets, when using spark-submit and
running into issues. What needs to be done to debug/fix this."

Hence the approach adopted is correct. He has created a bucket in GCP called
gs://spark-jars-karan/ and wants to access it. I presume he wants to test it
locally (on-prem, I assume), so he just needs to be able to access the bucket
in GCP remotely. The recommendation of using gcsfuse to resolve this issue is
sound.

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Gourav Sengupta
Hi,

I agree with Holden; I have faced quite a few issues with FUSE.

I am also trying to understand "spark-submit from local". Are you submitting
your Spark jobs from a local laptop, or in local mode from a GCP Dataproc
system?

If you are submitting the job from your local laptop, I would expect
performance bottlenecks depending on the internet bandwidth and the volume of
data.

Regards,
Gourav




Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Holden Karau
You can also put the GS access jar with your Spark jars — that’s what the
class not found exception is pointing you towards.
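As a sketch of what this suggests, assuming the GCS connector jar has already been downloaded (the jar name and $SPARK_HOME layout are assumptions):

```shell
# Copy the connector next to Spark's own jars so every job picks it up
# automatically, with no --jars flag needed. Jar name/path are placeholders.
cp gcs-connector-hadoop3-latest.jar "$SPARK_HOME/jars/"
spark-submit my_job.py   # gs:// paths should now resolve
```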

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Mich Talebzadeh
BTW I also answered you on Stack Overflow:

https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit

HTH






On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
wrote:

> You are trying to access a Google storage bucket gs:// from your local
> host.
>
> It does not see the bucket because spark-submit assumes that gs:// is a local
> file system path on the host, which it is not.
>
> You need to mount gs:// bucket as a local file system.
>
> You can use the tool called gcsfuse
> https://cloud.google.com/storage/docs/gcs-fuse . Cloud Storage FUSE is an
> open source FUSE <http://fuse.sourceforge.net/> adapter that allows you
> to mount Cloud Storage buckets as file systems on Linux or macOS systems.
> You can download gcsfuse from here
> <https://github.com/GoogleCloudPlatform/gcsfuse>
>
>
> Pretty simple.
>
>
> It will be installed as /usr/bin/gcsfuse. Create a local mount point such as
> /mnt/gs as root and give other users permission to use it.
>
>
> As a normal user that needs to access gs:// bucket (not as root), use
> gcsfuse to mount it. For example I am mounting a gcs bucket called
> spark-jars-karan here
>
>
> Just use the bucket name itself
>
>
> gcsfuse spark-jars-karan /mnt/gs
>
>
> Then you can refer to it as /mnt/gs in spark-submit from on-premise host
>
> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
> --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
>
> HTH
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 12 Feb 2022 at 04:31, karan alang  wrote:
>
>> Hello All,
>>
>> I'm trying to access gcp buckets while running spark-submit from local,
>> and running into issues.
>>
>> I'm getting error :
>> ```
>>
>> 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop 
>> library for your platform... using builtin-java classes where applicable
>> Exception in thread "main" 
>> org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for 
>> scheme "gs"
>>
>> ```
>> I tried adding the --conf
>> spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
>>
>> to the spark-submit command, but getting ClassNotFoundException
>>
>> Details are in stackoverflow :
>>
>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>
>> Any ideas on how to fix this ?
>> tia !
>>
>>


Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Mich Talebzadeh
You are trying to access a Google Cloud Storage bucket (gs://) from your local
host.

spark-submit does not see it because it assumes the path points to a local file
system on the host, which it is not.

One way around this is to mount the gs:// bucket as a local file system.

You can use the gcsfuse tool:
https://cloud.google.com/storage/docs/gcs-fuse . Cloud Storage FUSE is an
open-source FUSE <http://fuse.sourceforge.net/> adapter that lets you mount
Cloud Storage buckets as file systems on Linux or macOS. You can download
gcsfuse from here
<https://github.com/GoogleCloudPlatform/gcsfuse>


Pretty simple.


It will be installed as /usr/bin/gcsfuse. As root, create a mount point such as
/mnt/gs and give other users permission to use it.


As the normal user that needs to access the gs:// bucket (not as root), use
gcsfuse to mount it. For example, here I am mounting a GCS bucket called
spark-jars-karan.


Use just the bucket name itself (no gs:// prefix):


gcsfuse spark-jars-karan /mnt/gs


Then you can refer to it as /mnt/gs in spark-submit from the on-premises host:

spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 \
  --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
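
Putting the steps above together, here is a minimal end-to-end sketch. It
reuses the spark-jars-karan bucket and jar names from the example; the
application file my_app.py is a placeholder, and the mount point and
permissions are assumptions you may want to tighten.

```shell
# As root: create the mount point and let other users access it.
sudo mkdir -p /mnt/gs
sudo chmod a+rwx /mnt/gs

# As the normal user: mount the bucket (bare bucket name, no gs:// prefix).
gcsfuse spark-jars-karan /mnt/gs

# The bucket's objects now appear as local files under /mnt/gs.
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 \
  --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
  my_app.py

# Unmount when finished.
fusermount -u /mnt/gs
```

Note that gcsfuse normally authenticates via your application default
credentials (gcloud auth application-default login), so authenticate first if
the bucket is not public.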

HTH







Unable to access Google buckets using spark-submit

2022-02-11 Thread karan alang
Hello All,

I'm trying to access GCP buckets while running spark-submit from my local
machine, and I'm running into issues.

I'm getting this error:
```

22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"

```
I tried adding
--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
to the spark-submit command, but now I'm getting a ClassNotFoundException.
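
For reference, that ClassNotFoundException typically means the Cloud Storage
connector jar is not on the Spark classpath. One way to supply it is sketched
below; the connector download location follows Google's Cloud Storage
connector docs, and my_app.py plus the keyfile path are placeholders.

```shell
# Download the GCS connector built for Hadoop 3 (location per Google's
# Cloud Storage connector documentation).
curl -LO https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar

# Ship the jar with the job and map the gs:// scheme to the connector.
spark-submit \
  --jars gcs-connector-hadoop3-latest.jar \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/key.json \
  my_app.py
```

The two fs.*.impl settings register the connector for both the FileSystem and
AbstractFileSystem APIs; the service-account settings are only needed when
default credentials are not already available.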

Details are in this Stack Overflow question:
https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit

Any ideas on how to fix this?
tia!