Re: Use SparkContext in Web Application

2018-10-03 Thread Girish Vasmatkar
On Mon, Oct 1, 2018 at 12:18 PM Girish Vasmatkar <
girish.vasmat...@hotwaxsystems.com> wrote:

> Hi All
>
> We are very early into our Spark days so the following may sound like a
> novice question :) I will try to keep this as short as possible.
>
> We are trying to use Spark to build a recommendation engine that provides
> product recommendations, and we need help with some design decisions before
> moving forward. Ours is a web application running on Tomcat. So far, I have
> created a simple POC (a standalone Java program) that reads in a CSV file,
> feeds it to FPGrowth, fits the model, and runs transformations. I would like
> to be able to do the following -
>
>
>    - A scheduler runs nightly in Tomcat (which it does currently) and reads
>    everything from the DB to train/fit the system. This can grow into really
>    large data, and every day we will have new data. Should I just use a
>    SparkContext here, within my scheduler, to fit the system? Is this the
>    correct way to go about it? I am also planning to save the model on S3,
>    which should be okay. We also considered using HDFS. The scheduler's job
>    will be just to create the model, save it, and be done with it.
>- On the product page, we can then use the saved model to display the
>product recommendations for a particular product.
>    - My understanding is that I should be able to use a SparkContext here
>    in my web application to just load the saved model and use it to derive
>    the recommendations. Is this a good design? The problem I see with this
>    approach is that a SparkContext takes time to initialize, and this may
>    cost dearly. Or should we keep one SparkContext per web application so
>    that a single instance is reused? We could initialize it during the
>    application context initialization phase.
>
>
> Since I am fairly new to using Spark properly, please help me decide
> whether the way I plan to use Spark is the recommended approach. I have
> also seen use cases involving Kafka that communicates with Spark, but can
> we not do it directly using a SparkContext? I am sure a lot of my
> understanding is wrong, so please feel free to correct me.
>
> Thanks and Regards,
> Girish Vasmatkar
> HotWax Systems
>
>
>
>
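A rough PySpark sketch of the two halves described above: the nightly fit-and-save
step and the load-and-recommend step in the web tier. The file path, bucket name,
column layout, and FPGrowth thresholds are placeholders, and a CSV read stands in
for the DB read, so treat this as an illustration of the flow rather than the
actual implementation.

from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth, FPGrowthModel

spark = SparkSession.builder.appName("nightly-recommender").getOrCreate()

# Nightly scheduler job: fit the model and persist it.
# Each row is one order, with the purchased product ids space-separated in an
# "items" column (placeholder layout); in the real job this comes from the DB.
baskets = (spark.read.csv("orders.csv", header=True)
           .selectExpr("split(items, ' ') AS items"))

fp = FPGrowth(itemsCol="items", minSupport=0.01, minConfidence=0.1)  # placeholder thresholds
model = fp.fit(baskets)

# Persist to S3 (an hdfs:// path would work the same way); the bucket is a placeholder.
model.write().overwrite().save("s3a://my-bucket/models/fpgrowth-latest")

# Web application: load the saved model once (for example at application
# context initialization) and reuse it for every request.
model = FPGrowthModel.load("s3a://my-bucket/models/fpgrowth-latest")

# Items on the product page the visitor is looking at.
current = spark.createDataFrame([(["product-123"],)], ["items"])
recommendations = model.transform(current)  # adds a "prediction" column of suggested items
recommendations.show(truncate=False)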


Re: Use SparkContext in Web Application

2018-10-03 Thread Girish Vasmatkar
All

Can someone please shed some light on the above query? Any help is greatly
appreciated.

Thanks,
Girish Vasmatkar
HotWax Systems




Re: How to do a broadcast join using raw Spark SQL 2.3.1 or 2.3.2?

2018-10-03 Thread kathleen li
Not sure what you mean by “raw” Spark SQL, but there is one parameter that
controls whether the optimizer chooses a broadcast join automatically:

spark.sql.autoBroadcastJoinThreshold

You can read the Spark documentation about the above parameter, and use EXPLAIN to
check whether your join uses a broadcast.

Make sure you gather statistics for the tables.
 
There is also a broadcast hint. Please be aware that if the table being broadcast to
all worker nodes is fairly big, it is not always a good option.
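As a sketch only (the table names, join keys, and threshold value below are made up),
both of the options just mentioned can be exercised from plain SQL through a
SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

# Tiny stand-in tables so the example is self-contained.
spark.createDataFrame([(1, 100.0), (2, 50.0)], ["product_id", "amount"]) \
    .createOrReplaceTempView("fact_sales")
spark.createDataFrame([(1, "widget"), (2, "gadget")], ["product_id", "product_name"]) \
    .createOrReplaceTempView("dim_product")

# Option 1: let the optimizer decide. Tables whose statistics put them below this
# many bytes are broadcast automatically (set it to -1 to disable the behavior).
spark.sql("SET spark.sql.autoBroadcastJoinThreshold = 52428800")  # 50 MB, placeholder value

# Option 2: force it for a single query with the broadcast hint.
spark.sql("""
    EXPLAIN
    SELECT /*+ BROADCAST(d) */ f.*, d.product_name
    FROM fact_sales f
    JOIN dim_product d ON f.product_id = d.product_id
""").show(truncate=False)  # a BroadcastHashJoin in the plan confirms the broadcast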

Kathleen

Sent from my iPhone

> On Oct 3, 2018, at 4:37 PM, kant kodali  wrote:
> 
> Hi All,
> 
> How to do a broadcast join using raw Spark SQL 2.3.1 or 2.3.2? 
> 
> Thanks
> 




How to do a broadcast join using raw Spark SQL 2.3.1 or 2.3.2?

2018-10-03 Thread kant kodali
Hi All,

How to do a broadcast join using raw Spark SQL 2.3.1 or 2.3.2?

Thanks


Re: Back to SQL

2018-10-03 Thread Reynold Xin
No, we used to have that (for views), but it wasn't working well enough, so we
removed it.

On Wed, Oct 3, 2018 at 6:41 PM Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> Is there any known way to go from a Spark SQL logical plan (optimised?)
> back to a SQL query?
>
> Regards,
>
> Olivier.
>
-- 
--
excuse the brevity and lower case due to wrist injury


Back to SQL

2018-10-03 Thread Olivier Girardot
Hi everyone,
Is there any known way to go from a Spark SQL logical plan (optimised?)
back to a SQL query?

Regards,

Olivier.


Restarting a failed Spark streaming job running on top of a yarn cluster

2018-10-03 Thread jcgarciam
Hi Folks,

We have a few Spark streaming jobs running on a YARN cluster, and from
time to time a job needs to be restarted (for example, it was killed for an
external reason).

Once we submit the new job, we are faced with the following exception:
 ERROR spark.SparkContext: Failed to add
/mnt/data1/yarn/nm/usercache/spark/appcache/*application_1537885048149_15382*/container_e82_1537885048149_15382_01_01/__app__.jar
to Spark environment
java.io.FileNotFoundException: Jar
/mnt/data1/yarn/nm/usercache/spark/appcache/application_1537885048149_15382/container_e82_1537885048149_15382_01_01/__app__.jar
not found
at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1807)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1835)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:457)

Of course we know that *application_1537885048149_15382* corresponds to the
previous job that was killed, and that our YARN is cleaning up the usercache
directory very often to avoid choking the filesystem with so many unused
files.

However, what can you recommend for long-running jobs that have to be
restarted when the previous context is not available due to the cleanup?


I hope it is clear what I meant; if you need more information, just ask.

Thanks

JC




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




Re: How to read remote HDFS from Spark using username?

2018-10-03 Thread Aakash Basu
If so, how do we update/fix the firewall issue?

On Wed, Oct 3, 2018 at 1:14 PM Jörn Franke  wrote:

> Looks like a firewall issue

Re: How to read remote HDFS from Spark using username?

2018-10-03 Thread Jörn Franke
Looks like a firewall issue


Re: How to read remote HDFS from Spark using username?

2018-10-03 Thread Aakash Basu
The stacktrace is below -

---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 df = spark.read.load("hdfs://
> 35.154.242.76:9000/auto-ml/projects/auto-ml-test__8503cdc4-21fc-4fae-87c1-5b879cafff71/data/breast-cancer-wisconsin.csv
> ")
> /opt/spark/python/pyspark/sql/readwriter.py in load(self, path, format,
> schema, **options)
>  164 self.options(**options)
>  165 if isinstance(path, basestring):
> --> 166 return self._df(self._jreader.load(path))
>  167 elif path is not None:
>  168 if type(path) != list:
> /opt/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py in
> __call__(self, *args)
>  1158 answer = self.gateway_client.send_command(command)
>  1159 return_value = get_return_value(
> -> 1160 answer, self.gateway_client, self.target_id, self.name)
>  1161
>  1162 for temp_arg in temp_args:
> /opt/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /opt/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py in
> get_return_value(answer, gateway_client, target_id, name)
>  318 raise Py4JJavaError(
>  319 "An error occurred while calling {0}{1}{2}.\n".
> --> 320 format(target_id, ".", name), value)
>  321 else:
>  322 raise Py4JError(
> Py4JJavaError: An error occurred while calling o244.load.
> : java.net.ConnectException: Call From Sandeeps-MacBook-Pro.local/
> 192.168.50.188 to ec2-35-154-242-76.ap-south-1.compute.amazonaws.com:9000
> failed on connection exception: java.net.ConnectException: Connection
> refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
> at org.apache.hadoop.ipc.Client.call(Client.java:1479)
> at org.apache.hadoop.ipc.Client.call(Client.java:1412)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
> at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
> at
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:714)
> at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
> at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.immutable.List.flatMap(List.scala:344)
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:388)
>  at
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> 

How to read remote HDFS from Spark using username?

2018-10-03 Thread Aakash Basu
Hi,

I have to read data stored in the HDFS of a different machine, and it needs
to be accessed through Spark.

How do I do that? The full HDFS address along with the port doesn't seem to work.

Has anyone done this before?

Thanks,
AB.
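
For reference, a minimal PySpark sketch of reading from a remote HDFS as a specific
user, assuming the cluster uses Hadoop's simple authentication (no Kerberos); the
hostname, port, path, and username below are placeholders:

import os
from pyspark.sql import SparkSession

# With simple authentication, HDFS trusts the client-supplied user name.
# It has to be visible to the driver JVM, e.g. exported in the shell before
# launching, or set here before the SparkSession (and its JVM) is created.
os.environ["HADOOP_USER_NAME"] = "hdfs-user"  # placeholder username

spark = SparkSession.builder.appName("remote-hdfs-read").getOrCreate()

# Use the full namenode URI (host plus RPC port, commonly 8020 or 9000) and the file path.
df = spark.read.csv("hdfs://namenode.example.com:9000/data/input.csv", header=True)
df.show(5)

If this still fails with "Connection refused", as in the stack trace earlier in this
thread, the user name is not the problem; the namenode's RPC port has to be reachable
from the client machine first.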