Re: PySpark 2.1 Not instantiating properly

2017-10-20 Thread Marco Mistroni
Hello
I believe I followed the instructions here to get Spark to work on Windows.
The article refers to Win7, but it will work for Win10 as well:

http://nishutayaltech.blogspot.co.uk/2015/04/how-to-run-apache-spark-on-windows7-in.html

Jagat posted a similar link on winutils... I believe it says much the same as here:
1- download winutils and place it somewhere in your file system
2- in your environment settings, set HADOOP_HOME=<the folder where you placed winutils>


This should get you sorted.
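In code terms, something like the sketch below is the whole setup (just an illustration: the D:\winutils path and the app name are made-up examples, adjust to wherever you unpacked it):

  import os

  # Assumption: winutils.exe was unpacked to D:\winutils\bin
  os.environ["HADOOP_HOME"] = r"D:\winutils"
  os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]

  # set the environment before pyspark launches the JVM
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("windows-sanity-check").getOrCreate()
  spark.range(5).show()   # quick check that the session actually comes up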
Btw, I got the impression, from what I have seen, that Spark and Windows
aren't best friends. You'd be better off getting a Docker container and running Spark
off that container...

hth
 marco







On Fri, Oct 20, 2017 at 5:57 PM, Aakash Basu 
wrote:

> Hey Marco/Jagat,
>
> As I informed you earlier, I've already done those basic checks and
> permission changes.
>
> e.g. D:\winutils\bin\winutils.exe chmod 777 D:\tmp\hive, but to no avail.
> It still throws the same error. In the first place, I do not understand
> how the permissions changed automatically, without any manual change.
>
> To Jagat's question - "Do you have the winutils build relevant to your
> system?" - how do I check that? I did not find a winutils build specific to
> OS/bitness.
>
> Any other solutions? Should I download a fresh zip of Spark and redo all
> the configuration steps? The chmod just isn't taking effect (the command
> itself runs without errors).
>
>
> Thanks,
> Aakash.
>
> On Fri, Oct 20, 2017 at 9:53 PM, Jagat Singh  wrote:
>
>> Do you have the winutils build relevant to your system?
>>
>> This SO post has related information: https://stackoverflow.com/questions/34196302/the-root-scratch-dir-tmp-hive-on-hdfs-should-be-writable-current-permissions
>>
>>
>>
>> On 21 October 2017 at 03:16, Marco Mistroni  wrote:
>>
>>> Did you build Spark or download the zip?
>>> I remember having a similar issue... either you have to give write permission to
>>> your /tmp directory or there's a Spark config you need to override.
>>> This error is not 2.1 specific... let me get home and check my configs.
>>> I think I amended my /tmp permissions via xterm instead of the control panel.
>>>
>>> Hth
>>>  Marco
>>>
>>>
>>> On Oct 20, 2017 8:31 AM, "Aakash Basu" 
>>> wrote:
>>>
>>> Hi all,
>>>
>>> I have Spark 2.1 installed in my laptop where I used to run all my
>>> programs. PySpark wasn't used for around 1 month, and after starting it
>>> now, I'm getting this exception (I've tried the solutions I could find on
>>> Google, but to no avail).
>>>
>>> Specs: Spark 2.1.1, Python 3.6, HADOOP 2.7, Windows 10 Pro, 64 Bits.
>>>
>>>
>>> py4j.protocol.Py4JJavaError: An error occurred while calling
>>> o27.sessionState.
>>> : java.lang.IllegalArgumentException: Error while instantiating
>>> 'org.apache.spark.sql.hive.HiveSessionState':
>>> at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$Spar
>>> kSession$$reflect(SparkSession.scala:981)
>>> at org.apache.spark.sql.SparkSession.sessionState$lzycompute(Sp
>>> arkSession.scala:110)
>>> at org.apache.spark.sql.SparkSession.sessionState(SparkSession.
>>> scala:109)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>>> ssorImpl.java:62)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>>> thodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.jav
>>> a:357)
>>> at py4j.Gateway.invoke(Gateway.java:280)
>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.j
>>> ava:132)
>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>> at py4j.GatewayConnection.run(GatewayConnection.java:214)
>>> at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> Method)
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Native
>>> ConstructorAccessorImpl.java:62)
>>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(De
>>> legatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:4
>>> 23)
>>> at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$Spar
>>> kSession$$reflect(SparkSession.scala:978)
>>> ... 13 more
>>> Caused by: java.lang.IllegalArgumentException: Error while
>>> instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
>>> at org.apache.spark.sql.internal.SharedState$.org$apache$spark$
>>> sql$internal$SharedState$$reflect(SharedState.scala:169)
>>> at 

Re: PySpark 2.1 Not instantiating properly

2017-10-20 Thread Aakash Basu
Hey Marco/Jagat,

As I informed you earlier, I've already done those basic checks and
permission changes.

e.g. D:\winutils\bin\winutils.exe chmod 777 D:\tmp\hive, but to no avail. It
still throws the same error. In the first place, I do not understand how the
permissions changed automatically, without any manual change.

To Jagat's question - "Do you have the winutils build relevant to your
system?" - how do I check that? I did not find a winutils build specific to
OS/bitness.

Any other solutions? Should I download a fresh zip of Spark and redo all the
configuration steps? The chmod just isn't taking effect (the command itself
runs without errors).
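(In case it helps, this is how I've been checking whether the chmod took effect; just a sketch, and I'm assuming winutils also offers an ls subcommand:)

  import subprocess
  # list the directory with winutils itself to see what permissions it reports
  out = subprocess.check_output([r"D:\winutils\bin\winutils.exe", "ls", r"D:\tmp\hive"])
  print(out.decode())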


Thanks,
Aakash.

On Fri, Oct 20, 2017 at 9:53 PM, Jagat Singh  wrote:

> Do you have the winutils build relevant to your system?
>
> This SO post has related information: https://stackoverflow.com/questions/34196302/the-root-scratch-dir-tmp-hive-on-hdfs-should-be-writable-current-permissions
>
>
>
> On 21 October 2017 at 03:16, Marco Mistroni  wrote:
>
>> Did you build Spark or download the zip?
>> I remember having a similar issue... either you have to give write permission to
>> your /tmp directory or there's a Spark config you need to override.
>> This error is not 2.1 specific... let me get home and check my configs.
>> I think I amended my /tmp permissions via xterm instead of the control panel.
>>
>> Hth
>>  Marco
>>
>>
>> On Oct 20, 2017 8:31 AM, "Aakash Basu" 
>> wrote:
>>
>> Hi all,
>>
>> I have Spark 2.1 installed in my laptop where I used to run all my
>> programs. PySpark wasn't used for around 1 month, and after starting it
>> now, I'm getting this exception (I've tried the solutions I could find on
>> Google, but to no avail).
>>
>> Specs: Spark 2.1.1, Python 3.6, HADOOP 2.7, Windows 10 Pro, 64 Bits.
>>
>>
>> py4j.protocol.Py4JJavaError: An error occurred while calling
>> o27.sessionState.
>> : java.lang.IllegalArgumentException: Error while instantiating
>> 'org.apache.spark.sql.hive.HiveSessionState':
>> at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$Spar
>> kSession$$reflect(SparkSession.scala:981)
>> at org.apache.spark.sql.SparkSession.sessionState$lzycompute(Sp
>> arkSession.scala:110)
>> at org.apache.spark.sql.SparkSession.sessionState(SparkSession.
>> scala:109)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>> ssorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>> thodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.jav
>> a:357)
>> at py4j.Gateway.invoke(Gateway.java:280)
>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.j
>> ava:132)
>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>> at py4j.GatewayConnection.run(GatewayConnection.java:214)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.reflect.InvocationTargetException
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Native
>> ConstructorAccessorImpl.java:62)
>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(De
>> legatingConstructorAccessorImpl.java:45)
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:4
>> 23)
>> at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$Spar
>> kSession$$reflect(SparkSession.scala:978)
>> ... 13 more
>> Caused by: java.lang.IllegalArgumentException: Error while instantiating
>> 'org.apache.spark.sql.hive.HiveExternalCatalog':
>> at org.apache.spark.sql.internal.SharedState$.org$apache$spark$
>> sql$internal$SharedState$$reflect(SharedState.scala:169)
>> at org.apache.spark.sql.internal.SharedState.<init>(SharedState
>> .scala:86)
>> at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.app
>> ly(SparkSession.scala:101)
>> at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.app
>> ly(SparkSession.scala:101)
>> at scala.Option.getOrElse(Option.scala:121)
>> at org.apache.spark.sql.SparkSession.sharedState$lzycompute(Spa
>> rkSession.scala:101)
>> at org.apache.spark.sql.SparkSession.sharedState(SparkSession.s
>> cala:100)
>> at org.apache.spark.sql.internal.SessionState.<init>(SessionSta
>> te.scala:157)
>> at org.apache.spark.sql.hive.HiveSessionState.<init>(HiveSessio
>> nState.scala:32)
>> ... 18 more
>> Caused by: java.lang.reflect.InvocationTargetException
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>> at 

Re: Write to HDFS

2017-10-20 Thread Deepak Sharma
Better to use coalesce instead of repartition.
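For example (the thread's snippet is Scala, but the same idea in PySpark terms; the HDFS path is the one from the thread):

  # coalesce(1) collapses the existing partitions without a full shuffle,
  # so the output directory ends up with a single part file
  counts.coalesce(1).saveAsTextFile("hdfs://master:8020/user/abc")

  # repartition(1) gives the same single file but forces a shuffle first,
  # which is usually more expensive for this case:
  # counts.repartition(1).saveAsTextFile("hdfs://master:8020/user/abc")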

On Fri, Oct 20, 2017 at 9:47 PM, Marco Mistroni  wrote:

> Use  counts.repartition(1).save..
> Hth
>
>
> On Oct 20, 2017 3:01 PM, "Uğur Sopaoğlu"  wrote:
>
> Actually, when I run the following code,
>
>   val textFile = sc.textFile("Sample.txt")
>   val counts = textFile.flatMap(line => line.split(" "))
>  .map(word => (word, 1))
>  .reduceByKey(_ + _)
>
>
> It saves the results into more than one partition file, like part-0 and
> part-1. I want to collect all of them into one file.
>
>
> 2017-10-20 16:43 GMT+03:00 Marco Mistroni :
>
>> Hi
>>  Could you just create an rdd/df out of what you want to save and store
>> it in hdfs?
>> Hth
>>
>> On Oct 20, 2017 9:44 AM, "Uğur Sopaoğlu"  wrote:
>>
>>> Hi all,
>>>
>>> In word count example,
>>>
>>>   val textFile = sc.textFile("Sample.txt")
>>>   val counts = textFile.flatMap(line => line.split(" "))
>>>  .map(word => (word, 1))
>>>  .reduceByKey(_ + _)
>>>  counts.saveAsTextFile("hdfs://master:8020/user/abc")
>>>
>>> I want to write the collection "*counts*", which is used in the code above,
>>> to HDFS, so
>>>
>>> val x = counts.collect()
>>>
>>> Actually I want to write *x* to HDFS. But Spark wants an RDD to write
>>> something to HDFS.
>>>
>>> How can I write an Array[(String,Int)] to HDFS?
>>>
>>>
>>> --
>>> Uğur
>>>
>>
>
>
> --
> Uğur Sopaoğlu
>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: PySpark 2.1 Not instantiating properly

2017-10-20 Thread Jagat Singh
Do you have the winutils build relevant to your system?

This SO post has related information:
https://stackoverflow.com/questions/34196302/the-root-scratch-dir-tmp-hive-on-hdfs-should-be-writable-current-permissions
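A quick way to sanity-check that from the PySpark side (just a sketch):

  import os

  hadoop_home = os.environ.get("HADOOP_HOME", "")
  winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
  print("HADOOP_HOME :", hadoop_home or "<not set>")
  print("winutils.exe:", "found" if os.path.exists(winutils) else "missing")
  # the binary also has to match the Hadoop build Spark expects (2.7 here) and a
  # 64-bit OS; a mismatched winutils can fail with similar permission errors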



On 21 October 2017 at 03:16, Marco Mistroni  wrote:

> Did you build Spark or download the zip?
> I remember having a similar issue... either you have to give write permission to
> your /tmp directory or there's a Spark config you need to override.
> This error is not 2.1 specific... let me get home and check my configs.
> I think I amended my /tmp permissions via xterm instead of the control panel.
>
> Hth
>  Marco
>
>
> On Oct 20, 2017 8:31 AM, "Aakash Basu"  wrote:
>
> Hi all,
>
> I have Spark 2.1 installed in my laptop where I used to run all my
> programs. PySpark wasn't used for around 1 month, and after starting it
> now, I'm getting this exception (I've tried the solutions I could find on
> Google, but to no avail).
>
> Specs: Spark 2.1.1, Python 3.6, HADOOP 2.7, Windows 10 Pro, 64 Bits.
>
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o27.sessionState.
> : java.lang.IllegalArgumentException: Error while instantiating
> 'org.apache.spark.sql.hive.HiveSessionState':
> at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$
> SparkSession$$reflect(SparkSession.scala:981)
> at org.apache.spark.sql.SparkSession.sessionState$lzycompute(
> SparkSession.scala:110)
> at org.apache.spark.sql.SparkSession.sessionState(SparkSession.
> scala:109)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
> ssorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
> thodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.
> java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.j
> ava:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Native
> ConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(De
> legatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$
> SparkSession$$reflect(SparkSession.scala:978)
> ... 13 more
> Caused by: java.lang.IllegalArgumentException: Error while instantiating
> 'org.apache.spark.sql.hive.HiveExternalCatalog':
> at org.apache.spark.sql.internal.SharedState$.org$apache$spark$
> sql$internal$SharedState$$reflect(SharedState.scala:169)
> at org.apache.spark.sql.internal.SharedState.<init>(SharedState
> .scala:86)
> at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.
> apply(SparkSession.scala:101)
> at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.
> apply(SparkSession.scala:101)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.SparkSession.sharedState$lzycompute(
> SparkSession.scala:101)
> at org.apache.spark.sql.SparkSession.sharedState(SparkSession.
> scala:100)
> at org.apache.spark.sql.internal.SessionState.<init>(SessionSta
> te.scala:157)
> at org.apache.spark.sql.hive.HiveSessionState.<init>(HiveSessio
> nState.scala:32)
> ... 18 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Native
> ConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(De
> legatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.spark.sql.internal.SharedState$.org$apache$spark$
> sql$internal$SharedState$$reflect(SharedState.scala:166)
> ... 26 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Native
> ConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(De
> legatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-20 Thread lucas.g...@gmail.com
Right, that makes sense and I understood that.

The thing I'm wondering about (and I think the answer is 'no' at this
stage) is the following.

When the optimizer is running and pushing predicates down, does it take
into account indexing and other storage-layer strategies when determining
which predicates are processed in memory and which predicates are pushed to
storage?
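To make that concrete, I mean something like the sketch below (connection details, table name and bounds are invented placeholders): which of those filters end up under PushedFilters when I call explain()?

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("pushdown-check").getOrCreate()

  # placeholder JDBC source; the partitioning options mirror the numPartitions
  # discussion further down in the thread
  audits = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://dbhost:3306/appdb")
            .option("dbtable", "audits")
            .option("user", "someuser").option("password", "somepassword")
            .option("partitionColumn", "id")
            .option("lowerBound", "1").option("upperBound", "1000000")
            .option("numPartitions", "4")
            .load())

  filtered = (audits
              .filter(F.col("action") == "action")              # simple equality
              .filter(F.col("audited_changes").like("---%")))   # LIKE pattern

  # the physical plan lists PushedFilters (sent to MySQL) separately from
  # the Filter step that Spark evaluates in memory
  filtered.explain(True)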

Thanks!

Gary Lucas


On 20 October 2017 at 07:32, Mich Talebzadeh 
wrote:

> See here below, Gary:
>
> filtered_df = spark.hiveContext.sql("""
> SELECT
> *
> FROM
> df
> WHERE
> type = 'type'
> AND action = 'action'
> AND audited_changes LIKE '---\ncompany_id:\n- %'
> """)
> filtered_audits.registerTempTable("filtered_df")
>
>
> you are using hql to read data from your temporary table "df" and then
> creating a temporary table on the subset of that temptable "df".
>
> What is the purpose of it?
>
> Once you are within Spark itself, the data has already been read in. Granted,
> the indexes on the RDBMS help when reading data through the JDBC connection,
> but they do not play any role later when running the SQL in hql.
>
> Does that make sense?
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 20 October 2017 at 00:04, lucas.g...@gmail.com 
> wrote:
>
>> Ok, so when Spark is forming queries it's ignorant of the underlying
>> storage layer index.
>>
>> If there is an index on a table Spark doesn't take that into account when
>> doing the predicate push-down in optimization. In that case, why does Spark
>> push 2 of my conditions (where fieldx = 'action') to the database but then
>> do the LIKE in memory? Is it just that straightforward LIKEs are done in
>> memory and simple equalities are pushed to the storage layer?
>>
>> remember your indexes are in RDBMS
>>
>>
>> Exactly what I'm asking about: when Spark issues the query via the JDBC
>> reader, is that query ignorant of the underlying indexes or not? How
>> does Spark determine which predicates to perform in the RDD and which
>> predicates to execute in the storage layer?  I guess I should just dig out
>> the JDBC data-frame reader code and see if I can make sense of that?  Or is
>> the predicate push-down stage independent of the readers?
>>
>> Thanks for helping me form a more accurate question!
>>
>> Gary
>>
>>
>>
>> On 19 October 2017 at 15:46, Mich Talebzadeh 
>> wrote:
>>
>>> remember your indexes are in RDBMS. In this case MySQL. When you are
>>> reading from that table you have an 'id' column which I assume is an
>>> integer and you are making parallel threads through JDBC connection to that
>>> table. You can see the threads in MySQL if you query it. You can see
>>> multiple threads. You stated numPartitions but MySQL will decide how many
>>> parallel threads it can handle.
>>>
>>> So data is read into Spark as RDDs and you can see that through the Spark GUI
>>> (port 4040 by default). Then you create a DataFrame (DF) and convert it
>>> into a tempTable. tempTable will not have any indexes. This is happening in
>>> Spark space not MySQL. Once you start reading in your query and collect
>>> data then it will try to cache data in Spark memory. You can see this again
>>> through Spark GUI. You can see the optimizer by using explain() function.
>>> You will see that no index is used.
>>>
>>> Spark uses distributed data in memory to optimize the work. It does not
>>> use any index. In RDBMS an index is an ordered set of column or columns
>>> stored on the disk in B-tree format to improve the query where needed.
>>> Spark tempTable does not follow that method. So in summary your tempTable
>>> will benefit from more executors and memory if you want to improve the
>>> query performance.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 19 October 2017 

Re: Write to HDFS

2017-10-20 Thread Marco Mistroni
Use  counts.repartition(1).save..
Hth

On Oct 20, 2017 3:01 PM, "Uğur Sopaoğlu"  wrote:

Actually, when I run the following code,

  val textFile = sc.textFile("Sample.txt")
  val counts = textFile.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey(_ + _)


It saves the results into more than one partition file, like part-0 and
part-1. I want to collect all of them into one file.


2017-10-20 16:43 GMT+03:00 Marco Mistroni :

> Hi
>  Could you just create an rdd/df out of what you want to save and store it
> in hdfs?
> Hth
>
> On Oct 20, 2017 9:44 AM, "Uğur Sopaoğlu"  wrote:
>
>> Hi all,
>>
>> In word count example,
>>
>>   val textFile = sc.textFile("Sample.txt")
>>   val counts = textFile.flatMap(line => line.split(" "))
>>  .map(word => (word, 1))
>>  .reduceByKey(_ + _)
>>  counts.saveAsTextFile("hdfs://master:8020/user/abc")
>>
>> I want to write the collection "*counts*", which is used in the code above,
>> to HDFS, so
>>
>> val x = counts.collect()
>>
>> Actually I want to write *x* to HDFS. But Spark wants an RDD to write
>> something to HDFS.
>>
>> How can I write an Array[(String,Int)] to HDFS?
>>
>>
>> --
>> Uğur
>>
>


-- 
Uğur Sopaoğlu


Re: PySpark 2.1 Not instantiating properly

2017-10-20 Thread Marco Mistroni
Did you build Spark or download the zip?
I remember having a similar issue... either you have to give write permission to
your /tmp directory or there's a Spark config you need to override.
This error is not 2.1 specific... let me get home and check my configs.
I think I amended my /tmp permissions via xterm instead of the control panel.
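If it turns out to be the config route, it was something along these lines; a sketch from memory, so the exact keys and paths are worth double-checking:

  from pyspark.sql import SparkSession

  # point the SQL warehouse at a folder you can definitely write to; the
  # /tmp/hive scratch dir itself may still need winutils chmod, or a
  # hive.exec.scratchdir override in hive-site.xml, depending on the setup
  spark = (SparkSession.builder
           .config("spark.sql.warehouse.dir", "file:///D:/tmp/spark-warehouse")
           .enableHiveSupport()
           .getOrCreate())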

Hth
 Marco


On Oct 20, 2017 8:31 AM, "Aakash Basu"  wrote:

Hi all,

I have Spark 2.1 installed in my laptop where I used to run all my
programs. PySpark wasn't used for around 1 month, and after starting it
now, I'm getting this exception (I've tried the solutions I could find on
Google, but to no avail).

Specs: Spark 2.1.1, Python 3.6, HADOOP 2.7, Windows 10 Pro, 64 Bits.


py4j.protocol.Py4JJavaError: An error occurred while calling
o27.sessionState.
: java.lang.IllegalArgumentException: Error while instantiating
'org.apache.spark.sql.hive.HiveSessionState':
at org.apache.spark.sql.SparkSession$.org$apache$
spark$sql$SparkSession$$reflect(SparkSession.scala:981)
at org.apache.spark.sql.SparkSession.sessionState$
lzycompute(SparkSession.scala:110)
at org.apache.spark.sql.SparkSession.sessionState(
SparkSession.scala:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(
ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.
java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(
NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.SparkSession$.org$apache$
spark$sql$SparkSession$$reflect(SparkSession.scala:978)
... 13 more
Caused by: java.lang.IllegalArgumentException: Error while instantiating
'org.apache.spark.sql.hive.HiveExternalCatalog':
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$
sql$internal$SharedState$$reflect(SharedState.scala:169)
at org.apache.spark.sql.internal.SharedState.<init>(
SharedState.scala:86)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(
SparkSession.scala:101)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(
SparkSession.scala:101)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession.sharedState$
lzycompute(SparkSession.scala:101)
at org.apache.spark.sql.SparkSession.sharedState(
SparkSession.scala:100)
at org.apache.spark.sql.internal.SessionState.<init>(
SessionState.scala:157)
at org.apache.spark.sql.hive.HiveSessionState.<init>(
HiveSessionState.scala:32)
... 18 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(
NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$
sql$internal$SharedState$$reflect(SharedState.scala:166)
... 26 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(
NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.
createClient(IsolatedClientLoader.scala:264)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(
HiveUtils.scala:358)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(
HiveUtils.scala:262)
at org.apache.spark.sql.hive.HiveExternalCatalog.<init>(
HiveExternalCatalog.scala:66)
... 31 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root
scratch dir: 

Fwd: PySpark 2.1 Not instantiating properly

2017-10-20 Thread Aakash Basu
Hi,

Any help please? What can be the issue?

Thanks,
Aakash.
-- Forwarded message --
From: Aakash Basu 
Date: Fri, Oct 20, 2017 at 1:00 PM
Subject: PySpark 2.1 Not instantiating properly
To: user 


Hi all,

I have Spark 2.1 installed in my laptop where I used to run all my
programs. PySpark wasn't used for around 1 month, and after starting it
now, I'm getting this exception (I've tried the solutions I could find on
Google, but to no avail).

Specs: Spark 2.1.1, Python 3.6, HADOOP 2.7, Windows 10 Pro, 64 Bits.


py4j.protocol.Py4JJavaError: An error occurred while calling
o27.sessionState.
: java.lang.IllegalArgumentException: Error while instantiating
'org.apache.spark.sql.hive.HiveSessionState':
at org.apache.spark.sql.SparkSession$.org$apache$
spark$sql$SparkSession$$reflect(SparkSession.scala:981)
at org.apache.spark.sql.SparkSession.sessionState$
lzycompute(SparkSession.scala:110)
at org.apache.spark.sql.SparkSession.sessionState(
SparkSession.scala:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(
ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.
java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(
NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.SparkSession$.org$apache$
spark$sql$SparkSession$$reflect(SparkSession.scala:978)
... 13 more
Caused by: java.lang.IllegalArgumentException: Error while instantiating
'org.apache.spark.sql.hive.HiveExternalCatalog':
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$
sql$internal$SharedState$$reflect(SharedState.scala:169)
at org.apache.spark.sql.internal.SharedState.<init>(
SharedState.scala:86)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(
SparkSession.scala:101)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(
SparkSession.scala:101)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession.sharedState$
lzycompute(SparkSession.scala:101)
at org.apache.spark.sql.SparkSession.sharedState(
SparkSession.scala:100)
at org.apache.spark.sql.internal.SessionState.<init>(
SessionState.scala:157)
at org.apache.spark.sql.hive.HiveSessionState.<init>(
HiveSessionState.scala:32)
... 18 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(
NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$
sql$internal$SharedState$$reflect(SharedState.scala:166)
... 26 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(
NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.
createClient(IsolatedClientLoader.scala:264)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(
HiveUtils.scala:358)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(
HiveUtils.scala:262)
at org.apache.spark.sql.hive.HiveExternalCatalog.<init>(
HiveExternalCatalog.scala:66)
... 31 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root
scratch dir: /tmp/hive on HDFS should be writable. Current permissions are:
rw-rw-rw-
at org.apache.hadoop.hive.ql.session.SessionState.start(

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-20 Thread Mich Talebzadeh
See here below, Gary:

filtered_df = spark.hiveContext.sql("""
SELECT
*
FROM
df
WHERE
type = 'type'
AND action = 'action'
AND audited_changes LIKE '---\ncompany_id:\n- %'
""")
filtered_audits.registerTempTable("filtered_df")


you are using hql to read data from your temporary table "df" and then
creating a temporary table on the subset of that temptable "df".

What is the purpose of it?

Once you are within Spark itself, the data has already been read in. Granted,
the indexes on the RDBMS help when reading data through the JDBC connection,
but they do not play any role later when running the SQL in hql.

Does that make sense?


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 20 October 2017 at 00:04, lucas.g...@gmail.com 
wrote:

> Ok, so when Spark is forming queries it's ignorant of the underlying
> storage layer index.
>
> If there is an index on a table Spark doesn't take that into account when
> doing the predicate push-down in optimization. In that case, why does Spark
> push 2 of my conditions (where fieldx = 'action') to the database but then
> do the LIKE in memory? Is it just that straightforward LIKEs are done in
> memory and simple equalities are pushed to the storage layer?
>
> remember your indexes are in RDBMS
>
>
> Exactly what I'm asking about: when Spark issues the query via the JDBC
> reader, is that query ignorant of the underlying indexes or not? How
> does Spark determine which predicates to perform in the RDD and which
> predicates to execute in the storage layer?  I guess I should just dig out
> the JDBC data-frame reader code and see if I can make sense of that?  Or is
> the predicate push-down stage independent of the readers?
>
> Thanks for helping me form a more accurate question!
>
> Gary
>
>
>
> On 19 October 2017 at 15:46, Mich Talebzadeh 
> wrote:
>
>> remember your indexes are in RDBMS. In this case MySQL. When you are
>> reading from that table you have an 'id' column which I assume is an
>> integer and you are making parallel threads through JDBC connection to that
>> table. You can see the threads in MySQL if you query it. You can see
>> multiple threads. You stated numPartitions but MySQL will decide how many
>> parallel threads it can handle.
>>
>> So data is read into Spark as RDDs and you can see that through the Spark GUI
>> (port 4040 by default). Then you create a DataFrame (DF) and convert it
>> into a tempTable. tempTable will not have any indexes. This is happening in
>> Spark space not MySQL. Once you start reading in your query and collect
>> data then it will try to cache data in Spark memory. You can see this again
>> through Spark GUI. You can see the optimizer by using explain() function.
>> You will see that no index is used.
>>
>> Spark uses distributed data in memory to optimize the work. It does not
>> use any index. In RDBMS an index is an ordered set of column or columns
>> stored on the disk in B-tree format to improve the query where needed.
>> Spark tempTable does not follow that method. So in summary your tempTable
>> will benefit from more executors and memory if you want to improve the
>> query performance.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 19 October 2017 at 23:29, lucas.g...@gmail.com 
>> wrote:
>>
>>> If the underlying table(s) have indexes on them.  Does spark use those
>>> indexes to optimize the query?
>>>
>>> IE if I had a table in my JDBC data source (mysql in this case) had
>>> several indexes and my query was filtering on one of the fields with an
>>> index.  Would spark know to push that predicate to the database or is the
>>> predicate push-down ignorant of the underlying storage layer details.
>>>
>>> Apologies if that still doesn't adequately explain my question.
>>>
>>> Gary Lucas
>>>
>>> On 19 October 2017 at 15:19, Mich 

Re: Write to HDFS

2017-10-20 Thread Uğur Sopaoğlu
Actually, when I run the following code,

  val textFile = sc.textFile("Sample.txt")
  val counts = textFile.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey(_ + _)


It saves the results into more than one partition file, like part-0 and
part-1. I want to collect all of them into one file.


2017-10-20 16:43 GMT+03:00 Marco Mistroni :

> Hi
>  Could you just create an rdd/df out of what you want to save and store it
> in hdfs?
> Hth
>
> On Oct 20, 2017 9:44 AM, "Uğur Sopaoğlu"  wrote:
>
>> Hi all,
>>
>> In word count example,
>>
>>   val textFile = sc.textFile("Sample.txt")
>>   val counts = textFile.flatMap(line => line.split(" "))
>>  .map(word => (word, 1))
>>  .reduceByKey(_ + _)
>>  counts.saveAsTextFile("hdfs://master:8020/user/abc")
>>
>> I want to write the collection "*counts*", which is used in the code above,
>> to HDFS, so
>>
>> val x = counts.collect()
>>
>> Actually I want to write *x* to HDFS. But Spark wants an RDD to write
>> something to HDFS.
>>
>> How can I write an Array[(String,Int)] to HDFS?
>>
>>
>> --
>> Uğur
>>
>


-- 
Uğur Sopaoğlu


Re: Write to HDFS

2017-10-20 Thread Marco Mistroni
Hi
 Could you just create an rdd/df out of what you want to save and store it
in hdfs?
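Roughly like this, in PySpark terms (the thread's code is Scala, and the output path here is just an example):

  x = counts.collect()          # the (word, count) pairs on the driver
  sc.parallelize(x, 1).saveAsTextFile("hdfs://master:8020/user/abc_from_driver")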
Hth

On Oct 20, 2017 9:44 AM, "Uğur Sopaoğlu"  wrote:

> Hi all,
>
> In word count example,
>
>   val textFile = sc.textFile("Sample.txt")
>   val counts = textFile.flatMap(line => line.split(" "))
>  .map(word => (word, 1))
>  .reduceByKey(_ + _)
>  counts.saveAsTextFile("hdfs://master:8020/user/abc")
>
> I want to write the collection "*counts*", which is used in the code above,
> to HDFS, so
>
> val x = counts.collect()
>
> Actually I want to write *x* to HDFS. But Spark wants an RDD to write
> something to HDFS.
>
> How can I write an Array[(String,Int)] to HDFS?
>
>
> --
> Uğur
>


Re: Is Spark suited for this use case?

2017-10-20 Thread JG Perrin
I have seen a similar scenario where we load data from an RDBMS into a NoSQL
database… Spark made sense for velocity and parallel processing (and cost of 
licenses :) ).
 
> On Oct 15, 2017, at 21:29, Saravanan Thirumalai 
>  wrote:
> 
> We are an investment firm and have an MDM platform in Oracle at a vendor 
> location and use Oracle GoldenGate to replicate data to our data center for 
> reporting needs. 
> Our data is not big data (total size 6 TB including 2 TB of archive data). 
> Moreover our data doesn't get updated often, nightly once (around 50 MB) and 
> some correction transactions during the day (<10 MB). We don't have external 
> users and hence data doesn't grow real-time like e-commerce.
> 
> When we replicate data from source to target, we transfer data through files. 
> So, if there are DML operations (corrections) during day time on a source 
> table, the corresponding file would have probably 100 lines of table data 
> that needs to be loaded into the target database. Due to low volume of data 
> we designed this through Informatica and this works in less than 2-5 minutes. 
> Can Spark be used in this case or would it be an overkill of technology use?
> 
> 
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Java Rdd of String to dataframe

2017-10-20 Thread JG Perrin
SK,

Have you considered:
Dataset<Row> df = spark.read().json(dfWithStringRowsContainingJson);

jg

> On Oct 11, 2017, at 16:35, sk skk  wrote:
> 
> Can we create a dataframe from a Java pair RDD of String? I don't have a 
> schema as it will be dynamic JSON. I passed the Encoders.STRING class.
> 
> Any help is appreciated !!
> 
> Thanks,
> SK



Re: Prediction using Classification with text attributes in Apache Spark MLLib

2017-10-20 Thread lmk
Trying to improve on the old solution.
Do we have a better text classifier now in Spark MLlib?

Regards,
lmk



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Write to HDFS

2017-10-20 Thread Uğur Sopaoğlu
Hi all,

In word count example,

  val textFile = sc.textFile("Sample.txt")
  val counts = textFile.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey(_ + _)
 counts.saveAsTextFile("hdfs://master:8020/user/abc")

I want to write the collection "*counts*", which is used in the code above,
to HDFS, so

val x = counts.collect()

Actually I want to write *x* to HDFS. But Spark wants an RDD to write
something to HDFS.

How can I write an Array[(String,Int)] to HDFS?


-- 
Uğur


PySpark 2.1 Not instantiating properly

2017-10-20 Thread Aakash Basu
Hi all,

I have Spark 2.1 installed in my laptop where I used to run all my
programs. PySpark wasn't used for around 1 month, and after starting it
now, I'm getting this exception (I've tried the solutions I could find on
Google, but to no avail).

Specs: Spark 2.1.1, Python 3.6, HADOOP 2.7, Windows 10 Pro, 64 Bits.


py4j.protocol.Py4JJavaError: An error occurred while calling
o27.sessionState.
: java.lang.IllegalArgumentException: Error while instantiating
'org.apache.spark.sql.hive.HiveSessionState':
at
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
at
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
at
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
... 13 more
Caused by: java.lang.IllegalArgumentException: Error while instantiating
'org.apache.spark.sql.hive.HiveExternalCatalog':
at
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:169)
at
org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:86)
at
org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
at
org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
at scala.Option.getOrElse(Option.scala:121)
at
org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:101)
at
org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:100)
at
org.apache.spark.sql.internal.SessionState.<init>(SessionState.scala:157)
at
org.apache.spark.sql.hive.HiveSessionState.<init>(HiveSessionState.scala:32)
... 18 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:166)
... 26 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
at
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:358)
at
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
at
org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:66)
... 31 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root
scratch dir: /tmp/hive on HDFS should be writable. Current permissions are:
rw-rw-rw-
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at
org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:188)
... 39 more
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on
HDFS should be writable. Current permissions are: rw-rw-rw-
at

Re: Metadata Management

2017-10-20 Thread Szuromi Tamás
Hi Vasu,

https://github.com/linkedin/WhereHows might be a good fit.

Cheers
Tamas

On 2017. Oct 19., Thu at 23:22, Vasu Gourabathina 
wrote:

> All:
>
> This may be off topic for Spark, but I'm sure several of you might have
> used some form of this as part of your BigData implementations. So, wanted
> to reach out.
>
> As part of the Data Lake and Data Processing (by Spark as an example), we
> might end up with different form factors for the files (via cleanup, enrichment,
> etc.).
>
> In order to make this data available for exploration by analysts and
> data scientists, how do we manage the metadata?
>   - Creating Metadata Repository
>   - Make the schemas available for users, so they may use it to create
> Hive tables, use them by Presto etc.
>
> Can you recommend some patterns (or tools) to help manage the Metadata?
> Trying to reduce the dependency on the engineers and make the
> analysts/scientists as self-sufficient as possible.
>
> Azure and AWS Glue Data Catalog seem to address this. Any inputs on these
> two?
>
> Appreciate in advance.
>
> Thanks,
> Vasu.
>