alternatives for long to longwritable typecasting in spark sql

2017-01-30 Thread Alex
Hi Guys

Please let me know if there are any other ways to do the typecasting, as the code
below throws an error in spark-sql: unable to cast java.lang.Long to LongWritable
(and similarly for Double and for Text). The piece of code below is from a Hive
UDF which I am trying to run in spark-sql.




public Object get(Object name) {
    int pos = getPos((String) name);
    if (pos < 0) return null;
    String f = "string";
    Object obj = list.get(pos);
    if (obj == null) return null;
    // Look up the declared Hive type of this column from its ObjectInspector
    ObjectInspector ins = ((StructField) colnames.get(pos)).getFieldObjectInspector();
    if (ins != null) f = ins.getTypeName();
    // Unwrap the Hadoop Writable into a plain Java value based on the type name
    switch (f) {
        case "double": return ((DoubleWritable) obj).get();
        case "bigint": return ((LongWritable) obj).get();
        case "string": return ((Text) obj).toString();
        default:       return obj;
    }
}


Do both pieces of code below do the same thing? I had to refactor the code to fit spark-sql.

2017-01-30 Thread Alex
public Object get(Object name) {
    int pos = getPos((String) name);
    if (pos < 0) return null;
    String f = "string";
    Object obj = list.get(pos);
    Object result = null;
    if (obj == null) return null;
    ObjectInspector ins = ((StructField) colnames.get(pos)).getFieldObjectInspector();
    if (ins != null) f = ins.getTypeName();

    // Extract the primitive value through the PrimitiveObjectInspector instead of
    // casting the object to a Writable directly
    PrimitiveObjectInspector ins2 = (PrimitiveObjectInspector) ins;
    switch (ins2.getPrimitiveCategory()) {
        case DOUBLE:
            Double res = (Double) (((DoubleObjectInspector) ins2).get(obj));
            result = (double) res;
            return result;

        case LONG:
            Long res1 = (Long) (((LongObjectInspector) ins2).get(obj));
            result = (long) res1;
            return result;

        case STRING:
            result = (((StringObjectInspector) ins2).getPrimitiveJavaObject(obj)).toString();
            return result;

        default:
            result = obj;
            return result;
    }

}




Code 2)


public Object get(Object name) {
    int pos = getPos((String) name);
    if (pos < 0) return null;
    String f = "string";
    Object obj = list.get(pos);
    if (obj == null) return null;
    ObjectInspector ins = ((StructField) colnames.get(pos)).getFieldObjectInspector();
    if (ins != null) f = ins.getTypeName();
    switch (f) {
        case "double": return ((DoubleWritable) obj).get();
        case "bigint": return ((LongWritable) obj).get();
        case "string": return ((Text) obj).toString();
        default:       return obj;
    }
}


But I am getting different results in Hive and Spark.


Spark 2.1.0 and Shapeless

2017-01-30 Thread Timothy Chan
I'm using a library, https://github.com/guardian/scanamo, that uses
shapeless 2.3.2. What are my options if I want to use this with Spark
2.1.0?

Based on this:
http://apache-spark-developers-list.1001551.n3.nabble.com/shapeless-in-spark-2-1-0-tt20392.html

I'm guessing I would have to release my own version of scanamo with a
shaded shapeless?
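
For reference, shading is typically done at assembly time. A minimal sketch of an
sbt-assembly shade rule (the renamed package prefix is illustrative, not scanamo's
actual build configuration):

    // build.sbt (requires the sbt-assembly plugin)
    // Rewrites references to shapeless.* into a private package so the shaded
    // copy does not clash with the shapeless version Spark pulls in.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("shapeless.**" -> "shaded.shapeless.@1").inAll
    )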


RDD unpersisted still showing in my Storage tab UI

2017-01-30 Thread Saulo Ricci
Hi,

I have a Spark Streaming application, and at the end of each batch's processing I
call unpersist on that batch's RDD. But I've noticed the RDDs for all past batches
are still showing in my Spark UI's Storage tab.

Shouldn't I expect to never see those RDDs again in the Storage tab once I have
unpersisted them?
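
In case it helps, a minimal sketch of the blocking variant (the RDD name is
illustrative):

    // Wait until the blocks have actually been removed before returning,
    // which makes it easier to reason about what the Storage tab should show.
    batchRdd.unpersist(blocking = true)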


Thank you,

-- 
Saulo


Re: Tableau BI on Spark SQL

2017-01-30 Thread Todd Nist
Hi Mich,

You could look at http://www.exasol.com/.  It works very well with Tableau
without the need to extract the data.  Also in V6, it has the virtual
schemas which would allow you to access data in Spark, Hive, Oracle, or
other sources.

This may be outside of what you are looking for, but it works well for us.  We did
the extract route originally, but with the native Exasol connector it is just as
performant as the extract.

HTH.

-Todd


On Mon, Jan 30, 2017 at 10:15 PM, Jörn Franke  wrote:

> With a lot of data (TB) it is not that good, hence the extraction.
> Otherwise you have to wait every time you do drag and drop. With the
> extracts it is better.
>
> On 30 Jan 2017, at 22:59, Mich Talebzadeh 
> wrote:
>
> Thanks Jorn,
>
> So Tableau uses its own in-memory representation as I guessed. Now the
> question is how is performance accessing data in Oracle tables>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 30 January 2017 at 21:51, Jörn Franke  wrote:
>
>> Depending on the size of the data i recommend to schedule regularly an
>> extract in tableau. There tableau converts it to an internal in-memory
>> representation outside of Spark (can also exist on disk if memory is too
>> small) and then use it within Tableau. Accessing directly  the database is
>> not so efficient.
>> Additionally use always the newest version of tableau..
>>
>> On 30 Jan 2017, at 21:57, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> Has anyone tried using Tableau on Spark SQL?
>>
>> Specifically how does Tableau handle in-memory capabilities of Spark.
>>
>> As I understand Tableau uses its own propriety SQL against say Oracle.
>> That is well established. So for each product Tableau will try to use its
>> own version of SQL against that product  like Spark
>> or Hive.
>>
>> However, when I last tried Tableau on Hive, the mapping and performance
>> was not that good in comparision with the same tables and data in Hive..
>>
>> My approach has been to take Oracle 11.g sh schema
>> containing
>> star schema and create and ingest the same tables and data  into Hive
>> tables. Then run Tableau against these tables and do the performance
>> comparison. Given that Oracle is widely used with Tableau this test makes
>> sense?
>>
>> Thanks.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>


Cosine Similarity Implementation in Spark

2017-01-30 Thread Manish Tripathi
I have a data frame with two columns (id, vector(tf-idf)). The first column is the
id of the document, while the second column is a vector of tf-idf values.

I want to use DIMSUM for cosine similarity, but unfortunately I have Spark 1.x, and
it looks like these methods are implemented only from Spark 2.x onwards, so the
corresponding cosine similarity method for RowMatrix is not there.

So I thought maybe I could use the cosine similarity method of the IndexedRowMatrix
object, as I see a corresponding cosine similarity method in the IndexedRowMatrix
docs.

So here are a couple of questions on the same.

1) How do I first convert my Spark data frame to IndexedRowMatrix format?

2) Does the cosine similarity method in IndexedRowMatrix also use DIMSUM, like the
cosineSimilarity method of RowMatrix?

3) In RowMatrix, if I use Scala, I do have access to the cosine similarity method
there. However, it gives a matrix of similarities with no row indices (since
RowMatrix is an index-less matrix). So how do I infer the cosine similarity of each
doc id with the others from the output of RowMatrix?

Please advise.

Link to docs on IndexedRowMatrix.

http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix.columnSimilarities
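
For what it's worth, a rough Scala sketch of building an IndexedRowMatrix from an
(id, vector) data frame. The column and variable names are assumptions, and note
that columnSimilarities compares columns, not rows, so document-to-document
similarity needs a transpose first:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    // df has columns "id" (Long) and "features" (mllib Vector of tf-idf weights)
    val indexedRows = df.rdd.map { row =>
      IndexedRow(row.getAs[Long]("id"), row.getAs[Vector]("features"))
    }
    val mat = new IndexedRowMatrix(indexedRows)

    // columnSimilarities() compares COLUMNS (terms). To compare documents,
    // transpose so that documents become columns; the DIMSUM threshold overload
    // is only on RowMatrix. Assumes ids are dense 0-based indices so they survive
    // as matrix indices.
    val docSims = mat.toCoordinateMatrix()
      .transpose()
      .toIndexedRowMatrix()
      .toRowMatrix()
      .columnSimilarities(0.1) // threshold value is illustrative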


graphframes stateful motif

2017-01-30 Thread geoHeil
Starting out with GraphFrames, I would like to understand stateful motifs better.
There is a nice example in the documentation.

How can I explicitly return the counts?

How could it be extended to count
- the friends of each vertex with age > 30
- the percentage of friendsGreater30 / allFriends

http://stackoverflow.com/questions/41946947/spark-graphframes-stateful-motif
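
For the counting part, a rough sketch (the graph g and column names follow the
GraphFrames friends example; the exact motif is an assumption):

    import org.apache.spark.sql.functions.col

    // find() returns a plain DataFrame, so the counts are just DataFrame aggregations
    val friends = g.find("(a)-[e]->(b)")

    val friendsOver30 = friends.filter("b.age > 30")
      .groupBy(col("a.id").as("id")).count()
      .withColumnRenamed("count", "friendsOver30")

    val allFriends = friends
      .groupBy(col("a.id").as("id")).count()
      .withColumnRenamed("count", "allFriends")

    // percentage of friends over 30 per vertex
    val ratio = friendsOver30.join(allFriends, "id")
      .selectExpr("id", "friendsOver30 / allFriends AS pctOver30")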

cheers,
Georg



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/graphframes-stateful-motif-tp28352.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Examples in graphx

2017-01-30 Thread Ankur Srivastava
The one issue with using Neo4j is that you need to persist the whole graph on one
single machine, i.e. you cannot shard the graph. I am not sure what the size of
your graph is, but if it is huge, one way to shard could be to use the component
id. You can generate a component id by running ConnectedComponents on your graph
in GraphX or GraphFrames.

But GraphX or GraphFrames expects the data as DataFrames (RDDs) of vertices and
edges, and it really relies on the relational nature of these entities to run any
algorithm. AFAIK the same is the case with Giraph too, so if you want to use
GraphFrames as your processing engine, you can choose to persist your data in Hive
tables and not in a native graph format.
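
A minimal GraphX sketch of generating the component ids (graph construction is
omitted; variable names are illustrative):

    // graph: org.apache.spark.graphx.Graph[VD, ED] built from your vertex and
    // edge RDDs. connectedComponents() tags every vertex with the smallest
    // vertex id in its component, which can then serve as a shard key.
    val componentOfVertex = graph.connectedComponents().vertices
    // componentOfVertex: RDD[(VertexId, VertexId)], i.e. (vertexId, componentId)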

Hope this helps.

Thanks
Ankur

On Sun, Jan 29, 2017 at 10:27 AM, Felix Cheung 
wrote:

> Which graph DB are you thinking about?
> Here's one for neo4j
>
> https://neo4j.com/blog/neo4j-3-0-apache-spark-connector/
>
> --
> *From:* Deepak Sharma 
> *Sent:* Sunday, January 29, 2017 4:28:19 AM
> *To:* spark users
> *Subject:* Examples in graphx
>
> Hi There,
> Are there any examples of using GraphX along with any graph DB?
> I am looking to persist the graph in graph based DB and then read it back
> in spark , process using graphx.
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: mapWithState question

2017-01-30 Thread Cody Koeninger
Keep an eye on

https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging

although it'll likely be a while

On Mon, Jan 30, 2017 at 3:41 PM, Tathagata Das
 wrote:
> If you care about the semantics of those writes to Kafka, then you should be
> aware of two things.
> 1. There are no transactional writes to Kafka.
> 2. So, when tasks get reexecuted due to any failure, your mapping function
> will also be reexecuted, and the writes to kafka can happen multiple times.
> So you may only get at least once guarantee about those Kafka writes
>
>
> On Mon, Jan 30, 2017 at 10:02 AM, shyla deshpande 
> wrote:
>>
>> Hello,
>>
>> TD, your suggestion works great. Thanks
>>
>> I have 1 more question, I need to write to kafka from within the
>> mapWithState function. Just wanted to check if this a bad pattern in any
>> way.
>>
>> Thank you.
>>
>>
>>
>>
>>
>> On Sat, Jan 28, 2017 at 9:14 AM, shyla deshpande
>>  wrote:
>>>
>>> Thats a great idea. I will try that. Thanks.
>>>
>>> On Sat, Jan 28, 2017 at 2:35 AM, Tathagata Das
>>>  wrote:

 1 state object for each user.
 union both streams into a single DStream, and apply mapWithState on it
 to update the user state.
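
A rough sketch of that union + mapWithState shape (the event and state classes
here are made up for illustration, not the poster's actual types):

    import org.apache.spark.streaming.{State, StateSpec}

    case class Progress(minutes: Long, chapters: Int, lessons: Int)
    case class UserState(totalMinutes: Long, chaptersCompleted: Int, lessonsCompleted: Int)

    // both inputs are DStream[(String, Progress)] keyed by user id
    val events = minutesStream.union(progressStream)

    val spec = StateSpec.function {
      (userId: String, event: Option[Progress], state: State[UserState]) =>
        val prev = state.getOption.getOrElse(UserState(0L, 0, 0))
        val next = event.fold(prev) { e =>
          UserState(prev.totalMinutes + e.minutes,
                    prev.chaptersCompleted + e.chapters,
                    prev.lessonsCompleted + e.lessons)
        }
        state.update(next)      // a single state object per user
        (userId, next)
    }

    val userStates = events.mapWithState(spec)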

 On Sat, Jan 28, 2017 at 12:30 AM, shyla deshpande
  wrote:
>
> Can multiple DStreams manipulate a state? I have a stream that gives me
> total minutes the user spent on a course material. I have another stream
> that gives me chapters completed and lessons completed by the user. I want
> to keep track for each user total_minutes, chapters_completed and
> lessons_completed. I am not sure if I should have 1 state or 2 states. 
> Can I
> lookup the state for a given key just like a map outside the mapfunction?
>
> Appreciate your input. Thanks


>>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: mapWithState question

2017-01-30 Thread shyla deshpande
Thanks. Appreciate your input.

On Mon, Jan 30, 2017 at 1:41 PM, Tathagata Das 
wrote:

> If you care about the semantics of those writes to Kafka, then you should
> be aware of two things.
> 1. There are no transactional writes to Kafka.
> 2. So, when tasks get reexecuted due to any failure, your mapping function
> will also be reexecuted, and the writes to kafka can happen multiple times.
> So you may only get at least once guarantee about those Kafka writes
>
>
> On Mon, Jan 30, 2017 at 10:02 AM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>> Hello,
>>
>> TD, your suggestion works great. Thanks
>>
>> I have 1 more question, I need to write to kafka from within the
>> mapWithState function. Just wanted to check if this a bad pattern in any
>> way.
>>
>> Thank you.
>>
>>
>>
>>
>>
>> On Sat, Jan 28, 2017 at 9:14 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Thats a great idea. I will try that. Thanks.
>>>
>>> On Sat, Jan 28, 2017 at 2:35 AM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 1 state object for each user.
 union both streams into a single DStream, and apply mapWithState on it
 to update the user state.

 On Sat, Jan 28, 2017 at 12:30 AM, shyla deshpande <
 deshpandesh...@gmail.com> wrote:

> Can multiple DStreams manipulate a state? I have a stream that gives
> me total minutes the user spent on a course material. I have another
> stream that gives me chapters completed and lessons completed by the 
> user. I
> want to keep track for each user total_minutes, chapters_completed and
> lessons_completed. I am not sure if I should have 1 state or 2
> states. Can I lookup the state for a given key just like a map
> outside the mapfunction?
>
> Appreciate your input. Thanks
>


>>>
>>
>


Re: kafka structured streaming source refuses to read

2017-01-30 Thread Michael Armbrust
Thanks for following up!  I've linked the relevant tickets to
SPARK-18057  and I
targeted it for Spark 2.2.

On Sat, Jan 28, 2017 at 10:15 AM, Koert Kuipers  wrote:

> there was also already an existing spark ticket for this:
> SPARK-18779 
>
> On Sat, Jan 28, 2017 at 1:13 PM, Koert Kuipers  wrote:
>
>> it seems the bug is:
>> https://issues.apache.org/jira/browse/KAFKA-4547
>>
>> i would advise everyone not to use kafka-clients 0.10.0.2, 0.10.1.0 or
>> 0.10.1.1
>>
>> On Fri, Jan 27, 2017 at 3:56 PM, Koert Kuipers  wrote:
>>
>>> in case anyone else runs into this:
>>>
>>> the issue is that i was using kafka-clients 0.10.1.1
>>>
>>> it works when i use kafka-clients 0.10.0.1 with spark structured
>>> streaming
>>>
>>> my kafka server is 0.10.1.1
>>>
>>> On Fri, Jan 27, 2017 at 1:24 PM, Koert Kuipers 
>>> wrote:
>>>
 i checked my topic. it has 5 partitions but all the data is written to
 a single partition: wikipedia-2
 i turned on debug logging and i see this:

 2017-01-27 13:02:50 DEBUG kafka010.KafkaSource: Partitions assigned to
 consumer: [wikipedia-0, wikipedia-4, wikipedia-3, wikipedia-2,
 wikipedia-1]. Seeking to the end.
 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
 partition wikipedia-0
 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
 partition wikipedia-4
 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
 partition wikipedia-3
 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
 partition wikipedia-2
 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
 partition wikipedia-1
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
 partition wikipedia-0 to latest offset.
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
 offset=0} for partition wikipedia-0
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
 partition wikipedia-0 to earliest offset.
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
 offset=0} for partition wikipedia-0
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
 partition wikipedia-4 to latest offset.
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
 offset=0} for partition wikipedia-4
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
 partition wikipedia-4 to earliest offset.
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
 offset=0} for partition wikipedia-4
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
 partition wikipedia-3 to latest offset.
 2017-01-27 13:02:50 DEBUG internals.AbstractCoordinator: Received
 successful heartbeat response for group spark-kafka-source-fac4f749-fd
 56-4a32-82c7-e687aadf520b-1923704552-driver-0
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
 offset=0} for partition wikipedia-3
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
 partition wikipedia-3 to earliest offset.
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
 offset=0} for partition wikipedia-3
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
 partition wikipedia-2 to latest offset.
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
 offset=152908} for partition wikipedia-2
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
 partition wikipedia-2 to earliest offset.
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
 offset=0} for partition wikipedia-2
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
 partition wikipedia-1 to latest offset.
 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
 offset=0} for partition wikipedia-1
 2017-01-27 13:02:50 DEBUG kafka010.KafkaSource: Got latest offsets for
 partition : Map(wikipedia-1 -> 0, wikipedia-4 -> 0, wikipedia-2 -> 0,
 wikipedia-3 -> 0, wikipedia-0 -> 0)

 what is confusing to me is this:
 Resetting offset for partition wikipedia-2 to latest offset.
 Fetched {timestamp=-1, offset=152908} for partition wikipedia-2
 Got latest offsets for partition : Map(wikipedia-1 -> 0, wikipedia-4 ->
 0, wikipedia-2 -> 0, wikipedia-3 -> 0, wikipedia-0 -> 0)

 why does it find latest offset 152908 for wikipedia-2 but then sets
 latest offset to 0 for that partition? or am i misunderstanding?

 On Fri, Jan 27, 2017 at 1:19 PM, Koert Kuipers 
 wrote:

> code:
>   val query = spark.readStream
> .format("kafka")
> 
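
The code excerpt above is cut off in the archive; for reference, a generic sketch
of a structured streaming Kafka source (bootstrap servers and topic name are
placeholders, not the poster's actual values):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "wikipedia")
      .option("startingOffsets", "earliest") // read from the beginning, not just new data
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")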

Re: Tableau BI on Spark SQL

2017-01-30 Thread Jörn Franke
With a lot of data (TB) it is not that good, hence the extraction. Otherwise 
you have to wait every time you do drag and drop. With the extracts it is 
better.

> On 30 Jan 2017, at 22:59, Mich Talebzadeh  wrote:
> 
> Thanks Jorn,
> 
> So Tableau uses its own in-memory representation as I guessed. Now the 
> question is how is performance accessing data in Oracle tables>
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 30 January 2017 at 21:51, Jörn Franke  wrote:
>> Depending on the size of the data i recommend to schedule regularly an 
>> extract in tableau. There tableau converts it to an internal in-memory 
>> representation outside of Spark (can also exist on disk if memory is too 
>> small) and then use it within Tableau. Accessing directly  the database is 
>> not so efficient. 
>> Additionally use always the newest version of tableau..
>> 
>>> On 30 Jan 2017, at 21:57, Mich Talebzadeh  wrote:
>>> 
>>> Hi,
>>> 
>>> Has anyone tried using Tableau on Spark SQL?
>>> 
>>> Specifically how does Tableau handle in-memory capabilities of Spark.
>>> 
>>> As I understand Tableau uses its own propriety SQL against say Oracle. That 
>>> is well established. So for each product Tableau will try to use its own 
>>> version of SQL against that product  like Spark
>>> or Hive.
>>> 
>>> However, when I last tried Tableau on Hive, the mapping and performance was 
>>> not that good in comparision with the same tables and data in Hive..
>>> 
>>> My approach has been to take Oracle 11.g sh schema containing star schema 
>>> and create and ingest the same tables and data  into Hive tables. Then run 
>>> Tableau against these tables and do the performance comparison. Given that 
>>> Oracle is widely used with Tableau this test makes sense?
>>> 
>>> Thanks.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction.
>>>  
> 


Re: Tableau BI on Spark SQL

2017-01-30 Thread Mich Talebzadeh
Thanks Jorn,

So Tableau uses its own in-memory representation as I guessed. Now the
question is: how is the performance when accessing data in Oracle tables?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 January 2017 at 21:51, Jörn Franke  wrote:

> Depending on the size of the data i recommend to schedule regularly an
> extract in tableau. There tableau converts it to an internal in-memory
> representation outside of Spark (can also exist on disk if memory is too
> small) and then use it within Tableau. Accessing directly  the database is
> not so efficient.
> Additionally use always the newest version of tableau..
>
> On 30 Jan 2017, at 21:57, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> Has anyone tried using Tableau on Spark SQL?
>
> Specifically how does Tableau handle in-memory capabilities of Spark.
>
> As I understand Tableau uses its own propriety SQL against say Oracle.
> That is well established. So for each product Tableau will try to use its
> own version of SQL against that product  like Spark
> or Hive.
>
> However, when I last tried Tableau on Hive, the mapping and performance
> was not that good in comparision with the same tables and data in Hive..
>
> My approach has been to take Oracle 11.g sh schema
> containing
> star schema and create and ingest the same tables and data  into Hive
> tables. Then run Tableau against these tables and do the performance
> comparison. Given that Oracle is widely used with Tableau this test makes
> sense?
>
> Thanks.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>


Re: Tableau BI on Spark SQL

2017-01-30 Thread Jörn Franke
Depending on the size of the data, I recommend scheduling a regular extract in
Tableau. Tableau converts the data to an internal in-memory representation outside
of Spark (which can also spill to disk if memory is too small) and then uses it
within Tableau. Accessing the database directly is not as efficient.
Additionally, always use the newest version of Tableau.

> On 30 Jan 2017, at 21:57, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> Has anyone tried using Tableau on Spark SQL?
> 
> Specifically how does Tableau handle in-memory capabilities of Spark.
> 
> As I understand Tableau uses its own propriety SQL against say Oracle. That 
> is well established. So for each product Tableau will try to use its own 
> version of SQL against that product  like Spark
> or Hive.
> 
> However, when I last tried Tableau on Hive, the mapping and performance was 
> not that good in comparision with the same tables and data in Hive..
> 
> My approach has been to take Oracle 11.g sh schema containing star schema and 
> create and ingest the same tables and data  into Hive tables. Then run 
> Tableau against these tables and do the performance comparison. Given that 
> Oracle is widely used with Tableau this test makes sense?
> 
> Thanks.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


Re: mapWithState question

2017-01-30 Thread Tathagata Das
If you care about the semantics of those writes to Kafka, then you should
be aware of two things.
1. There are no transactional writes to Kafka.
2. So, when tasks get reexecuted due to any failure, your mapping function
will also be reexecuted, and the writes to kafka can happen multiple times.
So you may only get at least once guarantee about those Kafka writes


On Mon, Jan 30, 2017 at 10:02 AM, shyla deshpande 
wrote:

> Hello,
>
> TD, your suggestion works great. Thanks
>
> I have 1 more question, I need to write to kafka from within the
> mapWithState function. Just wanted to check if this a bad pattern in any
> way.
>
> Thank you.
>
>
>
>
>
> On Sat, Jan 28, 2017 at 9:14 AM, shyla deshpande  > wrote:
>
>> Thats a great idea. I will try that. Thanks.
>>
>> On Sat, Jan 28, 2017 at 2:35 AM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> 1 state object for each user.
>>> union both streams into a single DStream, and apply mapWithState on it
>>> to update the user state.
>>>
>>> On Sat, Jan 28, 2017 at 12:30 AM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 Can multiple DStreams manipulate a state? I have a stream that gives
 me total minutes the user spent on a course material. I have another
 stream that gives me chapters completed and lessons completed by the user. 
 I
 want to keep track for each user total_minutes, chapters_completed and
 lessons_completed. I am not sure if I should have 1 state or 2 states. Can
 I lookup the state for a given key just like a map outside the mapfunction?

 Appreciate your input. Thanks

>>>
>>>
>>
>


Tableau BI on Spark SQL

2017-01-30 Thread Mich Talebzadeh
Hi,

Has anyone tried using Tableau on Spark SQL?

Specifically how does Tableau handle in-memory capabilities of Spark.

As I understand it, Tableau uses its own proprietary SQL against, say, Oracle.
That is well established. So for each product, Tableau will try to use its own
version of SQL against that product, such as Spark or Hive.

However, when I last tried Tableau on Hive, the mapping and performance were not
that good in comparison with the same tables and data in Hive.

My approach has been to take the Oracle 11g SH schema containing a star schema,
and create and ingest the same tables and data into Hive tables, then run Tableau
against these tables and do the performance comparison. Given that Oracle is
widely used with Tableau, does this test make sense?

Thanks.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Saving parquet file in Spark giving error when Encryption at Rest is implemented

2017-01-30 Thread morfious902002
We are using spark 1.6.1 on a CDH 5.5 cluster. The job worked fine with
Kerberos but when we implemented Encryption at Rest we ran into the
following issue:-

Df.write().mode(SaveMode.Append).partitionBy("Partition").parquet(path);

I have already tried setting these values with no success :-


sparkContext.hadoopConfiguration().set("parquet.enable.summary-metadata",
"true"/"false");   

sparkContext.hadoopConfiguration().setInt("parquet.metadata.read.parallelism",
1);

 SparkConf.set("spark.sql.parquet.mergeSchema","false");
 SparkConf.set("spark.sql.parquet.filterPushdown","true");

Ideally I would like to set summary-metadata to false as it will save
some time during the write.

17/01/30 18:37:54 WARN hadoop.ParquetOutputCommitter: could not write
summary file for hdfs://abc
java.io.IOException: Could not read footer: java.io.IOException: Could 
not
read footer for file
FileStatus{path=hdfs://abc/Partition=O/part-r-3-95adb09f-627f-42fe-9b89-7631226e998f.gz.parquet;
isDirectory=false; length=12775; replication=3; blocksize=134217728;
modification_time=1485801467817; access_time=1485801467179;
owner=bigdata-service; group=bigdata; permission=rw-rw; isSymlink=false}
at
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
at
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:262)
at
org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:56)
at
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:149)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:106)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:106)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:106)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
at
thomsonreuters.northstar.main.ParquetFileWriter.writeDataToParquet(ParquetFileWriter.java:173)
at
thomsonreuters.northstar.main.SparkProcessor.process(SparkProcessor.java:128)
at 
thomsonreuters.northstar.main.NorthStarMain.main(NorthStarMain.java:129)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:558)
Caused by: java.io.IOException: Could not read footer for file
FileStatus{path=hdfs://abc/Partition=O/part-r-3-95adb09f-627f-42fe-9b89-7631226e998f.gz.parquet;
isDirectory=false; length=12775; replication=3; blocksize=134217728;
modification_time=1485801467817; access_time=1485801467179;
owner=bigdata-app-ooxp-service; group=bigdata; permission=rw-rw;
isSymlink=false}
at
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:239)
at
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at

Re: Dynamic resource allocation to Spark on Mesos

2017-01-30 Thread Michael Gummelt
On Mon, Jan 30, 2017 at 9:47 AM, Ji Yan  wrote:

> Tasks begin scheduling as soon as the first executor comes up
>
>
> Thanks all for the clarification. Is this the default behavior of Spark on
> Mesos today? I think this is what we are looking for because sometimes a
> job can take up lots of resources and later jobs could not get all the
> resources that it asks for. If a Spark job starts with only a subset of
> resources that it asks for, does it know to expand its resources later when
> more resources become available?
>

Yes.
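
For completeness, a minimal sketch of turning dynamic allocation on (the executor
bounds are illustrative, and the external shuffle service must be running on each
agent so executors can be released safely):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "20")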


>
> Launch each executor with at least 1GB RAM, but if mesos offers 2GB at
>> some moment, then launch an executor with 2GB RAM
>
>
> This is less useful in our use case. But I am also quite interested in
> cases in which this could be helpful. I think this will also help with
> overall resource utilization on the cluster if when another job starts up
> that has a hard requirement on resources, the extra resources to the first
> job can be flexibly re-allocated to the second job.
>
> On Sat, Jan 28, 2017 at 2:32 PM, Michael Gummelt 
> wrote:
>
>> We've talked about that, but it hasn't become a priority because we
>> haven't had a driving use case.  If anyone has a good argument for
>> "variable" resource allocation like this, please let me know.
>>
>> On Sat, Jan 28, 2017 at 9:17 AM, Shuai Lin 
>> wrote:
>>
>>> An alternative behavior is to launch the job with the best resource
 offer Mesos is able to give
>>>
>>>
>>> Michael has just made an excellent explanation about dynamic allocation
>>> support in mesos. But IIUC, what you want to achieve is something like
>>> (using RAM as an example) : "Launch each executor with at least 1GB RAM,
>>> but if mesos offers 2GB at some moment, then launch an executor with 2GB
>>> RAM".
>>>
>>> I wonder what's benefit of that? To reduce the "resource fragmentation"?
>>>
>>> Anyway, that is not supported at this moment. In all the supported
>>> cluster managers of spark (mesos, yarn, standalone, and the up-to-coming
>>> spark on kubernetes), you have to specify the cores and memory of each
>>> executor.
>>>
>>> It may not be supported in the future, because only mesos has the
>>> concepts of offers because of its two-level scheduling model.
>>>
>>>
>>> On Sat, Jan 28, 2017 at 1:35 AM, Ji Yan  wrote:
>>>
 Dear Spark Users,

 Currently is there a way to dynamically allocate resources to Spark on
 Mesos? Within Spark we can specify the CPU cores, memory before running
 job. The way I understand is that the Spark job will not run if the CPU/Mem
 requirement is not met. This may lead to decrease in overall utilization of
 the cluster. An alternative behavior is to launch the job with the best
 resource offer Mesos is able to give. Is this possible with the current
 implementation?

 Thanks
 Ji

 The information in this email is confidential and may be legally
 privileged. It is intended solely for the addressee. Access to this email
 by anyone else is unauthorized. If you are not the intended recipient, any
 disclosure, copying, distribution or any action taken or omitted to be
 taken in reliance on it, is prohibited and may be unlawful.

>>>
>>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>>
>
>
> The information in this email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful.
>



-- 
Michael Gummelt
Software Engineer
Mesosphere


Pyspark 2.1.0 weird behavior with repartition

2017-01-30 Thread Blaž Šnuderl
I am loading a simple text file using pyspark. Repartitioning it seems to
produce garbage data.

I got this results using spark 2.1 prebuilt for hadoop 2.7 using pyspark
shell.

>>> sc.textFile("outc").collect()
[u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
>>> sc.textFile("outc", use_unicode=False).collect()
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']

Repartitioning seems to produce garbage and also only 2 records here
>>> sc.textFile("outc", use_unicode=False).repartition(10).collect()
['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.',
'\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
>>> sc.textFile("outc", use_unicode=False).repartition(10).count()
2


Without setting use_unicode=False we can't even repartition at all
>>> sc.textFile("outc").repartition(19).collect()
Traceback (most recent call last):  




  File "", line 1, in 
  File
"/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py",
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File
"/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py",
line 140, in _load_from_socket
for item in serializer.load_stream(rf):
  File
"/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py",
line 529, in load_stream
yield self.loads(stream)
  File
"/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py",
line 524, in loads
return s.decode("utf-8") if self.use_unicode else s
  File
"/home/snuderl/scrappstore/virtualenv/lib/python2.7/encodings/utf_8.py",
line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
invalid start byte



Input file contents:
a
b
c
d
e
f
g
h
i
j
k
l



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-2-1-0-weird-behavior-with-repartition-tp28350.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-30 Thread Alex
Hi All,

If I modify the code as below, the Hive UDF works in spark-sql but gives different
results. Please let me know the difference between the two pieces of code below.

1) public Object get(Object name) {
       int pos = getPos((String) name);
       if (pos < 0) return null;
       String f = "string";
       Object obj = list.get(pos);
       Object result = null;
       if (obj == null) return null;
       ObjectInspector ins = ((StructField) colnames.get(pos)).getFieldObjectInspector();
       if (ins != null) f = ins.getTypeName();
       PrimitiveObjectInspector ins2 = (PrimitiveObjectInspector) ins;
       switch (ins2.getPrimitiveCategory()) {
           case DOUBLE:
               result = new Double(((DoubleObjectInspector) ins2).get(obj));
               break;
           case LONG:
               result = new Long(((LongObjectInspector) ins2).get(obj));
               break;
           case STRING:
               result = ((StringObjectInspector) ins2).getPrimitiveJavaObject(obj);
               break;
           default:
               result = obj;
       }
       return result;
   }






2) public Object get(Object name) {
       int pos = getPos((String) name);
       if (pos < 0) return null;
       String f = "string";
       Object obj = list.get(pos);
       if (obj == null) return null;
       ObjectInspector ins = ((StructField) colnames.get(pos)).getFieldObjectInspector();
       if (ins != null) f = ins.getTypeName();
       switch (f) {
           case "double": return ((DoubleWritable) obj).get();
           case "bigint": return ((LongWritable) obj).get();
           case "string": return ((Text) obj).toString();
           default:       return obj;
       }
   }

On Tue, Jan 24, 2017 at 5:29 PM, Sirisha Cheruvu  wrote:

> Hi Team,
>
> I am trying to keep below code in get method and calling that get mthod in
> another hive UDF
> and running the hive UDF using Hive Context.sql procedure..
>
>
> switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((LongWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
> Suprisingly only LongWritable and Text convrsions are throwing error but
> DoubleWritable is working
> So I tried changing below code to
>
> switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((DoubleWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
> Still its throws error saying Java.Lang.Long cant be convrted
> to org.apache.hadoop.hive.serde2.io.DoubleWritable
>
>
>
> its working fine on hive but throwing error on spark-sql
>
> I am importing the below packages.
> import java.util.*;
> import org.apache.hadoop.hive.serde2.objectinspector.*;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.hive.serde2.io.DoubleWritable;
>
> .Please let me know why it is making issue in spark when perfectly running
> fine on hive
>


Re: mapWithState question

2017-01-30 Thread shyla deshpande
Hello,

TD, your suggestion works great. Thanks

I have 1 more question, I need to write to kafka from within the
mapWithState function. Just wanted to check if this a bad pattern in any
way.

Thank you.





On Sat, Jan 28, 2017 at 9:14 AM, shyla deshpande 
wrote:

> Thats a great idea. I will try that. Thanks.
>
> On Sat, Jan 28, 2017 at 2:35 AM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> 1 state object for each user.
>> union both streams into a single DStream, and apply mapWithState on it to
>> update the user state.
>>
>> On Sat, Jan 28, 2017 at 12:30 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Can multiple DStreams manipulate a state? I have a stream that gives me
>>> total minutes the user spent on a course material. I have another
>>> stream that gives me chapters completed and lessons completed by the user. I
>>> want to keep track for each user total_minutes, chapters_completed and
>>> lessons_completed. I am not sure if I should have 1 state or 2 states. Can
>>> I lookup the state for a given key just like a map outside the mapfunction?
>>>
>>> Appreciate your input. Thanks
>>>
>>
>>
>


Re: Dynamic resource allocation to Spark on Mesos

2017-01-30 Thread Ji Yan
>
> Tasks begin scheduling as soon as the first executor comes up


Thanks all for the clarification. Is this the default behavior of Spark on
Mesos today? I think this is what we are looking for because sometimes a
job can take up lots of resources and later jobs could not get all the
resources that it asks for. If a Spark job starts with only a subset of
resources that it asks for, does it know to expand its resources later when
more resources become available?

Launch each executor with at least 1GB RAM, but if mesos offers 2GB at some
> moment, then launch an executor with 2GB RAM


This is less useful in our use case. But I am also quite interested in
cases in which this could be helpful. I think this will also help with
overall resource utilization on the cluster if when another job starts up
that has a hard requirement on resources, the extra resources to the first
job can be flexibly re-allocated to the second job.

On Sat, Jan 28, 2017 at 2:32 PM, Michael Gummelt 
wrote:

> We've talked about that, but it hasn't become a priority because we
> haven't had a driving use case.  If anyone has a good argument for
> "variable" resource allocation like this, please let me know.
>
> On Sat, Jan 28, 2017 at 9:17 AM, Shuai Lin  wrote:
>
>> An alternative behavior is to launch the job with the best resource offer
>>> Mesos is able to give
>>
>>
>> Michael has just made an excellent explanation about dynamic allocation
>> support in mesos. But IIUC, what you want to achieve is something like
>> (using RAM as an example) : "Launch each executor with at least 1GB RAM,
>> but if mesos offers 2GB at some moment, then launch an executor with 2GB
>> RAM".
>>
>> I wonder what's benefit of that? To reduce the "resource fragmentation"?
>>
>> Anyway, that is not supported at this moment. In all the supported
>> cluster managers of spark (mesos, yarn, standalone, and the up-to-coming
>> spark on kubernetes), you have to specify the cores and memory of each
>> executor.
>>
>> It may not be supported in the future, because only mesos has the
>> concepts of offers because of its two-level scheduling model.
>>
>>
>> On Sat, Jan 28, 2017 at 1:35 AM, Ji Yan  wrote:
>>
>>> Dear Spark Users,
>>>
>>> Currently is there a way to dynamically allocate resources to Spark on
>>> Mesos? Within Spark we can specify the CPU cores, memory before running
>>> job. The way I understand is that the Spark job will not run if the CPU/Mem
>>> requirement is not met. This may lead to decrease in overall utilization of
>>> the cluster. An alternative behavior is to launch the job with the best
>>> resource offer Mesos is able to give. Is this possible with the current
>>> implementation?
>>>
>>> Thanks
>>> Ji
>>>
>>> The information in this email is confidential and may be legally
>>> privileged. It is intended solely for the addressee. Access to this email
>>> by anyone else is unauthorized. If you are not the intended recipient, any
>>> disclosure, copying, distribution or any action taken or omitted to be
>>> taken in reliance on it, is prohibited and may be unlawful.
>>>
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>

-- 
 

The information in this email is confidential and may be legally 
privileged. It is intended solely for the addressee. Access to this email 
by anyone else is unauthorized. If you are not the intended recipient, any 
disclosure, copying, distribution or any action taken or omitted to be 
taken in reliance on it, is prohibited and may be unlawful.


Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-30 Thread Alex
How to debug Hive UDFs?!

On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu"  wrote:

> Hi Team,
>
> I am trying to keep below code in get method and calling that get mthod in
> another hive UDF
> and running the hive UDF using Hive Context.sql procedure..
>
>
> switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((LongWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
> Suprisingly only LongWritable and Text convrsions are throwing error but
> DoubleWritable is working
> So I tried changing below code to
>
> switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((DoubleWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
> Still its throws error saying Java.Lang.Long cant be convrted
> to org.apache.hadoop.hive.serde2.io.DoubleWritable
>
>
>
> its working fine on hive but throwing error on spark-sql
>
> I am importing the below packages.
> import java.util.*;
> import org.apache.hadoop.hive.serde2.objectinspector.*;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.hive.serde2.io.DoubleWritable;
>
> .Please let me know why it is making issue in spark when perfectly running
> fine on hive
>


Re: how to compare two avro format hive tables

2017-01-30 Thread Deepak Sharma
You can use spark-testing-base's RDD comparators.
Create two DataFrames from these two Hive tables.
Convert them to RDDs and use spark-testing-base's compareRDD.

Here is an example for rdd comparison:
https://github.com/holdenk/spark-testing-base/wiki/RDDComparisons
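
If pulling in another library is not an option, a plain DataFrame diff with
except() is a simpler (though not identical) alternative. This is just a sketch
assuming both tables share the same schema, not the spark-testing-base approach:

    val a = sqlContext.table("db.table_a")   // table names are placeholders
    val b = sqlContext.table("db.table_b")

    // rows present in one table but missing from the other; note that except()
    // has set semantics, so duplicate row counts are not compared
    val onlyInA = a.except(b)
    val onlyInB = b.except(a)
    println(s"only in A: ${onlyInA.count()}, only in B: ${onlyInB.count()}")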


On Mon, Jan 30, 2017 at 9:07 PM, Alex  wrote:

> Hi Team,
>
> how to compare two avro format hive tables if there is same data in it
>
> if i give limit 5 its giving different results
>
>
>
>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


how to compare two avro format hive tables

2017-01-30 Thread Alex
Hi Team,

How can I compare two Avro-format Hive tables to check whether they contain the same data?

If I query with LIMIT 5, I get different results.


Re: userClassPathFirst=true prevents SparkContext to be initialized

2017-01-30 Thread Koert Kuipers
I don't know why this is happening, but I have given up on userClassPathFirst. I
have seen many weird errors with it and consider it broken.

On Jan 30, 2017 05:24, "Roberto Coluccio" 
wrote:

Hello folks,

I'm trying to work around an issue with some dependencies by trying to
specify at spark-submit time that I want my (user) classpath to be resolved
and taken into account first (against the jars received through the System
Classpath, which is /data/cloudera/parcels/CDH/jars/).

In order to accomplish this, I specify

--conf spark.driver.userClassPathFirst=true
--conf spark.executor.userClassPathFirst=true

and I pass my jars with

--jars 

in my spark-submit command, deploying in yarn cluster mode in a CDH 5.8
environment (Spark 1.6).

In the list passed with --jars I have severals deps, NOT including
hadoop/spark related ones. My app jar is not a fat (uber) one, thus it
includes only business classes. None of these ones has for any reasons a
"SparkConf.set("master", "local")", or anything like that.

Without specifying the userClassPathFirst configuration, my App is launched
and completed with no issues at all.

I tried to print logs down to the TRACE level with no luck. I get no
explicit errors and I verified adding the "-verbose:class" JVM arg that
Spark-related classes seem to be loaded with no issues. From a rapid
overview of loaded classes, it seems to me that a small fraction of classes
is loaded using userClassPathFirst=true w/r/t the default case. Eventually,
my driver's stderr gets stuck in logging out:

2017-01-30 10:10:22,308 INFO  ApplicationMaster:58 - Waiting for spark
context initialization ...
2017-01-30 10:10:32,310 INFO  ApplicationMaster:58 - Waiting for spark
context initialization ...
2017-01-30 10:10:42,311 INFO  ApplicationMaster:58 - Waiting for spark
context initialization ...

Dramatically, the application is then killed by YARN after a timeout.

In my understanding, quoting the doc (http://spark.apache.org/docs/
1.6.2/configuration.html):

[image: Inline image 1]

So I would expect the libs given through --jars options to be used first,
but I also expect no issues in loading the system classpath afterwards.
This is confirmed by the logs printed with the "-verbose:class" JVM option,
where I can see logs like:

[Loaded org.apache.spark.SparkContext from
file:/data/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/spark-assembly-1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0.jar]


What am I missing here guys?

Thanks for your help.

Best regards,

Roberto


Reason behind mapping of StringType with CLOB nullType

2017-01-30 Thread Amiya Mishra
Hi,

I am new to spark-sql. I am getting below mapping details in JdbcUtils.scala
as:

*case StringType => Option(JdbcType("TEXT", java.sql.Types.CLOB))* in line
number 125.

which says StringType maps to the JDBC database type "TEXT" with a JDBC null type
of CLOB, which internally has the value 2005.

This mapping is usually used when no specific dialect is found for a particular
database.

*Is there any reason behind this mapping ??*

I am running into this because when I want to set a null value into a Redshift
table using the code below, I am unable to do so:

*stmt.setNull(i + 1, nullTypes(i))* in line number 190 of JdbcUtils.scala.

The table column itself can accept null values.

I am using below versions:

spark version: 2.0.2
redshift jdbc driver version: 1.2.1.1001

*Can anybody advise on this issue or explain the reason behind this mapping?*
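
One possible workaround (a sketch, not a confirmed fix for the Redshift driver) is
to register a custom JdbcDialect so StringType maps to a VARCHAR type instead of
the TEXT/CLOB fallback; the URL prefix and VARCHAR size are assumptions:

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
    import org.apache.spark.sql.types._

    object RedshiftStringDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:redshift")
      override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        // map StringType to VARCHAR with a VARCHAR null type instead of TEXT/CLOB
        case StringType => Some(JdbcType("VARCHAR(65535)", java.sql.Types.VARCHAR))
        case _          => None
      }
    }

    JdbcDialects.registerDialect(RedshiftStringDialect)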





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reason-behind-mapping-of-StringType-with-CLOB-nullType-tp28349.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-30 Thread Alex
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error:
java.lang.Double cannot be cast to
org.apache.hadoop.hive.serde2.io.DoubleWritable]

I am getting the above error when running the Hive UDF on Spark, but the UDF works
perfectly fine in Hive. The code of the get method is below.


public Object get(Object name) {
    int pos = getPos((String) name);
    if (pos < 0) return null;
    String f = "string";
    Object obj = list.get(pos);
    if (obj == null) return null;
    ObjectInspector ins = ((StructField) colnames.get(pos)).getFieldObjectInspector();
    if (ins != null) f = ins.getTypeName();
    switch (f) {
        case "double": return ((DoubleWritable) obj).get();
        case "bigint": return ((LongWritable) obj).get();
        case "string": return ((Text) obj).toString();
        default:       return obj;
    }
}



On Tue, Jan 24, 2017 at 9:19 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> Could you show us the whole code to reproduce that?
>
> // maropu
>
> On Wed, Jan 25, 2017 at 12:02 AM, Deepak Sharma 
> wrote:
>
>> Can you try writing the UDF directly in spark and register it with spark
>> sql or hive context ?
>> Or do you want to reuse the existing UDF jar for hive in spark ?
>>
>> Thanks
>> Deepak
>>
>> On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu"  wrote:
>>
>>> Hi Team,
>>>
>>> I am trying to keep below code in get method and calling that get mthod
>>> in another hive UDF
>>> and running the hive UDF using Hive Context.sql procedure..
>>>
>>>
>>> switch (f) {
>>> case "double" :  return ((DoubleWritable)obj).get();
>>> case "bigint" :  return ((LongWritable)obj).get();
>>> case "string" :  return ((Text)obj).toString();
>>> default  :  return obj;
>>>   }
>>> }
>>>
>>> Suprisingly only LongWritable and Text convrsions are throwing error but
>>> DoubleWritable is working
>>> So I tried changing below code to
>>>
>>> switch (f) {
>>> case "double" :  return ((DoubleWritable)obj).get();
>>> case "bigint" :  return ((DoubleWritable)obj).get();
>>> case "string" :  return ((Text)obj).toString();
>>> default  :  return obj;
>>>   }
>>> }
>>>
>>> Still its throws error saying Java.Lang.Long cant be convrted
>>> to org.apache.hadoop.hive.serde2.io.DoubleWritable
>>>
>>>
>>>
>>> its working fine on hive but throwing error on spark-sql
>>>
>>> I am importing the below packages.
>>> import java.util.*;
>>> import org.apache.hadoop.hive.serde2.objectinspector.*;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.hive.serde2.io.DoubleWritable;
>>>
>>> .Please let me know why it is making issue in spark when perfectly
>>> running fine on hive
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: DAG Visualization option is missing on Spark Web UI

2017-01-30 Thread Md. Rezaul Karim
Hi Mark,

That worked for me! Thanks a million.

Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


On 29 January 2017 at 01:53, Mark Hamstra  wrote:

> Try selecting a particular Job instead of looking at the summary page for
> all Jobs.
>
> On Sat, Jan 28, 2017 at 4:25 PM, Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Hi Jacek,
>>
>> I tried accessing Spark web UI on both Firefox and Google Chrome browsers
>> with ad blocker enabled. I do see other options like* User, Total
>> Uptime, Scheduling Mode, **Active Jobs, Completed Jobs and* Event
>> Timeline. However, I don't see an option for DAG visualization.
>>
>> Please note that I am experiencing the same issue with Spark 2.x (i.e.
>> 2.0.0, 2.0.1, 2.0.2 and 2.1.0). Refer the attached screenshot of the UI
>> that I am seeing on my machine:
>>
>> [image: Inline images 1]
>>
>>
>> Please suggest.
>>
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim*, BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>> 
>>
>> On 28 January 2017 at 18:51, Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> I wonder if you have an ad blocker enabled in your browser? Is this the
>>> only version giving you this behavior? Do all your Spark jobs lack the
>>> visualization?
>>>
>>> Jacek
>>>
>>> On 28 Jan 2017 7:03 p.m., "Md. Rezaul Karim" <
>>> rezaul.ka...@insight-centre.org> wrote:
>>>
>>> Hi All,
>>>
>>> I am running a Spark job, written in Scala with Spark 2.1.0, on my local
>>> machine. However, I am not seeing any "DAG Visualization" option at
>>> http://localhost:4040/jobs/
>>>
>>>
>>> Suggestion, please.
>>>
>>>
>>>
>>>
>>> Regards,
>>> _
>>> *Md. Rezaul Karim*, BSc, MSc
>>> PhD Researcher, INSIGHT Centre for Data Analytics
>>> National University of Ireland, Galway
>>> IDA Business Park, Dangan, Galway, Ireland
>>> Web: http://www.reza-analytics.eu/index.html
>>> 
>>>
>>>
>>>
>>
>


Pruning decision tree in Spark

2017-01-30 Thread Md. Rezaul Karim
Hi there,

Say I have a deep tree that needs to be pruned to produce an optimal
tree. In R, for example, this can be done with the *rpart/prune* function.

Is it possible to prune a *Spark MLlib/ML-based decision tree* while
performing a classification or regression task?
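
As far as I can tell, neither spark.mllib nor spark.ml exposes a post-pruning
call comparable to rpart/prune; the closest built-in control is pre-pruning,
i.e. constraining tree growth through the estimator's parameters. A minimal
Java sketch with spark.ml follows; the data path and parameter values are
placeholders, not recommendations:

import org.apache.spark.ml.classification.DecisionTreeClassificationModel;
import org.apache.spark.ml.classification.DecisionTreeClassifier;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PrePrunedTree {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("pre-pruned-dt").getOrCreate();

    // Any DataFrame with "label" and "features" columns will do; the path is a placeholder.
    Dataset<Row> training = spark.read().format("libsvm").load("data/sample_libsvm_data.txt");

    DecisionTreeClassifier dt = new DecisionTreeClassifier()
        .setMaxDepth(5)              // hard cap on tree depth
        .setMinInstancesPerNode(10)  // do not split very small nodes
        .setMinInfoGain(0.01);       // reject splits with negligible gain

    DecisionTreeClassificationModel model = dt.fit(training);
    System.out.println(model.toDebugString());

    spark.stop();
  }
}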




Regards,
_
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html



userClassPathFirst=true prevents SparkContext to be initialized

2017-01-30 Thread Roberto Coluccio
Hello folks,

I'm trying to work around an issue with some dependencies by specifying at
spark-submit time that I want my (user) classpath to be resolved and taken
into account first (ahead of the jars received through the system
classpath, which is /data/cloudera/parcels/CDH/jars/).

In order to accomplish this, I specify

--conf spark.driver.userClassPathFirst=true
--conf spark.executor.userClassPathFirst=true

and I pass my jars with

--jars 

in my spark-submit command, deploying in yarn cluster mode in a CDH 5.8
environment (Spark 1.6).

In the list passed with --jars I have several dependencies, NOT including
Hadoop/Spark-related ones. My app jar is not a fat (uber) one, so it
contains only business classes. None of them contains, for any reason, a
"SparkConf.set("master", "local")" or anything like that.

Without specifying the userClassPathFirst configuration, my App is launched
and completed with no issues at all.

I tried printing logs down to the TRACE level with no luck. I get no
explicit errors, and by adding the "-verbose:class" JVM argument I verified
that Spark-related classes seem to be loaded with no issues. From a quick
overview of the loaded classes, it seems that a smaller fraction of classes
is loaded with userClassPathFirst=true than in the default case. Eventually,
my driver's stderr gets stuck logging:

2017-01-30 10:10:22,308 INFO  ApplicationMaster:58 - Waiting for spark
context initialization ...
2017-01-30 10:10:32,310 INFO  ApplicationMaster:58 - Waiting for spark
context initialization ...
2017-01-30 10:10:42,311 INFO  ApplicationMaster:58 - Waiting for spark
context initialization ...

The application is then killed by YARN after a timeout.

In my understanding, quoting the doc (
http://spark.apache.org/docs/1.6.2/configuration.html):

[image: Inline image 1]

So I would expect the libs given through the --jars option to be used first,
but I also expect no issues in loading the system classpath afterwards.
This is confirmed by the logs printed with the "-verbose:class" JVM option,
where I can see entries like:

[Loaded org.apache.spark.SparkContext from
file:/data/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/spark-assembly-1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0.jar]
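
The same check can also be done from inside the driver, which sometimes makes
it easier to see which copy of a class actually won; a small sketch (the class
name is just an example):

import java.security.CodeSource;

// Print which jar a class was loaded from, to cross-check the
// -verbose:class output when diagnosing classpath ordering.
public class WhichJar {
  public static void main(String[] args) throws ClassNotFoundException {
    String className = args.length > 0 ? args[0] : "org.apache.spark.SparkContext";
    CodeSource src = Class.forName(className).getProtectionDomain().getCodeSource();
    System.out.println(className + " loaded from "
        + (src != null ? src.getLocation() : "<bootstrap or unknown>"));
  }
}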


What am I missing here guys?

Thanks for your help.

Best regards,

Roberto


ML model to associate search terms with objects and reranking them

2017-01-30 Thread dilip.srid...@unvired.com
Dear Spark ML Community,

Is there an ML model to associate 'search terms' with objects (articles,
etc.)? I have considered the PIO Text Classification and Universal
Recommendation variants, but these mainly help categorise or find related
items and do not allow associating a 'search term' with an object. I would
also like events that capture whether the served results were appropriate
("worked" and "did not work" kinds of events for a search-term/object
combination).

Or is it better to use a reranking algorithm instead of the ML templates
from PIO? Thanks for any suggestions.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ML-model-to-associate-search-terms-with-objects-and-reranking-them-tp28347.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Having multiple spark context

2017-01-30 Thread Rohit Verma
Two ways:

1. There is experimental support for this; see 
https://issues.apache.org/jira/browse/SPARK-2243 (a sketch of the related flag 
follows this list). I'm afraid you might need to build Spark from source.
2. Use middleware: deploy the two apps separately and have them communicate 
with your application over messaging/REST.
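
For reference, the flag tied to that experimental path is, as far as I know,
spark.driver.allowMultipleContexts. A minimal sketch follows; note that this
is unsupported, and the flag merely relaxes the one-context-per-JVM check
rather than making concurrent contexts safe:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SecondContextSketch {
  public static void main(String[] args) {
    // Experimental and unsupported: disable the check that normally
    // prevents a second SparkContext in the same JVM.
    SparkConf conf = new SparkConf()
        .setAppName("second-context")
        .setMaster("local[*]")
        .set("spark.driver.allowMultipleContexts", "true");
    JavaSparkContext second = new JavaSparkContext(conf);
    // ... local-mode work ...
    second.stop();
  }
}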

Regards
Rohit

On Jan 30, 2017, at 2:07 PM, 
jasbir.s...@accenture.com wrote:

Is there any way in which my application can connect to multiple Spark Clusters?
Or is communication between Spark clusters possible?

Regards,
Jasbir

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Monday, January 30, 2017 1:33 PM
To: vincent gromakowski
Cc: Rohit Verma; user@spark.apache.org; Sing, Jasbir; Mark Hamstra
Subject: Re: Having multiple spark context

In general, within a single JVM (which is basically what running in local mode 
means) you have only one SparkContext. However, you can stop the current 
SparkContext with:

sc.stop()

HTH

Dr Mich Talebzadeh

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.


On 30 January 2017 at 07:54, vincent gromakowski wrote:

A clustering lib is necessary to manage multiple jvm. Akka cluster for instance


On 30 Jan 2017 at 8:01 AM, "Rohit Verma" wrote:
Hi,

If I am right, you need to launch other context from another jvm. If you are 
trying to launch from same jvm another context it will return you the existing 
context.

Rohit
On Jan 30, 2017, at 12:24 PM, Mark Hamstra wrote:

More than one Spark Context in a single Application is not supported.

On Sun, Jan 29, 2017 at 9:08 PM, jasbir.s...@accenture.com wrote:
Hi,

I have a requirement in which, my application creates one Spark context in 
Distributed mode whereas another Spark context in local mode.
When I am creating this, my complete application is working on only one 
SparkContext (created in Distributed mode). Second spark context is not getting 
created.

Can you please help me out in how to create two spark contexts.

Regards,
Jasbir singh



This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise confidential information. If you have received it in 
error, please notify the sender immediately and delete the original. Any other 
use of the e-mail by you is prohibited. Where allowed by local law, electronic 
communications with Accenture and its affiliates, including e-mail and instant 
messaging (including content), may be scanned by our systems for the purposes 
of information security and assessment of internal compliance with Accenture 
policy.
__

www.accenture.com



RE: Having multiple spark context

2017-01-30 Thread jasbir.sing
Is there any way in which my application can connect to multiple Spark Clusters?
Or is communication between Spark clusters possible?

Regards,
Jasbir

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Monday, January 30, 2017 1:33 PM
To: vincent gromakowski 
Cc: Rohit Verma ; user@spark.apache.org; Sing, 
Jasbir ; Mark Hamstra 
Subject: Re: Having multiple spark context

In general, within a single JVM (which is basically what running in local mode 
means) you have only one SparkContext. However, you can stop the current 
SparkContext with:

sc.stop()

HTH


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 30 January 2017 at 07:54, vincent gromakowski wrote:

A clustering lib is necessary to manage multiple jvm. Akka cluster for instance

On 30 Jan 2017 at 8:01 AM, "Rohit Verma" wrote:
Hi,

If I am right, you need to launch other context from another jvm. If you are 
trying to launch from same jvm another context it will return you the existing 
context.

Rohit
On Jan 30, 2017, at 12:24 PM, Mark Hamstra wrote:

More than one Spark Context in a single Application is not supported.

On Sun, Jan 29, 2017 at 9:08 PM, jasbir.s...@accenture.com wrote:
Hi,

I have a requirement in which, my application creates one Spark context in 
Distributed mode whereas another Spark context in local mode.
When I am creating this, my complete application is working on only one 
SparkContext (created in Distributed mode). Second spark context is not getting 
created.

Can you please help me out in how to create two spark contexts.

Regards,
Jasbir singh



This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise confidential information. If you have received it in 
error, please notify the sender immediately and delete the original. Any other 
use of the e-mail by you is prohibited. Where allowed by local law, electronic 
communications with Accenture and its affiliates, including e-mail and instant 
messaging (including content), may be scanned by our systems for the purposes 
of information security and assessment of internal compliance with Accenture 
policy.
__

www.accenture.com





Re: Having multiple spark context

2017-01-30 Thread Mich Talebzadeh
In general, within a single JVM (which is basically what running in local
mode means) you have only one SparkContext. However, you can stop the
current SparkContext with:

sc.stop()
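
Put together as a sketch, assuming the two pieces of work can run one after
the other rather than concurrently (class and application names are
placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SequentialContexts {
  public static void main(String[] args) {
    // First context: master and deploy mode supplied by spark-submit (e.g. YARN).
    JavaSparkContext distributed =
        new JavaSparkContext(new SparkConf().setAppName("distributed-part"));
    // ... distributed work ...
    distributed.stop();  // stop it before creating another context

    // Second context: local mode, created only after the first one is stopped.
    JavaSparkContext local = new JavaSparkContext(
        new SparkConf().setAppName("local-part").setMaster("local[*]"));
    // ... local work ...
    local.stop();
  }
}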

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 January 2017 at 07:54, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> A clustering lib is necessary to manage multiple jvm. Akka cluster for
> instance
>
> On 30 Jan 2017 at 8:01 AM, "Rohit Verma" wrote:
>
>> Hi,
>>
>> If I am right, you need to launch other context from another jvm. If you
>> are trying to launch from same jvm another context it will return you the
>> existing context.
>>
>> Rohit
>>
>> On Jan 30, 2017, at 12:24 PM, Mark Hamstra 
>> wrote:
>>
>> More than one Spark Context in a single Application is not supported.
>>
>> On Sun, Jan 29, 2017 at 9:08 PM,  wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> I have a requirement in which, my application creates one Spark context
>>> in Distributed mode whereas another Spark context in local mode.
>>>
>>> When I am creating this, my complete application is working on only one
>>> SparkContext (created in Distributed mode). Second spark context is not
>>> getting created.
>>>
>>>
>>>
>>> Can you please help me out in how to create two spark contexts.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Jasbir singh
>>>
>>> --
>>>
>>> This message is for the designated recipient only and may contain
>>> privileged, proprietary, or otherwise confidential information. If you have
>>> received it in error, please notify the sender immediately and delete the
>>> original. Any other use of the e-mail by you is prohibited. Where allowed
>>> by local law, electronic communications with Accenture and its affiliates,
>>> including e-mail and instant messaging (including content), may be scanned
>>> by our systems for the purposes of information security and assessment of
>>> internal compliance with Accenture policy.
>>> 
>>> __
>>>
>>> www.accenture.com
>>>
>>
>>
>>