Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Jacek Laskowski
Hi,

Cody's right.

subscribe - Topic subscription strategy that accepts topic names as a
comma-separated string, e.g. topic1,topic2,topic3 [1]

subscribepattern - Topic subscription strategy that uses Java’s
java.util.regex.Pattern for the topic subscription regex pattern of topics
to subscribe to, e.g. topic\d [2]

[1]
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html#subscribe
[2]
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html#subscribepattern
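
For illustration, a minimal sketch of both strategies (Scala; the broker address
and topic names are placeholders, and `spark` is assumed to be a SparkSession):

// Strategy 1: a comma-separated list of topic names via "subscribe".
val multiTopic = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1,topic2,topic3")
  .load()

// Strategy 2: a regex via "subscribepattern" that matches the same topics.
val byPattern = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribepattern", "topic\\d")
  .load()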

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming (Apache Spark 2.2+)
https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Tue, Sep 19, 2017 at 10:34 PM, Cody Koeninger  wrote:

> You should be able to pass a comma separated string of topics to
> subscribe.  subscribePattern isn't necessary
>
>
>
> On Tue, Sep 19, 2017 at 2:54 PM, kant kodali  wrote:
> > got it! Sorry.
> >
> > On Tue, Sep 19, 2017 at 12:52 PM, Jacek Laskowski 
> wrote:
> >>
> >> Hi,
> >>
> >> Use subscribepattern
> >>
> >> You haven't googled well enough -->
> >> https://jaceklaskowski.gitbooks.io/spark-structured-
> streaming/spark-sql-streaming-KafkaSource.html
> >> :)
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://about.me/JacekLaskowski
> >> Spark Structured Streaming (Apache Spark 2.2+)
> >> https://bit.ly/spark-structured-streaming
> >> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >> On Tue, Sep 19, 2017 at 9:50 PM, kant kodali 
> wrote:
> >>>
> >>> HI All,
> >>>
> >>> I am wondering How to read from multiple kafka topics using structured
> >>> streaming (code below)? I googled prior to asking this question and I
> see
> >>> responses related to Dstreams but not structured streams. Is it
> possible to
> >>> read multiple topics using the same spark structured stream?
> >>>
> >>> sparkSession.readStream()
> >>> .format("kafka")
> >>> .option("kafka.bootstrap.servers", "localhost:9092")
> >>> .option("subscribe", "hello1")
> >>> .option("startingOffsets", "earliest")
> >>> .option("failOnDataLoss", "false")
> >>> .load();
> >>>
> >>>
> >>> Thanks!
> >>
> >>
> >
>


Re: Structured streaming coding question

2017-09-19 Thread kant kodali
Looks like my problem was the order of awaitTermination() for some reason.

Doesn't work:

outputDS1.writeStream().trigger(Trigger.processingTime(1000))
    .foreach(new KafkaSink("hello1")).start().awaitTermination()

outputDS2.writeStream().trigger(Trigger.processingTime(1000))
    .foreach(new KafkaSink("hello2")).start().awaitTermination()

Works:

StreamingQuery query1 = outputDS1.writeStream().trigger(Trigger.processingTime(1000))
    .foreach(new KafkaSink("hello1")).start();

query1.awaitTermination()

StreamingQuery query2 = outputDS2.writeStream().trigger(Trigger.processingTime(1000))
    .foreach(new KafkaSink("hello2")).start();

query2.awaitTermination()
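
For reference, a minimal sketch (Scala, and only an assumption about how the
real code could be arranged) of another way to block on several queries at
once, using the session's StreamingQueryManager; outputDS1/outputDS2, KafkaSink
and the SparkSession `spark` are taken as given from the pseudo code above:

import org.apache.spark.sql.streaming.Trigger

// Start every query first, then block on all of them together.
val q1 = outputDS1.writeStream.trigger(Trigger.ProcessingTime(1000)).foreach(new KafkaSink("hello1")).start()
val q2 = outputDS2.writeStream.trigger(Trigger.ProcessingTime(1000)).foreach(new KafkaSink("hello2")).start()

// Blocks until any active query terminates or fails.
spark.streams.awaitAnyTermination()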



On Tue, Sep 19, 2017 at 10:09 PM, kant kodali  wrote:

> Looks like my problem was the order of awaitTermination() for some reason.
>
> Doesn't work
>
>
>
>
>
> On Tue, Sep 19, 2017 at 1:54 PM, kant kodali  wrote:
>
>> Hi All,
>>
>> I have the following Psuedo code (I could paste the real code however it
>> is pretty long and involves Database calls inside dataset.map operation and
>> so on) so I am just trying to simplify my question. would like to know if
>> there is something wrong with the following pseudo code?
>>
>> DataSet inputDS = readFromKaka(topicName)
>>
>> DataSet mongoDS = inputDS.map(insertIntoDatabase); // Works
>> Since I can see data getting populated
>>
>> DataSet outputDS1 = mongoDS.map(readFromDatabase); // Works as
>> well
>>
>> DataSet outputDS2 = mongoDS.map( readFromDatabase); // Doesn't
>> work
>>
>> outputDS1.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
>> KafkaSink("hello1")).start().awaitTermination()
>>
>> outputDS2.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
>> KafkaSink("hello2")).start().awaitTermination()
>>
>>
>> *So what's happening with above code is that I can see data coming out of
>> hello1 topic but not from hello2 topic.* I thought there is something
>> wrong with "outputDS2" so I switched the order  so now the code looks like
>> this
>>
>> DataSet inputDS = readFromKaka(topicName)
>>
>> DataSet mongoDS = inputDS.map(insertIntoDatabase); // Works
>> Since I can see data getting populated
>>
>> DataSet outputDS2 = mongoDS.map( readFromDatabase); // This Works
>>
>> DataSet outputDS1 = mongoDS.map(readFromDatabase); // Desn't work
>>
>> outputDS1.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
>> KafkaSink("hello1")).start().awaitTermination()
>>
>> outputDS2.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
>> KafkaSink("hello2")).start().awaitTermination()
>>
>> *Now I can see data coming out from hello2 kafka topic but not from
>> hello1 topic*. *In  short, I can only see data from outputDS1 or
>> outputDS2 but not both. * At this point I am not sure what is going on?
>>
>> Thanks!
>>
>>
>>
>


Re: Structured streaming coding question

2017-09-19 Thread Jacek Laskowski
Hi,

Ah, right! Start the queries first, and only once they're all running, await
their termination.
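
A minimal sketch of that pattern (Scala, reusing the names from the thread;
trigger options omitted for brevity):

// Start both queries up front; only then block on them.
val query1 = outputDS1.writeStream.foreach(new KafkaSink("hello1")).start()
val query2 = outputDS2.writeStream.foreach(new KafkaSink("hello2")).start()

// Both queries are already running at this point.
query1.awaitTermination()
query2.awaitTermination()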

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming (Apache Spark 2.2+)
https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Wed, Sep 20, 2017 at 7:09 AM, kant kodali  wrote:

> Looks like my problem was the order of awaitTermination() for some reason.
>
> Doesn't work
>
>
>
>
>
> On Tue, Sep 19, 2017 at 1:54 PM, kant kodali  wrote:
>
>> Hi All,
>>
>> I have the following Psuedo code (I could paste the real code however it
>> is pretty long and involves Database calls inside dataset.map operation and
>> so on) so I am just trying to simplify my question. would like to know if
>> there is something wrong with the following pseudo code?
>>
>> DataSet inputDS = readFromKaka(topicName)
>>
>> DataSet mongoDS = inputDS.map(insertIntoDatabase); // Works
>> Since I can see data getting populated
>>
>> DataSet outputDS1 = mongoDS.map(readFromDatabase); // Works as
>> well
>>
>> DataSet outputDS2 = mongoDS.map( readFromDatabase); // Doesn't
>> work
>>
>> outputDS1.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
>> KafkaSink("hello1")).start().awaitTermination()
>>
>> outputDS2.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
>> KafkaSink("hello2")).start().awaitTermination()
>>
>>
>> *So what's happening with above code is that I can see data coming out of
>> hello1 topic but not from hello2 topic.* I thought there is something
>> wrong with "outputDS2" so I switched the order  so now the code looks like
>> this
>>
>> DataSet inputDS = readFromKaka(topicName)
>>
>> DataSet mongoDS = inputDS.map(insertIntoDatabase); // Works
>> Since I can see data getting populated
>>
>> DataSet outputDS2 = mongoDS.map( readFromDatabase); // This Works
>>
>> DataSet outputDS1 = mongoDS.map(readFromDatabase); // Desn't work
>>
>> outputDS1.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
>> KafkaSink("hello1")).start().awaitTermination()
>>
>> outputDS2.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
>> KafkaSink("hello2")).start().awaitTermination()
>>
>> *Now I can see data coming out from hello2 kafka topic but not from
>> hello1 topic*. *In  short, I can only see data from outputDS1 or
>> outputDS2 but not both. * At this point I am not sure what is going on?
>>
>> Thanks!
>>
>>
>>
>


Re: Structured streaming coding question

2017-09-19 Thread Jacek Laskowski
Hi,

What's the code in readFromKafka to read from hello2 and hello1?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming (Apache Spark 2.2+)
https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Tue, Sep 19, 2017 at 10:54 PM, kant kodali  wrote:

> Hi All,
>
> I have the following Psuedo code (I could paste the real code however it
> is pretty long and involves Database calls inside dataset.map operation and
> so on) so I am just trying to simplify my question. would like to know if
> there is something wrong with the following pseudo code?
>
> DataSet inputDS = readFromKaka(topicName)
>
> DataSet mongoDS = inputDS.map(insertIntoDatabase); // Works Since
> I can see data getting populated
>
> DataSet outputDS1 = mongoDS.map(readFromDatabase); // Works as well
>
> DataSet outputDS2 = mongoDS.map( readFromDatabase); // Doesn't work
>
> outputDS1.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
> KafkaSink("hello1")).start().awaitTermination()
>
> outputDS2.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
> KafkaSink("hello2")).start().awaitTermination()
>
>
> *So what's happening with above code is that I can see data coming out of
> hello1 topic but not from hello2 topic.* I thought there is something
> wrong with "outputDS2" so I switched the order  so now the code looks like
> this
>
> DataSet inputDS = readFromKaka(topicName)
>
> DataSet mongoDS = inputDS.map(insertIntoDatabase); // Works Since
> I can see data getting populated
>
> DataSet outputDS2 = mongoDS.map( readFromDatabase); // This Works
>
> DataSet outputDS1 = mongoDS.map(readFromDatabase); // Desn't work
>
> outputDS1.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
> KafkaSink("hello1")).start().awaitTermination()
>
> outputDS2.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
> KafkaSink("hello2")).start().awaitTermination()
>
> *Now I can see data coming out from hello2 kafka topic but not from hello1
> topic*. *In  short, I can only see data from outputDS1 or outputDS2 but
> not both. * At this point I am not sure what is going on?
>
> Thanks!
>
>
>


Re: Structured streaming coding question

2017-09-19 Thread kant kodali
Looks like my problem was the order of awaitTermination() for some reason.

Doesn't work





On Tue, Sep 19, 2017 at 1:54 PM, kant kodali  wrote:

> Hi All,
>
> I have the following Psuedo code (I could paste the real code however it
> is pretty long and involves Database calls inside dataset.map operation and
> so on) so I am just trying to simplify my question. would like to know if
> there is something wrong with the following pseudo code?
>
> DataSet inputDS = readFromKaka(topicName)
>
> DataSet mongoDS = inputDS.map(insertIntoDatabase); // Works Since
> I can see data getting populated
>
> DataSet outputDS1 = mongoDS.map(readFromDatabase); // Works as well
>
> DataSet outputDS2 = mongoDS.map( readFromDatabase); // Doesn't work
>
> outputDS1.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
> KafkaSink("hello1")).start().awaitTermination()
>
> outputDS2.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
> KafkaSink("hello2")).start().awaitTermination()
>
>
> *So what's happening with above code is that I can see data coming out of
> hello1 topic but not from hello2 topic.* I thought there is something
> wrong with "outputDS2" so I switched the order  so now the code looks like
> this
>
> DataSet inputDS = readFromKaka(topicName)
>
> DataSet mongoDS = inputDS.map(insertIntoDatabase); // Works Since
> I can see data getting populated
>
> DataSet outputDS2 = mongoDS.map( readFromDatabase); // This Works
>
> DataSet outputDS1 = mongoDS.map(readFromDatabase); // Desn't work
>
> outputDS1.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
> KafkaSink("hello1")).start().awaitTermination()
>
> outputDS2.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
> KafkaSink("hello2")).start().awaitTermination()
>
> *Now I can see data coming out from hello2 kafka topic but not from hello1
> topic*. *In  short, I can only see data from outputDS1 or outputDS2 but
> not both. * At this point I am not sure what is going on?
>
> Thanks!
>
>
>


RE: Cloudera - How to switch to the newly added Spark service (Spark2) from Spark 1.6 in CDH 5.12

2017-09-19 Thread Sudha KS
To set Spark2 as default, refer 
https://www.cloudera.com/documentation/spark2/latest/topics/spark2_admin.html#default_tools

-Original Message-
From: Gaurav1809 [mailto:gauravhpan...@gmail.com] 
Sent: Wednesday, September 20, 2017 9:16 AM
To: user@spark.apache.org
Subject: Cloudera - How to switch to the newly added Spark service (Spark2) 
from Spark 1.6 in CDH 5.12

Hello all,

I downloaded CDH and it comes with Spark 1.6. As per the step-by-step guide
given, I added Spark 2 to the services list. Now I can see both Spark 1.6 &
Spark 2, and when I run spark-shell in a terminal window, it starts with Spark
1.6 only. How do I switch to Spark 2? Which _HOMEs or parameters do I need to
set up? (Or do I need to delete the older service?) Any pointers on this will
be helpful.

Thanks
Gaurav



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Cloudera - How to switch to the newly added Spark service (Spark2) from Spark 1.6 in CDH 5.12

2017-09-19 Thread Gaurav1809
Hello all,

I downloaded CDH and it comes with Spark 1.6. As per the step-by-step guide
given, I added Spark 2 to the services list. Now I can see both Spark 1.6 &
Spark 2, and when I run spark-shell in a terminal window, it starts with Spark
1.6 only. How do I switch to Spark 2? Which _HOMEs or parameters do I need to
set up? (Or do I need to delete the older service?) Any pointers on this will
be helpful.

Thanks
Gaurav



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



unsubscribe

2017-09-19 Thread shshann
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Structured streaming coding question

2017-09-19 Thread kant kodali
Hi All,

I have the following pseudo code (I could paste the real code, however it is
pretty long and involves database calls inside dataset.map operations and so
on), so I am just trying to simplify my question. I would like to know if
there is something wrong with the following pseudo code?

DataSet inputDS = readFromKafka(topicName)

DataSet mongoDS = inputDS.map(insertIntoDatabase); // Works Since I
can see data getting populated

DataSet outputDS1 = mongoDS.map(readFromDatabase); // Works as well

DataSet outputDS2 = mongoDS.map( readFromDatabase); // Doesn't work

outputDS1.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
KafkaSink("hello1")).start().awaitTermination()

outputDS2.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
KafkaSink("hello2")).start().awaitTermination()


So what's happening with the above code is that I can see data coming out of
the hello1 topic but not from the hello2 topic. I thought there was something
wrong with "outputDS2", so I switched the order, and now the code looks like this:

DataSet inputDS = readFromKafka(topicName)

DataSet mongoDS = inputDS.map(insertIntoDatabase); // Works Since I
can see data getting populated

DataSet outputDS2 = mongoDS.map( readFromDatabase); // This Works

DataSet outputDS1 = mongoDS.map(readFromDatabase); // Doesn't work

outputDS1.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
KafkaSink("hello1")).start().awaitTermination()

outputDS2.writeStream().trigger(Trigger.processingTime(1000)).foreach(new
KafkaSink("hello2")).start().awaitTermination()

Now I can see data coming out from the hello2 Kafka topic but not from the
hello1 topic. In short, I can only see data from outputDS1 or outputDS2, but
not both. At this point I am not sure what is going on.

Thanks!


Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Cody Koeninger
You should be able to pass a comma separated string of topics to
subscribe.  subscribePattern isn't necessary



On Tue, Sep 19, 2017 at 2:54 PM, kant kodali  wrote:
> got it! Sorry.
>
> On Tue, Sep 19, 2017 at 12:52 PM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> Use subscribepattern
>>
>> You haven't googled well enough -->
>> https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html
>> :)
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> Spark Structured Streaming (Apache Spark 2.2+)
>> https://bit.ly/spark-structured-streaming
>> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> On Tue, Sep 19, 2017 at 9:50 PM, kant kodali  wrote:
>>>
>>> HI All,
>>>
>>> I am wondering How to read from multiple kafka topics using structured
>>> streaming (code below)? I googled prior to asking this question and I see
>>> responses related to Dstreams but not structured streams. Is it possible to
>>> read multiple topics using the same spark structured stream?
>>>
>>> sparkSession.readStream()
>>> .format("kafka")
>>> .option("kafka.bootstrap.servers", "localhost:9092")
>>> .option("subscribe", "hello1")
>>> .option("startingOffsets", "earliest")
>>> .option("failOnDataLoss", "false")
>>> .load();
>>>
>>>
>>> Thanks!
>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SVD computation limit

2017-09-19 Thread Vadim Semenov
This may also be related to
https://issues.apache.org/jira/browse/SPARK-22033

On Tue, Sep 19, 2017 at 3:40 PM, Mark Bittmann  wrote:

> I've run into this before. The EigenValueDecomposition creates a Java
> Array with 2*k*n elements. The Java Array is indexed with a native integer
> type, so 2*k*n cannot exceed Integer.MAX_VALUE values.
>
> The array is created here:
> https://github.com/apache/spark/blob/master/mllib/src/main/
> scala/org/apache/spark/mllib/linalg/EigenValueDecomposition.scala#L84
>
> If you remove the requirement that 2*k*n < Integer.MAX_VALUE, it will fail
> with java.lang.NegativeArraySizeException. More on this issue here:
> https://issues.apache.org/jira/browse/SPARK-5656
>
> On Tue, Sep 19, 2017 at 9:49 AM, Alexander Ovcharenko <
> shurik@gmail.com> wrote:
>
>> Hello guys,
>>
>> While trying to compute SVD using computeSVD() function, i am getting the
>> following warning with the follow up exception:
>> 17/09/14 12:29:02 WARN RowMatrix: computing svd with k=49865 and
>> n=191077, please check necessity
>> IllegalArgumentException: u'requirement failed: k = 49865 and/or n =
>> 191077 are too large to compute an eigendecomposition'
>>
>> When I try to compute first 3000 singular values, I'm getting several
>> following warnings every second:
>> 17/09/14 13:43:38 WARN TaskSetManager: Stage 4802 contains a task of very
>> large size (135 KB). The maximum recommended task size is 100 KB.
>>
>> The matrix size is 49865 x 191077 and all the singular values are needed.
>>
>> Is there a way to lift that limit and be able to compute whatever number
>> of singular values?
>>
>> Thank you.
>>
>>
>>
>


Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread kant kodali
got it! Sorry.

On Tue, Sep 19, 2017 at 12:52 PM, Jacek Laskowski  wrote:

> Hi,
>
> Use subscribepattern
>
> You haven't googled well enough --> https://jaceklaskowski.
> gitbooks.io/spark-structured-streaming/spark-sql-streaming-
> KafkaSource.html :)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark-
> structured-streaming
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> On Tue, Sep 19, 2017 at 9:50 PM, kant kodali  wrote:
>
>> HI All,
>>
>> I am wondering How to read from multiple kafka topics using structured
>> streaming (code below)? I googled prior to asking this question and I see
>> responses related to Dstreams but not structured streams. Is it possible to
>> read multiple topics using the same spark structured stream?
>>
>> sparkSession.readStream()
>> .format("kafka")
>> .option("kafka.bootstrap.servers", "localhost:9092")
>> .option("subscribe", "hello1")
>> .option("startingOffsets", "earliest")
>> .option("failOnDataLoss", "false")
>> .load();
>>
>>
>> Thanks!
>>
>>
>


Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Jacek Laskowski
Hi,

Use subscribepattern

You haven't googled well enough -->
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html
:)

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming (Apache Spark 2.2+)
https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Tue, Sep 19, 2017 at 9:50 PM, kant kodali  wrote:

> HI All,
>
> I am wondering How to read from multiple kafka topics using structured
> streaming (code below)? I googled prior to asking this question and I see
> responses related to Dstreams but not structured streams. Is it possible to
> read multiple topics using the same spark structured stream?
>
> sparkSession.readStream()
> .format("kafka")
> .option("kafka.bootstrap.servers", "localhost:9092")
> .option("subscribe", "hello1")
> .option("startingOffsets", "earliest")
> .option("failOnDataLoss", "false")
> .load();
>
>
> Thanks!
>
>


How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread kant kodali
Hi All,

I am wondering how to read from multiple Kafka topics using structured
streaming (code below)? I googled prior to asking this question and I see
responses related to DStreams but not Structured Streaming. Is it possible to
read multiple topics using the same Spark structured stream?

sparkSession.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "hello1")
.option("startingOffsets", "earliest")
.option("failOnDataLoss", "false")
.load();


Thanks!


Re: SVD computation limit

2017-09-19 Thread Mark Bittmann
I've run into this before. The EigenValueDecomposition creates a Java Array
with 2*k*n elements. The Java Array is indexed with a native integer type,
so 2*k*n cannot exceed Integer.MAX_VALUE values.

The array is created here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/EigenValueDecomposition.scala#L84

If you remove the requirement that 2*k*n < Integer.MAX_VALUE, it will fail with
java.lang.NegativeArraySizeException. More on this issue here:
https://issues.apache.org/jira/browse/SPARK-5656
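
For the dimensions reported in this thread, a quick check (Scala) shows why
that requirement trips — the array would need far more elements than a single
Java array can hold:

// k and n as reported in the original question.
val k = 49865L
val n = 191077L
val elements = 2L * k * n          // 19,056,109,210 elements
println(elements > Int.MaxValue)   // true: larger than 2,147,483,647, so the array cannot be allocated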

On Tue, Sep 19, 2017 at 9:49 AM, Alexander Ovcharenko 
wrote:

> Hello guys,
>
> While trying to compute SVD using computeSVD() function, i am getting the
> following warning with the follow up exception:
> 17/09/14 12:29:02 WARN RowMatrix: computing svd with k=49865 and n=191077,
> please check necessity
> IllegalArgumentException: u'requirement failed: k = 49865 and/or n =
> 191077 are too large to compute an eigendecomposition'
>
> When I try to compute first 3000 singular values, I'm getting several
> following warnings every second:
> 17/09/14 13:43:38 WARN TaskSetManager: Stage 4802 contains a task of very
> large size (135 KB). The maximum recommended task size is 100 KB.
>
> The matrix size is 49865 x 191077 and all the singular values are needed.
>
> Is there a way to lift that limit and be able to compute whatever number
> of singular values?
>
> Thank you.
>
>
>


Re: Question on partitionColumn for a JDBC read using a timestamp from MySql

2017-09-19 Thread lucas.g...@gmail.com
I ended up doing this for the time being. It works, but a timestamp seems like
a reasonable partitionColumn, and I'm wondering if there's a more built-in way:


> df = spark.read.jdbc(
>
> url=os.environ["JDBCURL"],
>
> table="schema.table",
>
> predicates=predicates
>
> )
>

where Predicates is a list of:

>  ["timestamp between '2017-08-01 00:00:01' and '2017-09-01 00:00:01'",

 "timestamp between '2017-09-01 00:00:01' and '2017-10-01 00:00:01'",

 "timestamp between '2017-10-01 00:00:01' and '2017-11-01 00:00:01'"]




Gets quite nice performance (better than I expected).
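
If it helps, a small sketch of generating those monthly predicates instead of
writing them by hand (Scala for illustration; the column and table names are
the ones from this thread, and the date range is arbitrary):

import java.time.LocalDate
import java.util.Properties

// One "timestamp between ..." predicate per month; each predicate becomes one JDBC partition.
val start = LocalDate.of(2017, 8, 1)
val predicates = (0 until 3).map { i =>
  val from = start.plusMonths(i)
  val to   = from.plusMonths(1)
  s"timestamp between '$from 00:00:01' and '$to 00:00:01'"
}.toArray

// val df = spark.read.jdbc(jdbcUrl, "schema.table", predicates, new Properties())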

Thanks!

On 18 September 2017 at 13:21, lucas.g...@gmail.com 
wrote:

>  I'm pretty sure you can use a timestamp as a partitionColumn, It's
> Timestamp type in MySQL.  It's at base a numeric type and Spark requires a
> numeric type passed in.
>
> This doesn't work as the where parameter in MySQL becomes raw numerics
> which won't query against the mysql Timestamp.
>
>
> minTimeStamp = 1325605540 <-- This is wrong, but I'm not sure what to put
>> in here.
>>
>> maxTimeStamp = 1505641420
>>
>> numPartitions = 20*7
>>
>>
>> dt = spark.read \
>>
>> .format("jdbc") \
>>
>> .option("url", os.environ["JDBC_URL"]) \
>>
>> .option("dbtable", "schema.table") \
>>
>> .option("numPartitions", numPartitions) \
>>
>> .option("partitionColumn", "Timestamp") \
>>
>> .option("lowerBound", minTimeStamp) \
>>
>> .option("upperBound", maxTimeStamp) \
>>
>> .load()
>>
>
> mysql DB schema:
>
>> create table table
>>
>> (
>>
>> EventId VARCHAR(50) not null primary key,
>>
>> userid VARCHAR(200) null,
>>
>> Timestamp TIMESTAMP(19) default CURRENT_TIMESTAMP not null,
>>
>> Referrer VARCHAR(4000) null,
>>
>> ViewedUrl VARCHAR(4000) null
>>
>> );
>>
>> create index Timestamp on Fact_PageViewed (Timestamp);
>>
>
> I'm obviously doing it wrong, but couldn't find anything obvious while
> digging around.
>
> The query that gets generated looks like this (not exactly, it's optimized
> to include some upstream query parameters):
>
>>
>> SELECT `Timestamp`,`Referrer`,`EventId`,`UserId`,`ViewedUrl`
>> FROM schema.table (Timestamp)
>> WHERE Timestamp >= 1452916570 AND Timestamp < 1454202540;  <-- this
>> doesn't query against the mysql timestamp type meaningfully.
>
>
> Thanks!
>
> Gary Lucas
>
>


Re: Nested RDD operation

2017-09-19 Thread Jean Georges Perrin
Have you tried to cache? maybe after the collect() and before the map?

> On Sep 19, 2017, at 7:20 AM, Daniel O' Shaughnessy 
>  wrote:
> 
> Thanks for your response Jean. 
> 
> I managed to figure this out in the end but it's an extremely slow solution 
> and not tenable for my use-case:
> 
>   val rddX = 
> dfWithSchema.select("event_name").rdd.map(_.getString(0).split(",").map(_.trim
>  replaceAll ("[\\[\\]\"]", "")).toList)
>   //val oneRow = 
> Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))
>   rddX.take(5).foreach(println)
>   val severalRows = rddX.collect().map(row =>
> if (row.length == 1) {
>   
> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0).toString)).toDF("event_name")).select("eventIndex").first().getDouble(0))
> } else {
>   row.map(tool => {
> 
> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(tool.toString)).toDF("event_name")).select("eventIndex").first().getDouble(0))
>   })
>   })
> 
> Wondering if there is any better/faster way to do this ?
> 
> Thanks.
> 
> 
> 
> On Fri, 15 Sep 2017 at 13:31 Jean Georges Perrin  > wrote:
> Hey Daniel, not sure this will help, but... I had a similar need where i 
> wanted the content of a dataframe to become a "cell" or a row in the parent 
> dataframe. I grouped by the child dataframe, then collect it as a list in the 
> parent dataframe after a join operation. As I said, not sure it matches your 
> use case, but HIH...
> jg
> 
>> On Sep 15, 2017, at 5:42 AM, Daniel O' Shaughnessy 
>> > wrote:
>> 
>> Hi guys,
>> 
>> I'm having trouble implementing this scenario:
>> 
>> I have a column with a typical entry being : ['apple', 'orange', 'apple', 
>> 'pear', 'pear']
>> 
>> I need to use a StringIndexer to transform this to : [0, 2, 0, 1, 1]
>> 
>> I'm attempting to do this but because of the nested operation on another RDD 
>> I get the NPE.
>> 
>> Here's my code so far, thanks:
>> 
>> val dfWithSchema = 
>> sqlContext.createDataFrame(eventFeaturesRDD).toDF("email", "event_name")
>> 
>>   // attempting
>>   import sqlContext.implicits._
>>   val event_list = dfWithSchema.select("event_name").distinct
>>   val event_listDF = event_list.toDF()
>>   val eventIndexer = new StringIndexer()
>> .setInputCol("event_name")
>> .setOutputCol("eventIndex")
>> .fit(event_listDF)
>> 
>>   val eventIndexed = eventIndexer.transform(event_listDF)
>> 
>>   val converter = new IndexToString()
>> .setInputCol("eventIndex")
>> .setOutputCol("originalCategory")
>> 
>>   val convertedEvents = converter.transform(eventIndexed)
>>   val rddX = 
>> dfWithSchema.select("event_name").rdd.map(_.getString(0).split(",").map(_.trim
>>  replaceAll ("[\\[\\]\"]", "")).toList)
>>   //val oneRow = 
>> Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))
>> 
>>   val severalRows = rddX.map(row => {
>> // Split array into n tools
>> println("ROW: " + row(0).toString)
>> println(row(0).getClass)
>> println("PRINT: " + 
>> eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0))).toDF("event_name")).select("eventIndex").first().getDouble(0))
>> 
>> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row)).toDF("event_name")).select("eventIndex").first().getDouble(0),
>>  Seq(row).toString)
>>   })
>>   // attempting
> 



SVD computation limit

2017-09-19 Thread Alexander Ovcharenko
Hello guys,

While trying to compute the SVD using the computeSVD() function, I am getting
the following warning, followed by this exception:
17/09/14 12:29:02 WARN RowMatrix: computing svd with k=49865 and n=191077,
please check necessity
IllegalArgumentException: u'requirement failed: k = 49865 and/or n = 191077
are too large to compute an eigendecomposition'

When I try to compute first 3000 singular values, I'm getting several
following warnings every second:
17/09/14 13:43:38 WARN TaskSetManager: Stage 4802 contains a task of very
large size (135 KB). The maximum recommended task size is 100 KB.

The matrix size is 49865 x 191077 and all the singular values are needed.

Is there a way to lift that limit and be able to compute whatever number of
singular values?

Thank you.


Re: Nested RDD operation

2017-09-19 Thread ayan guha
How big is the list of fruits in your example? Can you broadcast it?
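
A rough sketch of that idea (Scala, reusing the names from the thread; it
assumes the fitted StringIndexerModel's label array is small enough to
broadcast):

// Broadcast the label -> index mapping once, then do plain lookups inside the map,
// instead of running eventIndexer.transform() once per element.
val labelToIndex: Map[String, Double] =
  eventIndexer.labels.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap
val bcIndex = sqlContext.sparkContext.broadcast(labelToIndex)

val severalRows = rddX.map(_.map(event => bcIndex.value.getOrElse(event, -1.0)))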

On Tue, 19 Sep 2017 at 9:21 pm, Daniel O' Shaughnessy <
danieljamesda...@gmail.com> wrote:

> Thanks for your response Jean.
>
> I managed to figure this out in the end but it's an extremely slow
> solution and not tenable for my use-case:
>
> val rddX = dfWithSchema.select("event_name").rdd.map(_.getString(0).split(
> ",").map(_.trim replaceAll ("[\\[\\]\"]", "")).toList)
> //val oneRow =
> Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))
> rddX.take(5).foreach(println)
> val severalRows = rddX.collect().map(row =>
> if (row.length == 1) {
> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0
> ).toString)).toDF("event_name")).select("eventIndex").first().getDouble(0
> ))
> } else {
> row.map(tool => {
> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq
> (tool.toString)).toDF("event_name")).select("eventIndex"
> ).first().getDouble(0))
> })
> })
>
> Wondering if there is any better/faster way to do this ?
>
> Thanks.
>
>
>
> On Fri, 15 Sep 2017 at 13:31 Jean Georges Perrin  wrote:
>
>> Hey Daniel, not sure this will help, but... I had a similar need where i
>> wanted the content of a dataframe to become a "cell" or a row in the parent
>> dataframe. I grouped by the child dataframe, then collect it as a list in
>> the parent dataframe after a join operation. As I said, not sure it matches
>> your use case, but HIH...
>> jg
>>
>> On Sep 15, 2017, at 5:42 AM, Daniel O' Shaughnessy <
>> danieljamesda...@gmail.com> wrote:
>>
>> Hi guys,
>>
>> I'm having trouble implementing this scenario:
>>
>> I have a column with a typical entry being : ['apple', 'orange', 'apple',
>> 'pear', 'pear']
>>
>> I need to use a StringIndexer to transform this to : [0, 2, 0, 1, 1]
>>
>> I'm attempting to do this but because of the nested operation on another
>> RDD I get the NPE.
>>
>> Here's my code so far, thanks:
>>
>> val dfWithSchema = sqlContext.createDataFrame(eventFeaturesRDD).toDF(
>> "email", "event_name")
>>
>> // attempting
>> import sqlContext.implicits._
>> val event_list = dfWithSchema.select("event_name").distinct
>> val event_listDF = event_list.toDF()
>> val eventIndexer = new StringIndexer()
>> .setInputCol("event_name")
>> .setOutputCol("eventIndex")
>> .fit(event_listDF)
>>
>> val eventIndexed = eventIndexer.transform(event_listDF)
>>
>> val converter = new IndexToString()
>> .setInputCol("eventIndex")
>> .setOutputCol("originalCategory")
>>
>> val convertedEvents = converter.transform(eventIndexed)
>> val rddX = dfWithSchema.select("event_name").rdd.map(_.getString(0
>> ).split(",").map(_.trim replaceAll ("[\\[\\]\"]", "")).toList)
>> //val oneRow =
>> Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))
>>
>> val severalRows = rddX.map(row => {
>> // Split array into n tools
>> println("ROW: " + row(0).toString)
>> println(row(0).getClass)
>> println("PRINT: " +
>> eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0
>> ))).toDF("event_name")).select("eventIndex").first().getDouble(0))
>> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq
>> (row)).toDF("event_name")).select("eventIndex").first().getDouble(0), Seq
>> (row).toString)
>> })
>> // attempting
>>
>>
>> --
Best Regards,
Ayan Guha


Uses of avg hash probe metric in HashAggregateExec?

2017-09-19 Thread Jacek Laskowski
Hi,

I've just noticed the new "avg hash probe" metric for HashAggregateExec
operator [1].

My understanding (after briefly going through the code in the change and
around) is as follows:

Average hash map probe per lookup (i.e. `numProbes` / `numKeyLookups`)

NOTE: `numProbes` and `numKeyLookups` are used in the BytesToBytesMap
append-only hash map for the number of iterations needed to look up a single
key and the total number of lookups, respectively.

Does this description explain the purpose of the metric? Anything important
to add?

Given the way it's calculated and the meaning of numProbes, the higher
`numProbes` the worse. When would that happen? Why is this important for
joins? Why does this affect HashAggregateExec? Does this have anything to
do with broadcast joins?

I'd appreciate any help on this. Promise to share it with the Spark
community in my gitbook or any other place you point to :) Thanks!

[1]
https://github.com/apache/spark/commit/18066f2e61f430b691ed8a777c9b4e5786bf9dbc

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming (Apache Spark 2.2+)
https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


Re: Nested RDD operation

2017-09-19 Thread Daniel O' Shaughnessy
Thanks for your response Jean.

I managed to figure this out in the end but it's an extremely slow solution
and not tenable for my use-case:

val rddX = dfWithSchema.select("event_name").rdd
  .map(_.getString(0).split(",").map(_.trim replaceAll ("[\\[\\]\"]", "")).toList)

// val oneRow = Converted(eventIndexer.transform(sqlContext.sparkContext
//   .parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))

rddX.take(5).foreach(println)

val severalRows = rddX.collect().map(row =>
  if (row.length == 1) {
    eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0).toString))
      .toDF("event_name")).select("eventIndex").first().getDouble(0)
  } else {
    row.map(tool => {
      eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(tool.toString))
        .toDF("event_name")).select("eventIndex").first().getDouble(0)
    })
  })

Wondering if there is any better/faster way to do this ?

Thanks.



On Fri, 15 Sep 2017 at 13:31 Jean Georges Perrin  wrote:

> Hey Daniel, not sure this will help, but... I had a similar need where i
> wanted the content of a dataframe to become a "cell" or a row in the parent
> dataframe. I grouped by the child dataframe, then collect it as a list in
> the parent dataframe after a join operation. As I said, not sure it matches
> your use case, but HIH...
> jg
>
> On Sep 15, 2017, at 5:42 AM, Daniel O' Shaughnessy <
> danieljamesda...@gmail.com> wrote:
>
> Hi guys,
>
> I'm having trouble implementing this scenario:
>
> I have a column with a typical entry being : ['apple', 'orange', 'apple',
> 'pear', 'pear']
>
> I need to use a StringIndexer to transform this to : [0, 2, 0, 1, 1]
>
> I'm attempting to do this but because of the nested operation on another
> RDD I get the NPE.
>
> Here's my code so far, thanks:
>
> val dfWithSchema = sqlContext.createDataFrame(eventFeaturesRDD).toDF(
> "email", "event_name")
>
> // attempting
> import sqlContext.implicits._
> val event_list = dfWithSchema.select("event_name").distinct
> val event_listDF = event_list.toDF()
> val eventIndexer = new StringIndexer()
> .setInputCol("event_name")
> .setOutputCol("eventIndex")
> .fit(event_listDF)
>
> val eventIndexed = eventIndexer.transform(event_listDF)
>
> val converter = new IndexToString()
> .setInputCol("eventIndex")
> .setOutputCol("originalCategory")
>
> val convertedEvents = converter.transform(eventIndexed)
> val rddX = dfWithSchema.select("event_name").rdd.map(_.getString(0).split(
> ",").map(_.trim replaceAll ("[\\[\\]\"]", "")).toList)
> //val oneRow =
> Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))
>
> val severalRows = rddX.map(row => {
> // Split array into n tools
> println("ROW: " + row(0).toString)
> println(row(0).getClass)
> println("PRINT: " +
> eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0
> ))).toDF("event_name")).select("eventIndex").first().getDouble(0))
> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq
> (row)).toDF("event_name")).select("eventIndex").first().getDouble(0), Seq
> (row).toString)
> })
> // attempting
>
>
>


Re: ConcurrentModificationException using Kafka Direct Stream

2017-09-19 Thread HARSH TAKKAR
Thanks Cody,

It worked for me by keeping the number of executors, each with 1 core, equal
to the number of Kafka partitions.
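
In configuration terms that roughly corresponds to something like the following
sketch (the partition count of 12 is only an example):

import org.apache.spark.SparkConf

// One core per executor so a cached KafkaConsumer is never used by two tasks
// concurrently in the same JVM; executor count matches the topic's partition count.
val conf = new SparkConf()
  .set("spark.executor.instances", "12")  // example: topic with 12 partitions
  .set("spark.executor.cores", "1")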



On Mon, Sep 18, 2017 at 8:47 PM Cody Koeninger  wrote:

> Have you searched in jira, e.g.
>
> https://issues.apache.org/jira/browse/SPARK-19185
>
> On Mon, Sep 18, 2017 at 1:56 AM, HARSH TAKKAR 
> wrote:
> > Hi
> >
> > Changing spark version if my last resort, is there any other workaround
> for
> > this problem.
> >
> >
> > On Mon, Sep 18, 2017 at 11:43 AM pandees waran 
> wrote:
> >>
> >> All, May I know what exactly changed in 2.1.1 which solved this problem?
> >>
> >> Sent from my iPhone
> >>
> >> On Sep 17, 2017, at 11:08 PM, Anastasios Zouzias 
> >> wrote:
> >>
> >> Hi,
> >>
> >> I had a similar issue using 2.1.0 but not with Kafka. Updating to 2.1.1
> >> solved my issue. Can you try with 2.1.1 as well and report back?
> >>
> >> Best,
> >> Anastasios
> >>
> >> On 17.09.2017 at 16:48, "HARSH TAKKAR"  wrote:
> >>
> >>
> >> Hi
> >>
> >> I am using spark 2.1.0 with scala  2.11.8, and while iterating over the
> >> partitions of each rdd in a dStream formed using KafkaUtils, i am
> getting
> >> the below exception, please suggest a fix.
> >>
> >> I have following config
> >>
> >> kafka :
> >> enable.auto.commit:"true",
> >> auto.commit.interval.ms:"1000",
> >> session.timeout.ms:"3",
> >>
> >> Spark:
> >>
> >> spark.streaming.backpressure.enabled=true
> >>
> >> spark.streaming.kafka.maxRatePerPartition=200
> >>
> >>
> >> Exception in task 0.2 in stage 3236.0 (TID 77795)
> >> java.util.ConcurrentModificationException: KafkaConsumer is not safe for
> >> multi-threaded access
> >>
> >> --
> >> Kind Regards
> >> Harsh
> >>
> >>
> >
>


Help needed in Dividing open close dates column into multiple columns in dataframe

2017-09-19 Thread Aakash Basu
Hi,

I have a CSV dataset with a column holding the store open and close timings
for each day, but the data is highly variant, as follows -


Mon-Fri 10am-9pm, Sat 10am-8pm, Sun 12pm-6pm
Mon-Sat 10am-8pm, Sun Closed
Mon-Sat 10am-8pm, Sun 10am-6pm
Mon-Friday 9-8 / Saturday 10-7 / Sunday 11-5
Mon-Sat 9am-8pm, Sun 10am-7pm
Mon-Sat 10am-8pm, 11am - 6pm
Mon-Fri 9am-6pm, Sat 10am-5pm, Sun Closed
Mon-Thur 10am-7pm, Fri 10am-5pm, Sat Closed, Sun 10am-5pm
Mon-Sat 10-7 Sun Closed
MON-FRI 10:00-8:00, SAT 10:00-7:00, SUN 12:00-5:00


I have to split the data of this one column into 14 columns, as -

Monday Open Time
Monday Close Time
Tuesday Open Time
Tuesday Close Time
Wednesday Open Time
Wednesday Close Time
Thursday Open Time
Thursday Close Time
Friday Open Time
Friday Close Time
Saturday Open Time
Saturday Close Time
Sunday Open Time
Sunday Close Time

Can someone who has faced a similar issue please let me know how they resolved
it with Spark SQL dataframes.

Using: CSV data, Spark 2.1, PySpark, dataframes. (I tried using a case
statement.)

Thanks,
Aakash.


Spark Executor - jaas.conf with useTicketCache=true

2017-09-19 Thread Hugo Reinwald
Hi All,

Apologies for cross-posting; I posted this on the Kafka user group as well. Is
there any way that I can use the Kerberos ticket cache of the Spark executor
for reading from secured Kafka? I am not 100% sure that the executor would do a
kinit, but I presume so, in order to be able to run code / read HDFS etc. as
the user that submitted the job.

I am connecting to a secured kafka cluster from spark. My jaas.conf looks
like below -
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=true
keyTab="./user.keytab"
principal="u...@example.com";
};

export KAFKA_OPTS="-Djava.security.auth.login.config=/home/user/jaas.conf"

I tested connectivity using kafka-console-consumer and I am able to read
data from kafka topic. However when I used the same in spark-submit using
the below options, I get a kerberos error -

spark-submit --files jaas.conf#jaas.conf --driver-java-options
"-Djava.security.auth.login.config=./jaas.conf" --conf
"spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" ...

Could not login: the client is being asked for a password, but the Kafka client
code does not currently support obtaining a password from the user. not
available to garner authentication information from the user

My question - can we not use the Spark executor's ticket cache (Spark running
the job as "user")? Do we always need to provide the keytab file as well using
--files? I also tested using --principal u...@example.com --keytab , but still
got the same error. Is there any way that I can use the ticket cache from the
Spark executor for Kafka?

PS - I read this link - https://docs.confluent.io/2.0.0/kafka/sasl.html#kerberos
- which says that "For command-line utilities like kafka-console-consumer or
kafka-console-producer, kinit can be used along with useTicketCache=true".

Not sure if this is as per design or am I missing something.

Thanks,
Hugo