Unsubscribe

2017-06-24 Thread Anita Tailor
Unsubscribe 

Sent from my iPhone

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Can we access files on Cluster mode

2017-06-24 Thread Holden Karau
addFile is supposed to not depend on a shared FS unless the semantics have
changed recently.
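
For reference, a minimal sketch of the intended pattern (the file name, paths,
and spark-submit flags below are only illustrative):

import org.apache.spark.SparkFiles

// submitted with: spark-submit --deploy-mode cluster --files /home/sql/first.sql ...
// --files ships first.sql into each node's working directory for this application

// resolve the local copy by bare file name, not by the submit-time path
val path  = SparkFiles.get("first.sql")
val query = scala.io.Source.fromFile(path).mkString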

On Sat, Jun 24, 2017 at 11:55 AM varma dantuluri 
wrote:

> Hi Sudhir,
>
> I believe you have to use a shared file system that is accessed by all
> nodes.
>
>
> On Jun 24, 2017, at 1:30 PM, sudhir k  wrote:
>
>
> I am new to Spark and I need some guidance on how to fetch files passed via
> the --files option of spark-submit.
>
> I read on some forums that we can fetch the files with
> SparkFiles.get(fileName) and use them in our code, and all nodes should be
> able to read them.
>
> But I am facing an issue.
>
> Below is the command I am using:
>
> spark-submit --deploy-mode cluster --class com.check.Driver --files
> /home/sql/first.sql test.jar 20170619
>
> When I use SparkFiles.get("first.sql"), I should be able to read the file
> path, but it throws a FileNotFoundException.
>
> I tried SparkContext.addFile("/home/sql/first.sql") followed by
> SparkFiles.get("first.sql"), but I get the same error.
>
> It works in standalone mode but not in cluster mode. Any help is
> appreciated. I am using Spark 2.1.0 and Scala 2.11.
>
> Thanks.
>
>
> Regards,
> Sudhir K
>
>
>
> --
> Regards,
> Sudhir K
>
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: How does HashPartitioner distribute data in Spark?

2017-06-24 Thread Russell Spitzer
Neither of your code examples invokes a repartitioning. Add in a repartition
command.
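
For example, a minimal sketch (the partition count and key are just
illustrative) that actually routes records through a HashPartitioner:

import org.apache.spark.HashPartitioner

// a pair RDD, explicitly hash-partitioned into 8 partitions by key
val pairs  = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa", 1)), 8)
val hashed = pairs.partitionBy(new HashPartitioner(8))

// all "aa" records now land in the same partition
hashed.mapPartitions(iter => Iterator(iter.length)).collect()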

On Sat, Jun 24, 2017, 11:53 AM Vikash Pareek 
wrote:

> Hi Vadim,
>
> Thank you for your response.
>
> I would like to know how the partitioner chooses the key. If we look at my
> example, the following questions arise:
> 1. In case of rdd1, hash partitioning should calculate the hashcode of the
> key (i.e. *"aa"* in this case), so shouldn't *all records go to a single
> partition* instead of being uniformly distributed?
> 2. In case of rdd2, there is no key-value pair, so how is hash partitioning
> going to work, i.e. *what is the key* used to calculate the hashcode?
>
>
>
> Best Regards,
>
>
> [image: InfoObjects Inc.] 
> Vikash Pareek
> Team Lead  *InfoObjects Inc.*
> Big Data Analytics
>
> m: +91 8800206898 a: E5, Jhalana Institutional Area, Jaipur, Rajasthan
> 302004
> w: www.linkedin.com/in/pvikash e: vikash.par...@infoobjects.com
>
>
>
>
> On Fri, Jun 23, 2017 at 10:38 PM, Vadim Semenov <
> vadim.seme...@datadoghq.com> wrote:
>
>> This is the code that chooses the partition for a key:
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L85-L88
>>
>> it's basically `math.abs(key.hashCode % numberOfPartitions)`
>>
>> On Fri, Jun 23, 2017 at 3:42 AM, Vikash Pareek <
>> vikash.par...@infoobjects.com> wrote:
>>
>>> I am trying to understand how Spark partitioning works.
>>>
>>> To understand this, I have the following piece of code on Spark 1.6:
>>>
>>> def countByPartition1(rdd: RDD[(String, Int)]) = {
>>>   rdd.mapPartitions(iter => Iterator(iter.length))
>>> }
>>> def countByPartition2(rdd: RDD[String]) = {
>>>   rdd.mapPartitions(iter => Iterator(iter.length))
>>> }
>>>
>>> // RDD creation
>>> val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa", 1)), 8)
>>> countByPartition1(rdd1).collect()
>>> >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
>>>
>>> val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
>>> countByPartition2(rdd2).collect()
>>> >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
>>>
>>> In both cases the data is distributed uniformly.
>>> I have the following questions based on the above observation:
>>>
>>> 1. In case of rdd1, hash partitioning should calculate the hashcode of the
>>> key (i.e. "aa" in this case), so shouldn't all records go to a single
>>> partition instead of being uniformly distributed?
>>> 2. In case of rdd2, there is no key-value pair, so how is hash partitioning
>>> going to work, i.e. what is the key used to calculate the hashcode?
>>>
>>> I have read @zero323's answer but it does not answer these questions:
>>>
>>> https://stackoverflow.com/questions/31424396/how-does-hashpartitioner-work
>>>
>>>
>>>
>>>
>>> -
>>>
>>> __Vikash Pareek
>>>
>>>
>>
>


Re: Can we access files on Cluster mode

2017-06-24 Thread varma dantuluri
Hi Sudhir,

I believe you have to use a shared file system that is accessed by all nodes.


> On Jun 24, 2017, at 1:30 PM, sudhir k  wrote:
> 
> 
> I am new to Spark and I need some guidance on how to fetch files passed via
> the --files option of spark-submit.
> 
> I read on some forums that we can fetch the files with
> SparkFiles.get(fileName) and use them in our code, and all nodes should be
> able to read them.
> 
> But I am facing an issue.
> 
> Below is the command I am using:
> 
> spark-submit --deploy-mode cluster --class com.check.Driver --files 
> /home/sql/first.sql test.jar 20170619
> 
> When I use SparkFiles.get("first.sql"), I should be able to read the file
> path, but it throws a FileNotFoundException.
> 
> I tried SparkContext.addFile("/home/sql/first.sql") followed by
> SparkFiles.get("first.sql"), but I get the same error.
> 
> It works in standalone mode but not in cluster mode. Any help is
> appreciated. I am using Spark 2.1.0 and Scala 2.11.
> 
> Thanks.
> 
> 
> 
> Regards,
> Sudhir K
> 
> 
> 
> -- 
> Regards,
> Sudhir K



Fwd: Can we access files on Cluster mode

2017-06-24 Thread sudhir k
I am new to Spark and I need some guidance on how to fetch files passed via
the --files option of spark-submit.

I read on some forums that we can fetch the files with
SparkFiles.get(fileName) and use them in our code, and all nodes should be
able to read them.

But I am facing an issue.

Below is the command I am using:

spark-submit --deploy-mode cluster --class com.check.Driver --files
/home/sql/first.sql test.jar 20170619

When I use SparkFiles.get("first.sql"), I should be able to read the file
path, but it throws a FileNotFoundException.

I tried SparkContext.addFile("/home/sql/first.sql") followed by
SparkFiles.get("first.sql"), but I get the same error.

It works in standalone mode but not in cluster mode. Any help is
appreciated. I am using Spark 2.1.0 and Scala 2.11.

Thanks.


Regards,
Sudhir K



-- 
Regards,
Sudhir K


Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Koert Kuipers
Dataset/DataFrame has repartition (which can be used to partition by key)
and sortWithinPartitions.

see for example usage here:
https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala#L18
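
A minimal sketch of that pattern on a DataFrame (the DataFrame df and the
column name "key" are only illustrative):

import org.apache.spark.sql.functions.col

val outputPartitionCount = 300

// hash-partition rows by key, then sort rows within each resulting partition
val partitionedSorted = df
  .repartition(outputPartitionCount, col("key"))
  .sortWithinPartitions(col("key"))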

On Fri, Jun 23, 2017 at 5:43 PM, Keith Chapman 
wrote:

> Hi,
>
> I have code that does the following using RDDs,
>
> val outputPartitionCount = 300
> val part = new MyOwnPartitioner(outputPartitionCount)
> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>
> where myRdd is correctly formed as key-value pairs. I am looking to convert
> this to use Dataset/Dataframe instead of RDDs, so my question is:
>
> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
> Can I accomplish the partial sort using mapPartitions on the resulting
> partitioned Dataset/Dataframe?
>
> Any thoughts?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>


Re: How does HashPartitioner distribute data in Spark?

2017-06-24 Thread Vikash Pareek
Hi Vadim,

Thank you for your response.

I would like to know how the partitioner chooses the key. If we look at my
example, the following questions arise:
1. In case of rdd1, hash partitioning should calculate the hashcode of the
key (i.e. *"aa"* in this case), so shouldn't *all records go to a single
partition* instead of being uniformly distributed?
2. In case of rdd2, there is no key-value pair, so how is hash partitioning
going to work, i.e. *what is the key* used to calculate the hashcode?



Best Regards,


[image: InfoObjects Inc.] 
Vikash Pareek
Team Lead  *InfoObjects Inc.*
Big Data Analytics

m: +91 8800206898 a: E5, Jhalana Institutional Area, Jaipur, Rajasthan
302004
w: www.linkedin.com/in/pvikash e: vikash.par...@infoobjects.com




On Fri, Jun 23, 2017 at 10:38 PM, Vadim Semenov  wrote:

> This is the code that chooses the partition for a key:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L85-L88
>
> it's basically `math.abs(key.hashCode % numberOfPartitions)`
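
For reference, the logic behind that line is roughly the following (a
paraphrased sketch of the non-negative modulo, not the verbatim source):

// HashPartitioner.getPartition, roughly: a non-negative modulo of the key's hashCode
def partitionFor(key: Any, numPartitions: Int): Int = key match {
  case null => 0
  case _ =>
    val rawMod = key.hashCode % numPartitions
    rawMod + (if (rawMod < 0) numPartitions else 0)
}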
>
> On Fri, Jun 23, 2017 at 3:42 AM, Vikash Pareek <
> vikash.par...@infoobjects.com> wrote:
>
>> I am trying to understand how Spark partitioning works.
>>
>> To understand this, I have the following piece of code on Spark 1.6:
>>
>> def countByPartition1(rdd: RDD[(String, Int)]) = {
>>   rdd.mapPartitions(iter => Iterator(iter.length))
>> }
>> def countByPartition2(rdd: RDD[String]) = {
>>   rdd.mapPartitions(iter => Iterator(iter.length))
>> }
>>
>> // RDD creation
>> val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa", 1)), 8)
>> countByPartition1(rdd1).collect()
>> >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
>>
>> val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
>> countByPartition2(rdd2).collect()
>> >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
>>
>> In both cases the data is distributed uniformly.
>> I have the following questions based on the above observation:
>>
>> 1. In case of rdd1, hash partitioning should calculate the hashcode of the
>> key (i.e. "aa" in this case), so shouldn't all records go to a single
>> partition instead of being uniformly distributed?
>> 2. In case of rdd2, there is no key-value pair, so how is hash partitioning
>> going to work, i.e. what is the key used to calculate the hashcode?
>>
>> I have read @zero323's answer but it does not answer these questions:
>> https://stackoverflow.com/questions/31424396/how-does-hashpartitioner-work
>>
>>
>>
>>
>> -
>>
>> __Vikash Pareek
>>
>>
>


Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Keith Chapman
Hi Nguyen,

This looks promising and seems like I could achieve it using cluster by.
Thanks for the pointer.

Regards,
Keith.

http://keith-chapman.com

On Sat, Jun 24, 2017 at 5:27 AM, nguyen duc Tuan 
wrote:

> Hi Chapman,
> You can use "cluster by" to do what you want.
> https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by/
>
> 2017-06-24 17:48 GMT+07:00 Saliya Ekanayake :
>
>> I haven't worked with datasets, but would this help?
>> https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd
>>
>> On Jun 23, 2017 5:43 PM, "Keith Chapman"  wrote:
>>
>>> Hi,
>>>
>>> I have code that does the following using RDDs,
>>>
>>> val outputPartitionCount = 300
>>> val part = new MyOwnPartitioner(outputPartitionCount)
>>> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>>>
>>> where myRdd is correctly formed as key-value pairs. I am looking to
>>> convert this to use Dataset/Dataframe instead of RDDs, so my question is:
>>>
>>> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
>>> Can I accomplish the partial sort using mapPartitions on the resulting
>>> partitioned Dataset/Dataframe?
>>>
>>> Any thoughts?
>>>
>>> Regards,
>>> Keith.
>>>
>>> http://keith-chapman.com
>>>
>>
>


Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Keith Chapman
Thanks for the pointer, Saliya. I'm looking for an equivalent API in
Dataset/Dataframe for repartitionAndSortWithinPartitions; I've already
converted most of the RDDs to Dataframes.

Regards,
Keith.

http://keith-chapman.com

On Sat, Jun 24, 2017 at 3:48 AM, Saliya Ekanayake  wrote:

> I haven't worked with datasets, but would this help?
> https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd
>
> On Jun 23, 2017 5:43 PM, "Keith Chapman"  wrote:
>
>> Hi,
>>
>> I have code that does the following using RDDs,
>>
>> val outputPartitionCount = 300
>> val part = new MyOwnPartitioner(outputPartitionCount)
>> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>>
>> where myRdd is correctly formed as key-value pairs. I am looking to convert
>> this to use Dataset/Dataframe instead of RDDs, so my question is:
>>
>> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
>> Can I accomplish the partial sort using mapPartitions on the resulting
>> partitioned Dataset/Dataframe?
>>
>> Any thoughts?
>>
>> Regards,
>> Keith.
>>
>> http://keith-chapman.com
>>
>


Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread nguyen duc Tuan
Hi Chapman,
You can use "cluster by" to do what you want.
https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by/
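
A minimal sketch of that approach (the SparkSession spark, the table name
events, and the column key are only illustrative; CLUSTER BY is essentially
DISTRIBUTE BY plus SORT BY):

import org.apache.spark.sql.functions.col

// SQL form: shuffle rows by key, then sort within each partition
val clustered = spark.sql("SELECT * FROM events CLUSTER BY key")

// roughly the same thing through the DataFrame API
val equivalent = spark.table("events").repartition(col("key")).sortWithinPartitions(col("key"))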

2017-06-24 17:48 GMT+07:00 Saliya Ekanayake :

> I haven't worked with datasets, but would this help?
> https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd
>
> On Jun 23, 2017 5:43 PM, "Keith Chapman"  wrote:
>
>> Hi,
>>
>> I have code that does the following using RDDs,
>>
>> val outputPartitionCount = 300
>> val part = new MyOwnPartitioner(outputPartitionCount)
>> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>>
>> where myRdd is correctly formed as key-value pairs. I am looking to convert
>> this to use Dataset/Dataframe instead of RDDs, so my question is:
>>
>> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
>> Can I accomplish the partial sort using mapPartitions on the resulting
>> partitioned Dataset/Dataframe?
>>
>> Any thoughts?
>>
>> Regards,
>> Keith.
>>
>> http://keith-chapman.com
>>
>


Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Saliya Ekanayake
I haven't worked with datasets, but would this help?
https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd
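
A minimal sketch of that kind of RDD-to-Dataset conversion (assumes a
SparkSession named spark and a made-up case class for the row type):

import spark.implicits._

case class Record(key: String, value: Int)

// turn an existing RDD into a typed Dataset (and a DataFrame view of it)
val rdd = spark.sparkContext.parallelize(Seq(Record("aa", 1), Record("bb", 2)))
val ds  = rdd.toDS()   // Dataset[Record]
val df  = ds.toDF()    // untyped DataFrame over the same data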

On Jun 23, 2017 5:43 PM, "Keith Chapman"  wrote:

> Hi,
>
> I have code that does the following using RDDs,
>
> val outputPartitionCount = 300
> val part = new MyOwnPartitioner(outputPartitionCount)
> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>
> where myRdd is correctly formed as key-value pairs. I am looking to convert
> this to use Dataset/Dataframe instead of RDDs, so my question is:
>
> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
> Can I accomplish the partial sort using mapPartitions on the resulting
> partitioned Dataset/Dataframe?
>
> Any thoughts?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>


issue about the windows slice of stream

2017-06-24 Thread Fei Shao
Hi all,
I found an issue with the window slicing of DStreams.
My code is:


// "ip" and port below are placeholders for the actual socket source
val ssc = new StreamingContext(conf, Seconds(1))

val content = ssc.socketTextStream("ip", port)
content.countByValueAndWindow(Seconds(2), Seconds(8)).foreachRDD(rdd => rdd.foreach(println))
The key point is that the slide duration is greater than the window duration.
I checked the output, and the result printed by foreachRDD was wrong:
the stream was being sliced into windows incorrectly.
Can I open a JIRA please?
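
For reference, a fuller self-contained reproduction sketch (the host, port,
and checkpoint path are placeholder assumptions; countByValueAndWindow appears
to require checkpointing since it uses an inverse reduce under the hood):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowSliceRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowSliceRepro")
    val ssc  = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/window-slice-repro")

    // window of 2 seconds, sliding every 8 seconds (slide > window)
    val content = ssc.socketTextStream("localhost", 9999)
    content.countByValueAndWindow(Seconds(2), Seconds(8))
      .foreachRDD(rdd => rdd.foreach(println))

    ssc.start()
    ssc.awaitTermination()
  }
}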


thanks
Fei Shao