Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Saatvik Shah
Thanks guys,

You have all given a number of options to work with.

The thing is that I'm working in a production environment, where it may be
necessary to ensure that no one erroneously inserts invalid records into those
specific columns which should be of the Category data type. The best
alternative there would be a Category-like DataFrame column datatype, without
the additional overhead of running a transformer. Is that possible?
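
For context, the closest I have come so far is a runtime guard like the sketch
below (the column name and allowed values here are just placeholders, not our
real schema); it works, but it is an extra check rather than a real column type:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val allowedEmotions = Seq("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA")

// Fail fast if any row carries a value outside the allowed set.
def validateEmotion(df: DataFrame): DataFrame = {
  val badRows = df.filter(!col("EMOTION").isin(allowedEmotions: _*))
  require(badRows.head(1).isEmpty, "EMOTION contains values outside the allowed set")
  df
}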

Thanks and Regards,
Saatvik

On Sat, Jun 17, 2017 at 11:15 PM, Pralabh Kumar 
wrote:

> make sense :)
>
> On Sun, Jun 18, 2017 at 8:38 AM, 颜发才(Yan Facai) 
> wrote:
>
>> Yes, perhaps we could use SQLTransformer as well.
>>
>> http://spark.apache.org/docs/latest/ml-features.html#sqltransformer
>>
>> On Sun, Jun 18, 2017 at 10:47 AM, Pralabh Kumar 
>> wrote:
>>
>>> Hi Yan
>>>
>>> Yes sql is good option , but if we have to create ML Pipeline , then
>>> having transformers and set it into pipeline stages ,would be better option
>>> .
>>>
>>> Regards
>>> Pralabh Kumar
>>>
>>> On Sun, Jun 18, 2017 at 4:23 AM, 颜发才(Yan Facai) 
>>> wrote:
>>>
 To filter data, how about using sql?

 df.createOrReplaceTempView("df")
 val sqlDF = spark.sql("SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")

 https://spark.apache.org/docs/latest/sql-programming-guide.html#sql



 On Fri, Jun 16, 2017 at 11:28 PM, Pralabh Kumar  wrote:

> Hi Saatvik
>
> You can write your own transformer to make sure that column contains
> ,value which u provided , and filter out rows which doesn't follow the
> same.
>
> Something like this
>
>
> import org.apache.spark.annotation.DeveloperApi
> import org.apache.spark.ml.Transformer
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.{DataFrame, Dataset}
> import org.apache.spark.sql.types.StructType
>
> // Keeps only the rows whose col1 value is in the allowed set ('happy' in this example).
> case class CategoryTransformer(override val uid: String) extends Transformer {
>   override def transform(inputData: Dataset[_]): DataFrame = {
>     inputData.select("col1").filter("col1 in ('happy')")
>   }
>   override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
>   @DeveloperApi
>   override def transformSchema(schema: StructType): StructType = {
>     schema
>   }
> }
>
>
> Usage
>
> val data = sc.parallelize(List("abce","happy")).toDF("col1")
> val trans = new CategoryTransformer("1")
> data.show()
> trans.transform(data).show()
>
>
> This transformer will make sure , you always have values in col1 as
> provided by you.
>
>
> Regards
> Pralabh Kumar
>
> On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah <
> saatvikshah1...@gmail.com> wrote:
>
>> Hi Pralabh,
>>
>> I want the ability to create a column such that its values be
>> restricted to a specific set of predefined values.
>> For example, suppose I have a column called EMOTION: I want to ensure
>> each row value is one of HAPPY,SAD,ANGRY,NEUTRAL,NA.
>>
>> Thanks and Regards,
>> Saatvik Shah
>>
>>
>> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar <
>> pralabhku...@gmail.com> wrote:
>>
>>> Hi satvik
>>>
>>> Can u please provide an example of what exactly you want.
>>>
>>>
>>>
>>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" 
>>> wrote:
>>>
 Hi Yan,

 Basically the reason I was looking for the categorical datatype is
 as given here
 :
 ability to fix column values to specific categories. Is it possible to
 create a user defined data type which could do so?

 Thanks and Regards,
 Saatvik Shah

 On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) <
 facai@gmail.com> wrote:

> You can use some Transformers to handle categorical data,
> For example,
> StringIndexer encodes a string column of labels to a column of
> label indices:
> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>
>
> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
> saatvikshah1...@gmail.com> wrote:
>
>> Hi,
>> I'm trying to convert a Pandas -> Spark dataframe. One of the
>> columns I have
>> is of the Category type in Pandas. But there does not seem to be
>> support for
>> this same type in Spark. What is the best alternative?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-
>> Spark-Dataframe-tp28764.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>

Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Pralabh Kumar
make sense :)

On Sun, Jun 18, 2017 at 8:38 AM, 颜发才(Yan Facai)  wrote:

> Yes, perhaps we could use SQLTransformer as well.
>
> http://spark.apache.org/docs/latest/ml-features.html#sqltransformer
>
> On Sun, Jun 18, 2017 at 10:47 AM, Pralabh Kumar 
> wrote:
>
>> Hi Yan
>>
>> Yes sql is good option , but if we have to create ML Pipeline , then
>> having transformers and set it into pipeline stages ,would be better option
>> .
>>
>> Regards
>> Pralabh Kumar
>>
>> On Sun, Jun 18, 2017 at 4:23 AM, 颜发才(Yan Facai) 
>> wrote:
>>
>>> To filter data, how about using sql?
>>>
>>> df.createOrReplaceTempView("df")
>>> val sqlDF = spark.sql("SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")
>>>
>>> https://spark.apache.org/docs/latest/sql-programming-guide.html#sql
>>>
>>>
>>>
>>> On Fri, Jun 16, 2017 at 11:28 PM, Pralabh Kumar 
>>> wrote:
>>>
 Hi Saatvik

 You can write your own transformer to make sure that column contains
 ,value which u provided , and filter out rows which doesn't follow the
 same.

 Something like this


import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Keeps only the rows whose col1 value is in the allowed set ('happy' in this example).
case class CategoryTransformer(override val uid: String) extends Transformer {
  override def transform(inputData: Dataset[_]): DataFrame = {
    inputData.select("col1").filter("col1 in ('happy')")
  }
  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
  @DeveloperApi
  override def transformSchema(schema: StructType): StructType = {
    schema
  }
}


 Usage

 val data = sc.parallelize(List("abce","happy")).toDF("col1")
 val trans = new CategoryTransformer("1")
 data.show()
 trans.transform(data).show()


 This transformer will make sure , you always have values in col1 as
 provided by you.


 Regards
 Pralabh Kumar

 On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah <
 saatvikshah1...@gmail.com> wrote:

> Hi Pralabh,
>
> I want the ability to create a column such that its values be
> restricted to a specific set of predefined values.
> For example, suppose I have a column called EMOTION: I want to ensure
> each row value is one of HAPPY,SAD,ANGRY,NEUTRAL,NA.
>
> Thanks and Regards,
> Saatvik Shah
>
>
> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar <
> pralabhku...@gmail.com> wrote:
>
>> Hi satvik
>>
>> Can u please provide an example of what exactly you want.
>>
>>
>>
>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" 
>> wrote:
>>
>>> Hi Yan,
>>>
>>> Basically the reason I was looking for the categorical datatype is
>>> as given here
>>> :
>>> ability to fix column values to specific categories. Is it possible to
>>> create a user defined data type which could do so?
>>>
>>> Thanks and Regards,
>>> Saatvik Shah
>>>
>>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) >> > wrote:
>>>
 You can use some Transformers to handle categorical data,
 For example,
 StringIndexer encodes a string column of labels to a column of
 label indices:
 http://spark.apache.org/docs/latest/ml-features.html#stringindexer


 On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
 saatvikshah1...@gmail.com> wrote:

> Hi,
> I'm trying to convert a Pandas -> Spark dataframe. One of the
> columns I have
> is of the Category type in Pandas. But there does not seem to be
> support for
> this same type in Spark. What is the best alternative?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-
> Spark-Dataframe-tp28764.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

>>>
>>>
>>> --
>>> *Saatvik Shah,*
>>> *1st  Year,*
>>> *Masters in the School of Computer Science,*
>>> *Carnegie Mellon University*
>>>
>>> *https://saatvikshah1994.github.io/
>>> *
>>>
>>
>
>
> --
> *Saatvik Shah,*
> *1st  Year,*
> *Masters in the School of Computer Science,*
> *Carnegie Mellon University*
>
> *https://saatvikshah1994.github.io/
> *
>


>>>
>>
>


Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Yan Facai
Yes, perhaps we could use SQLTransformer as well.

http://spark.apache.org/docs/latest/ml-features.html#sqltransformer
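
A rough sketch of how that could look (the column name and values are copied
from the earlier mails, so treat them as placeholders):

import org.apache.spark.ml.feature.SQLTransformer

// __THIS__ is replaced by the input DataFrame when the transformer runs.
val emotionFilter = new SQLTransformer().setStatement(
  "SELECT * FROM __THIS__ WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")

// emotionFilter.transform(df) drops the invalid rows, and the same object
// can be added to a Pipeline as an ordinary stage.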

On Sun, Jun 18, 2017 at 10:47 AM, Pralabh Kumar 
wrote:

> Hi Yan
>
> Yes sql is good option , but if we have to create ML Pipeline , then
> having transformers and set it into pipeline stages ,would be better option
> .
>
> Regards
> Pralabh Kumar
>
> On Sun, Jun 18, 2017 at 4:23 AM, 颜发才(Yan Facai) 
> wrote:
>
>> To filter data, how about using sql?
>>
>> df.createOrReplaceTempView("df")
>> val sqlDF = spark.sql("SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")
>>
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#sql
>>
>>
>>
>> On Fri, Jun 16, 2017 at 11:28 PM, Pralabh Kumar 
>> wrote:
>>
>>> Hi Saatvik
>>>
>>> You can write your own transformer to make sure that column contains
>>> ,value which u provided , and filter out rows which doesn't follow the
>>> same.
>>>
>>> Something like this
>>>
>>>
>>> import org.apache.spark.annotation.DeveloperApi
>>> import org.apache.spark.ml.Transformer
>>> import org.apache.spark.ml.param.ParamMap
>>> import org.apache.spark.sql.{DataFrame, Dataset}
>>> import org.apache.spark.sql.types.StructType
>>>
>>> // Keeps only the rows whose col1 value is in the allowed set ('happy' in this example).
>>> case class CategoryTransformer(override val uid: String) extends Transformer {
>>>   override def transform(inputData: Dataset[_]): DataFrame = {
>>>     inputData.select("col1").filter("col1 in ('happy')")
>>>   }
>>>   override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
>>>   @DeveloperApi
>>>   override def transformSchema(schema: StructType): StructType = {
>>>     schema
>>>   }
>>> }
>>>
>>>
>>> Usage
>>>
>>> val data = sc.parallelize(List("abce","happy")).toDF("col1")
>>> val trans = new CategoryTransformer("1")
>>> data.show()
>>> trans.transform(data).show()
>>>
>>>
>>> This transformer will make sure , you always have values in col1 as
>>> provided by you.
>>>
>>>
>>> Regards
>>> Pralabh Kumar
>>>
>>> On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah >> > wrote:
>>>
 Hi Pralabh,

 I want the ability to create a column such that its values be
 restricted to a specific set of predefined values.
 For example, suppose I have a column called EMOTION: I want to ensure
 each row value is one of HAPPY,SAD,ANGRY,NEUTRAL,NA.

 Thanks and Regards,
 Saatvik Shah


 On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar  wrote:

> Hi satvik
>
> Can u please provide an example of what exactly you want.
>
>
>
> On 16-Jun-2017 7:40 PM, "Saatvik Shah" 
> wrote:
>
>> Hi Yan,
>>
>> Basically the reason I was looking for the categorical datatype is as
>> given here
>> :
>> ability to fix column values to specific categories. Is it possible to
>> create a user defined data type which could do so?
>>
>> Thanks and Regards,
>> Saatvik Shah
>>
>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) 
>> wrote:
>>
>>> You can use some Transformers to handle categorical data,
>>> For example,
>>> StringIndexer encodes a string column of labels to a column of
>>> label indices:
>>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>>>
>>>
>>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
>>> saatvikshah1...@gmail.com> wrote:
>>>
 Hi,
 I'm trying to convert a Pandas -> Spark dataframe. One of the
 columns I have
 is of the Category type in Pandas. But there does not seem to be
 support for
 this same type in Spark. What is the best alternative?



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-
 Spark-Dataframe-tp28764.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 
 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>
>>
>> --
>> *Saatvik Shah,*
>> *1st  Year,*
>> *Masters in the School of Computer Science,*
>> *Carnegie Mellon University*
>>
>> *https://saatvikshah1994.github.io/
>> *
>>
>


 --
 *Saatvik Shah,*
 *1st  Year,*
 *Masters in the School of Computer Science,*
 *Carnegie Mellon University*

 *https://saatvikshah1994.github.io/
 *

>>>
>>>
>>
>


Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Pralabh Kumar
Hi Yan

Yes, SQL is a good option, but if we have to create an ML Pipeline, then having
transformers and setting them as pipeline stages would be the better option.
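
Roughly something like this (CategoryTransformer is the sketch from my earlier
mail quoted below, and the column names are only examples):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val categoryFilter = new CategoryTransformer("categoryFilter")
val indexer = new StringIndexer().setInputCol("col1").setOutputCol("col1_idx")

// The filtering transformer and the indexer run as ordinary pipeline stages.
val pipeline = new Pipeline().setStages(Array(categoryFilter, indexer))
val model = pipeline.fit(data)
model.transform(data).show()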

Regards
Pralabh Kumar

On Sun, Jun 18, 2017 at 4:23 AM, 颜发才(Yan Facai)  wrote:

> To filter data, how about using sql?
>
> df.createOrReplaceTempView("df")
> val sqlDF = spark.sql("SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")
>
> https://spark.apache.org/docs/latest/sql-programming-guide.html#sql
>
>
>
> On Fri, Jun 16, 2017 at 11:28 PM, Pralabh Kumar 
> wrote:
>
>> Hi Saatvik
>>
>> You can write your own transformer to make sure that column contains
>> ,value which u provided , and filter out rows which doesn't follow the
>> same.
>>
>> Something like this
>>
>>
>> import org.apache.spark.annotation.DeveloperApi
>> import org.apache.spark.ml.Transformer
>> import org.apache.spark.ml.param.ParamMap
>> import org.apache.spark.sql.{DataFrame, Dataset}
>> import org.apache.spark.sql.types.StructType
>>
>> // Keeps only the rows whose col1 value is in the allowed set ('happy' in this example).
>> case class CategoryTransformer(override val uid: String) extends Transformer {
>>   override def transform(inputData: Dataset[_]): DataFrame = {
>>     inputData.select("col1").filter("col1 in ('happy')")
>>   }
>>   override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
>>   @DeveloperApi
>>   override def transformSchema(schema: StructType): StructType = {
>>     schema
>>   }
>> }
>>
>>
>> Usage
>>
>> val data = sc.parallelize(List("abce","happy")).toDF("col1")
>> val trans = new CategoryTransformer("1")
>> data.show()
>> trans.transform(data).show()
>>
>>
>> This transformer will make sure , you always have values in col1 as
>> provided by you.
>>
>>
>> Regards
>> Pralabh Kumar
>>
>> On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah 
>> wrote:
>>
>>> Hi Pralabh,
>>>
>>> I want the ability to create a column such that its values be restricted
>>> to a specific set of predefined values.
>>> For example, suppose I have a column called EMOTION: I want to ensure
>>> each row value is one of HAPPY,SAD,ANGRY,NEUTRAL,NA.
>>>
>>> Thanks and Regards,
>>> Saatvik Shah
>>>
>>>
>>> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar 
>>> wrote:
>>>
 Hi satvik

 Can u please provide an example of what exactly you want.



 On 16-Jun-2017 7:40 PM, "Saatvik Shah" 
 wrote:

> Hi Yan,
>
> Basically the reason I was looking for the categorical datatype is as
> given here
> :
> ability to fix column values to specific categories. Is it possible to
> create a user defined data type which could do so?
>
> Thanks and Regards,
> Saatvik Shah
>
> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) 
> wrote:
>
>> You can use some Transformers to handle categorical data,
>> For example,
>> StringIndexer encodes a string column of labels to a column of label
>> indices:
>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>>
>>
>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
>> saatvikshah1...@gmail.com> wrote:
>>
>>> Hi,
>>> I'm trying to convert a Pandas -> Spark dataframe. One of the
>>> columns I have
>>> is of the Category type in Pandas. But there does not seem to be
>>> support for
>>> this same type in Spark. What is the best alternative?
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-
>>> Spark-Dataframe-tp28764.html
>>> Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>>
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>
>
> --
> *Saatvik Shah,*
> *1st  Year,*
> *Masters in the School of Computer Science,*
> *Carnegie Mellon University*
>
> *https://saatvikshah1994.github.io/
> *
>

>>>
>>>
>>> --
>>> *Saatvik Shah,*
>>> *1st  Year,*
>>> *Masters in the School of Computer Science,*
>>> *Carnegie Mellon University*
>>>
>>> *https://saatvikshah1994.github.io/ *
>>>
>>
>>
>


Re: Error while doing mvn release for spark 2.0.2 using scala 2.10

2017-06-17 Thread Kanagha Kumar
Hi,

Bumping this up again! Why do the Spark modules still depend on the Scala 2.11
versions in spite of changing the pom.xml files using ./dev/change-scala-version.sh 2.10?
Appreciate any quick help!

Thanks

On Fri, Jun 16, 2017 at 2:59 PM, Kanagha Kumar 
wrote:

> Hey all,
>
>
> I'm trying to use Spark 2.0.2 with scala 2.10 by following this
> https://spark.apache.org/docs/2.0.2/building-spark.
> html#building-for-scala-210
>
> ./dev/change-scala-version.sh 2.10
> ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
>
>
> I could build the distribution successfully using
> bash -xv dev/make-distribution.sh --tgz  -Dscala-2.10 -DskipTests
>
> But when I try to do a Maven release using the command below, it keeps failing
> with the following error:
>
>
> Executing Maven:  -B -f pom.xml  -DscmCommentPrefix=[maven-release-plugin]
> -e  -Dscala-2.10 -Pyarn -Phadoop-2.7 -Phadoop-provided -DskipTests
> -Dresume=false -U -X *release:prepare release:perform*
>
> Failed to execute goal on project spark-sketch_2.10: Could not resolve
> dependencies for project 
> org.apache.spark:spark-sketch_2.10:jar:2.0.2-sfdc-3.0.0:
> *Failure to find org.apache.spark:spark-tags_2.11:jar:2.0.2-sfdc-3.0.0*
> in  was cached in the local repository, resolution will
> not be reattempted until the update interval of nexus has elapsed or
> updates are forced - [Help 1]
>
>
> Why does spark-sketch depend on spark-tags_2.11 when I have already
> compiled against Scala 2.10? Any pointers would be helpful.
> Thanks
> Kanagha
>


Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Yan Facai
To filter data, how about using sql?

df.createOrReplaceTempView("df")
val sqlDF = spark.sql("SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")

https://spark.apache.org/docs/latest/sql-programming-guide.html#sql
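
The same filter through the DataFrame API, in case the temp view is unwanted
(column name and values are just the placeholders from earlier in the thread):

import org.apache.spark.sql.functions.col

val filtered = df.filter(col("EMOTION").isin("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA"))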



On Fri, Jun 16, 2017 at 11:28 PM, Pralabh Kumar 
wrote:

> Hi Saatvik
>
> You can write your own transformer to make sure that column contains
> ,value which u provided , and filter out rows which doesn't follow the
> same.
>
> Something like this
>
>
> import org.apache.spark.annotation.DeveloperApi
> import org.apache.spark.ml.Transformer
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.{DataFrame, Dataset}
> import org.apache.spark.sql.types.StructType
>
> // Keeps only the rows whose col1 value is in the allowed set ('happy' in this example).
> case class CategoryTransformer(override val uid: String) extends Transformer {
>   override def transform(inputData: Dataset[_]): DataFrame = {
>     inputData.select("col1").filter("col1 in ('happy')")
>   }
>   override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
>   @DeveloperApi
>   override def transformSchema(schema: StructType): StructType = {
>     schema
>   }
> }
>
>
> Usage
>
> val data = sc.parallelize(List("abce","happy")).toDF("col1")
> val trans = new CategoryTransformer("1")
> data.show()
> trans.transform(data).show()
>
>
> This transformer will make sure , you always have values in col1 as
> provided by you.
>
>
> Regards
> Pralabh Kumar
>
> On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah 
> wrote:
>
>> Hi Pralabh,
>>
>> I want the ability to create a column such that its values be restricted
>> to a specific set of predefined values.
>> For example, suppose I have a column called EMOTION: I want to ensure
>> each row value is one of HAPPY,SAD,ANGRY,NEUTRAL,NA.
>>
>> Thanks and Regards,
>> Saatvik Shah
>>
>>
>> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar 
>> wrote:
>>
>>> Hi satvik
>>>
>>> Can u please provide an example of what exactly you want.
>>>
>>>
>>>
>>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" 
>>> wrote:
>>>
 Hi Yan,

 Basically the reason I was looking for the categorical datatype is as
 given here
 :
 ability to fix column values to specific categories. Is it possible to
 create a user defined data type which could do so?

 Thanks and Regards,
 Saatvik Shah

 On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) 
 wrote:

> You can use some Transformers to handle categorical data,
> For example,
> StringIndexer encodes a string column of labels to a column of label
> indices:
> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>
>
> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
> saatvikshah1...@gmail.com> wrote:
>
>> Hi,
>> I'm trying to convert a Pandas -> Spark dataframe. One of the columns
>> I have
>> is of the Category type in Pandas. But there does not seem to be
>> support for
>> this same type in Spark. What is the best alternative?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-
>> Spark-Dataframe-tp28764.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


 --
 *Saatvik Shah,*
 *1st  Year,*
 *Masters in the School of Computer Science,*
 *Carnegie Mellon University*

 *https://saatvikshah1994.github.io/
 *

>>>
>>
>>
>> --
>> *Saatvik Shah,*
>> *1st  Year,*
>> *Masters in the School of Computer Science,*
>> *Carnegie Mellon University*
>>
>> *https://saatvikshah1994.github.io/ *
>>
>
>


Re: Spark-Kafka integration - build failing with sbt

2017-06-17 Thread karan alang
Thanks, Cody. Yes, I was able to fix that.

On Sat, Jun 17, 2017 at 1:18 PM, Cody Koeninger  wrote:

> There are different projects for different versions of kafka,
> spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10
>
> See
>
> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
>
> On Fri, Jun 16, 2017 at 6:51 PM, karan alang 
> wrote:
> > I'm trying to compile kafka & Spark Streaming integration code i.e.
> reading
> > from Kafka using Spark Streaming,
> >   and the sbt build is failing with error -
> >
> >   [error] (*:update) sbt.ResolveException: unresolved dependency:
> > org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found
> >
> >   Scala version -> 2.10.7
> >   Spark Version -> 2.1.0
> >   Kafka version -> 0.9
> >   sbt version -> 0.13
> >
> > Contents of sbt files is as shown below ->
> >
> > 1)
> >   vi spark_kafka_code/project/plugins.sbt
> >
> >   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
> >
> >  2)
> >   vi spark_kafka_code/sparkkafka.sbt
> >
> > import AssemblyKeys._
> > assemblySettings
> >
> > name := "SparkKafka Project"
> >
> > version := "1.0"
> > scalaVersion := "2.11.7"
> >
> > val sparkVers = "2.1.0"
> >
> > // Base Spark-provided dependencies
> > libraryDependencies ++= Seq(
> >   "org.apache.spark" %% "spark-core" % sparkVers % "provided",
> >   "org.apache.spark" %% "spark-streaming" % sparkVers % "provided",
> >   "org.apache.spark" %% "spark-streaming-kafka" % sparkVers)
> >
> > mergeStrategy in assembly := {
> >   case m if m.toLowerCase.endsWith("manifest.mf") =>
> MergeStrategy.discard
> >   case m if m.toLowerCase.startsWith("META-INF")  =>
> MergeStrategy.discard
> >   case "reference.conf"   => MergeStrategy.concat
> >   case m if m.endsWith("UnusedStubClass.class")   =>
> MergeStrategy.discard
> >   case _ => MergeStrategy.first
> > }
> >
> >   i launch sbt, and then try to create an eclipse project, complete
> error is
> > as shown below -
> >
> >   -
> >
> >   sbt
> > [info] Loading global plugins from /Users/karanalang/.sbt/0.13/plugins
> > [info] Loading project definition from
> > /Users/karanalang/Documents/Technology/Coursera_spark_
> scala/spark_kafka_code/project
> > [info] Set current project to SparkKafka Project (in build
> > file:/Users/karanalang/Documents/Technology/Coursera_
> spark_scala/spark_kafka_code/)
> >> eclipse
> > [info] About to create Eclipse project files for your project(s).
> > [info] Updating
> > {file:/Users/karanalang/Documents/Technology/Coursera_
> spark_scala/spark_kafka_code/}spark_kafka_code...
> > [info] Resolving org.apache.spark#spark-streaming-kafka_2.11;2.1.0 ...
> > [warn] module not found:
> > org.apache.spark#spark-streaming-kafka_2.11;2.1.0
> > [warn]  local: tried
> > [warn]
> > /Users/karanalang/.ivy2/local/org.apache.spark/spark-
> streaming-kafka_2.11/2.1.0/ivys/ivy.xml
> > [warn]  activator-launcher-local: tried
> > [warn]
> > /Users/karanalang/.activator/repository/org.apache.spark/
> spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
> > [warn]  activator-local: tried
> > [warn]
> > /Users/karanalang/Documents/Technology/SCALA/activator-
> dist-1.3.10/repository/org.apache.spark/spark-streaming-
> kafka_2.11/2.1.0/ivys/ivy.xml
> > [warn]  public: tried
> > [warn]
> > https://repo1.maven.org/maven2/org/apache/spark/spark-
> streaming-kafka_2.11/2.1.0/spark-streaming-kafka_2.11-2.1.0.pom
> > [warn]  typesafe-releases: tried
> > [warn]
> > http://repo.typesafe.com/typesafe/releases/org/apache/
> spark/spark-streaming-kafka_2.11/2.1.0/spark-streaming-
> kafka_2.11-2.1.0.pom
> > [warn]  typesafe-ivy-releasez: tried
> > [warn]
> > http://repo.typesafe.com/typesafe/ivy-releases/org.
> apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
> > [info] Resolving jline#jline;2.12.1 ...
> > [warn] ::
> > [warn] ::  UNRESOLVED DEPENDENCIES ::
> > [warn] ::
> > [warn] :: org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not
> found
> > [warn] ::
> > [warn]
> > [warn] Note: Unresolved dependencies path:
> > [warn] org.apache.spark:spark-streaming-kafka_2.11:2.1.0
> > (/Users/karanalang/Documents/Technology/Coursera_spark_
> scala/spark_kafka_code/sparkkafka.sbt#L12-16)
> > [warn]   +- sparkkafka-project:sparkkafka-project_2.11:1.0
> > [trace] Stack trace suppressed: run last *:update for the full output.
> > [error] (*:update) sbt.ResolveException: unresolved dependency:
> > org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found
> > [info] Updating
> > {file:/Users/karanalang/Documents/Technology/Coursera_
> spark_scala/spark_kafka_code/}spark_kafka_code...
> > [info] Resolving org.apache.spark#spark-streaming-kafka_2.11;2.1.0 ...
> > [warn] module not found:
> > 

Re: Spark-Kafka integration - build failing with sbt

2017-06-17 Thread Cody Koeninger
There are different projects for different versions of kafka,
spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10

See

http://spark.apache.org/docs/latest/streaming-kafka-integration.html
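
For example (just a sketch, assuming Spark 2.1.0 with a Kafka 0.9 broker, which
falls under the 0-8 integration), the sbt dependency would look something like:

// Pick the artifact that matches your Kafka integration:
// spark-streaming-kafka-0-8 for Kafka 0.8/0.9 brokers, spark-streaming-kafka-0-10 for 0.10+.
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.1.0"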

On Fri, Jun 16, 2017 at 6:51 PM, karan alang  wrote:
> I'm trying to compile kafka & Spark Streaming integration code i.e. reading
> from Kafka using Spark Streaming,
>   and the sbt build is failing with error -
>
>   [error] (*:update) sbt.ResolveException: unresolved dependency:
> org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found
>
>   Scala version -> 2.10.7
>   Spark Version -> 2.1.0
>   Kafka version -> 0.9
>   sbt version -> 0.13
>
> Contents of sbt files is as shown below ->
>
> 1)
>   vi spark_kafka_code/project/plugins.sbt
>
>   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
>
>  2)
>   vi spark_kafka_code/sparkkafka.sbt
>
> import AssemblyKeys._
> assemblySettings
>
> name := "SparkKafka Project"
>
> version := "1.0"
> scalaVersion := "2.11.7"
>
> val sparkVers = "2.1.0"
>
> // Base Spark-provided dependencies
> libraryDependencies ++= Seq(
>   "org.apache.spark" %% "spark-core" % sparkVers % "provided",
>   "org.apache.spark" %% "spark-streaming" % sparkVers % "provided",
>   "org.apache.spark" %% "spark-streaming-kafka" % sparkVers)
>
> mergeStrategy in assembly := {
>   case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
>   case m if m.toLowerCase.startsWith("META-INF")  => MergeStrategy.discard
>   case "reference.conf"   => MergeStrategy.concat
>   case m if m.endsWith("UnusedStubClass.class")   => MergeStrategy.discard
>   case _ => MergeStrategy.first
> }
>
>   i launch sbt, and then try to create an eclipse project, complete error is
> as shown below -
>
>   -
>
>   sbt
> [info] Loading global plugins from /Users/karanalang/.sbt/0.13/plugins
> [info] Loading project definition from
> /Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/project
> [info] Set current project to SparkKafka Project (in build
> file:/Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/)
>> eclipse
> [info] About to create Eclipse project files for your project(s).
> [info] Updating
> {file:/Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/}spark_kafka_code...
> [info] Resolving org.apache.spark#spark-streaming-kafka_2.11;2.1.0 ...
> [warn] module not found:
> org.apache.spark#spark-streaming-kafka_2.11;2.1.0
> [warn]  local: tried
> [warn]
> /Users/karanalang/.ivy2/local/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
> [warn]  activator-launcher-local: tried
> [warn]
> /Users/karanalang/.activator/repository/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
> [warn]  activator-local: tried
> [warn]
> /Users/karanalang/Documents/Technology/SCALA/activator-dist-1.3.10/repository/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
> [warn]  public: tried
> [warn]
> https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka_2.11/2.1.0/spark-streaming-kafka_2.11-2.1.0.pom
> [warn]  typesafe-releases: tried
> [warn]
> http://repo.typesafe.com/typesafe/releases/org/apache/spark/spark-streaming-kafka_2.11/2.1.0/spark-streaming-kafka_2.11-2.1.0.pom
> [warn]  typesafe-ivy-releasez: tried
> [warn]
> http://repo.typesafe.com/typesafe/ivy-releases/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
> [info] Resolving jline#jline;2.12.1 ...
> [warn] ::
> [warn] ::  UNRESOLVED DEPENDENCIES ::
> [warn] ::
> [warn] :: org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found
> [warn] ::
> [warn]
> [warn] Note: Unresolved dependencies path:
> [warn] org.apache.spark:spark-streaming-kafka_2.11:2.1.0
> (/Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/sparkkafka.sbt#L12-16)
> [warn]   +- sparkkafka-project:sparkkafka-project_2.11:1.0
> [trace] Stack trace suppressed: run last *:update for the full output.
> [error] (*:update) sbt.ResolveException: unresolved dependency:
> org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found
> [info] Updating
> {file:/Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/}spark_kafka_code...
> [info] Resolving org.apache.spark#spark-streaming-kafka_2.11;2.1.0 ...
> [warn] module not found:
> org.apache.spark#spark-streaming-kafka_2.11;2.1.0
> [warn]  local: tried
> [warn]
> /Users/karanalang/.ivy2/local/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
> [warn]  activator-launcher-local: tried
> [warn]
> /Users/karanalang/.activator/repository/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
> [warn]  activator-local: tried
> [warn]
> 

Build spark without hive issue, spark-sql doesn't work.

2017-06-17 Thread wuchang
I want to build Hive and Spark so that my Hive runs on the Spark engine.
I chose Hive 2.3.0 and Spark 2.0.0, which the Hive official documentation claims
are compatible.
According to the Hive official documentation, I have to build Spark without the
Hive profile to avoid a conflict between the original Hive and the Spark-integrated
Hive. The build succeeds, but then the problem comes: I cannot use spark-sql
anymore, because spark-sql relies on the Hive library and my Spark is a no-Hive
build.


[appuser@ab-10-11-22-209 spark]$ spark-sql
java.lang.ClassNotFoundException: 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:686)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Failed to load main class 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
You need to build Spark with -Phive and -Phive-thriftserver.

How can I build and set up Spark so that Hive on Spark works properly, and so
that spark-sql, pyspark, and spark-shell work properly as well?


I don't understand the relationship between the Spark-integrated Hive and the
original Hive. Below are the Spark-integrated Hive jars:

hive-beeline-1.2.1.spark2.jar
hive-cli-1.2.1.spark2.jar
hive-exec-1.2.1.spark2.jar
hive-jdbc-1.2.1.spark2.jar
hive-metastore-1.2.1.spark2.jar
spark-hive_2.11-2.0.0.jar
spark-hive-thriftserver_2.11-2.0.0.jar

It seems that Spark 2.0.0 relies on Hive 1.2.1.

Build spark without hive issue, spark-sql doesn't work.

2017-06-17 Thread wuchang
I want to build Hive and Spark so that my Hive works on the Spark engine.
I chose Hive 2.3.0 and Spark 2.0.0, which the Hive official documentation claims
are compatible.
According to the Hive official documentation, I have to build Spark without the
Hive profile to avoid a conflict between the original Hive and the Spark-integrated
Hive. The build succeeds, but then the problem comes: I cannot use spark-sql
anymore, because spark-sql relies on the Hive library and my Spark is a no-Hive
build.

I don't understand the relationship between the Spark-integrated Hive and the
original Hive. Below are the Spark-integrated Hive jars:

hive-beeline-1.2.1.spark2.jar
hive-cli-1.2.1.spark2.jar
hive-exec-1.2.1.spark2.jar
hive-jdbc-1.2.1.spark2.jar
hive-metastore-1.2.1.spark2.jar
spark-hive_2.11-2.0.0.jar
spark-hive-thriftserver_2.11-2.0.0.jar

It seems that Spark 2.0.0 relies on Hive 1.2.1.
How can I build and set up Spark so that Hive on Spark works properly, and so
that spark-sql, pyspark, and spark-shell work properly as well?



difference between spark-integrated hive and original hive

2017-06-17 Thread wuchang
I want to build Hive and Spark so that my Hive runs on the Spark engine.
I chose Hive 2.3.0 and Spark 2.0.0, which the Hive official documentation claims
are compatible.
According to the Hive official documentation, I have to build Spark without the
Hive profile to avoid a conflict between the original Hive and the Spark-integrated
Hive. The build succeeds, but then the problem comes: I cannot use spark-sql
anymore, because spark-sql relies on the Hive library and my Spark is a no-Hive
build.

I don't understand the relationship between the Spark-integrated Hive and the
original Hive. Below are the Spark-integrated Hive jars:

hive-beeline-1.2.1.spark2.jar
hive-cli-1.2.1.spark2.jar
hive-exec-1.2.1.spark2.jar
hive-jdbc-1.2.1.spark2.jar
hive-metastore-1.2.1.spark2.jar
spark-hive_2.11-2.0.0.jar
spark-hive-thriftserver_2.11-2.0.0.jar
It seems that Spark 2.0.0 relies on Hive 1.2.1.

Can I just add my Hive 2.3.0 libraries to the classpath of Spark?