Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Saatvik Shah
Thanks guys,

You all have given me a number of options to work with.

The thing is that I'm working in a production environment where it might be
necessary to ensure that no one erroneously inserts invalid records into
those specific columns, which should behave like the Category data type. The
best alternative there would be a Category-like DataFrame column datatype,
without the additional overhead of running a transformer. Is that possible?
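
One way to approximate this, sketched here under stated assumptions (Spark
has no Category column type; the helper requireCategorical, the column name,
and the allowed set are all illustrative, not an existing API): run a cheap
validation at write time that fails fast instead of silently filtering.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val allowed = Seq("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA")

// Count the rows that violate the category constraint and abort the
// write when any are found, instead of dropping them silently.
def requireCategorical(df: DataFrame, column: String): DataFrame = {
  val invalid = df.filter(!col(column).isin(allowed: _*)).count()
  require(invalid == 0L, s"$invalid rows of '$column' fall outside $allowed")
  df
}

// Guard an insert path with, e.g., requireCategorical(newRecords, "EMOTION").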

Thanks and Regards,
Saatvik


Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Pralabh Kumar
make sense :)



Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Yan Facai
Yes, perhaps we could use SQLTransformer as well.

http://spark.apache.org/docs/latest/ml-features.html#sqltransformer
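
A minimal sketch of that, reusing the EMOTION example from this thread
(__THIS__ is the placeholder SQLTransformer substitutes for the input
dataset; df is an assumed input DataFrame with a string EMOTION column):

import org.apache.spark.ml.feature.SQLTransformer

val sqlTrans = new SQLTransformer().setStatement(
  "SELECT * FROM __THIS__ WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")
val filtered = sqlTrans.transform(df) // rows with other EMOTION values are dropped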



Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Pralabh Kumar
Hi Yan

Yes, SQL is a good option, but if we have to create an ML Pipeline, then
having transformers and setting them into the pipeline stages would be the
better option; a sketch follows below.

Regards
Pralabh Kumar
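
A sketch of that pipeline shape, combining the SQLTransformer filter with
the StringIndexer mentioned earlier in the thread (df, the stage wiring,
and the output column name are illustrative assumptions):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{SQLTransformer, StringIndexer}

// Stage 1: keep only rows whose EMOTION value is in the allowed set.
val filterStage = new SQLTransformer().setStatement(
  "SELECT * FROM __THIS__ WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")
// Stage 2: encode the remaining labels as numeric indices.
val indexStage = new StringIndexer()
  .setInputCol("EMOTION")
  .setOutputCol("EMOTION_index")

val pipeline = new Pipeline().setStages(Array(filterStage, indexStage))
val clean = pipeline.fit(df).transform(df)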



Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Yan Facai
To filter data, how about using SQL? (The string literals need single
quotes; unquoted, Spark would parse them as column names.)

df.createOrReplaceTempView("df")
val sqlDF = spark.sql(
  "SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")

https://spark.apache.org/docs/latest/sql-programming-guide.html#sql
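
For comparison, the same filter through the DataFrame API (a sketch; df is
the DataFrame registered above, and isin does the set-membership test):

import org.apache.spark.sql.functions.col

val filtered = df.filter(col("EMOTION").isin("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA"))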





Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Pralabh Kumar
Hi Saatvik

You can write your own transformer to make sure that the column contains
only the values you provide, and to filter out rows which don't comply.

Something like this (imports added; the allowed set is hard-coded to
'happy' to keep the example short):

import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

case class CategoryTransformer(override val uid: String) extends Transformer {
  // Keep only the rows whose col1 value is in the allowed set.
  override def transform(inputData: Dataset[_]): DataFrame = {
    inputData.select("col1").filter("col1 in ('happy')")
  }
  // defaultCopy re-instantiates the transformer with the same uid and params.
  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
  @DeveloperApi
  override def transformSchema(schema: StructType): StructType = schema
}

Usage (in spark-shell, where sc and the toDF implicits are in scope):

val data = sc.parallelize(List("abce", "happy")).toDF("col1")
val trans = new CategoryTransformer("1")
data.show()
trans.transform(data).show() // only the 'happy' row survives


This transformer will make sure that col1 only ever contains the values you
allow.


Regards
Pralabh Kumar



Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Saatvik Shah
Hi Pralabh,

I want the ability to create a column whose values are restricted to a
specific set of predefined values.
For example, suppose I have a column called EMOTION: I want to ensure that
each row value is one of HAPPY, SAD, ANGRY, NEUTRAL, or NA.

Thanks and Regards,
Saatvik Shah



-- 
*Saatvik Shah,*
*1st Year,*
*Masters in the School of Computer Science,*
*Carnegie Mellon University*

*https://saatvikshah1994.github.io/*


Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Pralabh Kumar
Hi Saatvik

Can you please provide an example of what exactly you want?





Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Saatvik Shah
Hi Yan,

Basically, the reason I was looking for the categorical datatype is as
given here: the ability to fix column values to specific categories. Is it
possible to create a user-defined data type which could do so?

Thanks and Regards,
Saatvik Shah



-- 
*Saatvik Shah,*
*1st Year,*
*Masters in the School of Computer Science,*
*Carnegie Mellon University*

*https://saatvikshah1994.github.io/*


Re: Best alternative for Category Type in Spark Dataframe

2017-06-15 Thread Yan Facai
You can use some Transformers to handle categorical data.
For example, StringIndexer encodes a string column of labels to a column of
label indices:
http://spark.apache.org/docs/latest/ml-features.html#stringindexer
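
A minimal usage sketch, assuming a DataFrame df with a string column named
EMOTION (the output column name is illustrative):

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("EMOTION")
  .setOutputCol("EMOTION_index")
// fit() learns the label-to-index mapping; transform() applies it.
val indexed = indexer.fit(df).transform(df)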


On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994  wrote:

> Hi,
> I'm trying to convert a Pandas -> Spark dataframe. One of the columns I
> have is of the Category type in Pandas. But there does not seem to be
> support for this same type in Spark. What is the best alternative?