To filter data, how about using SQL? Note that the category values must be
quoted as string literals:

df.createOrReplaceTempView("df")
val sqlDF = spark.sql("SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")
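[Editor's aside: the IN-clause quoting trips people up, so here is the same filter run against an in-memory SQLite table rather than Spark. The table and column names are illustrative; the SQL semantics of IN with string literals are the same.]

```python
# Stand-in for the Spark SQL filter above: an IN-clause over string
# literals against an in-memory SQLite table (stdlib only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (emotion TEXT)")
conn.executemany(
    "INSERT INTO df VALUES (?)",
    [("HAPPY",), ("SAD",), ("confused",), ("NA",)],  # "confused" is not allowed
)

# Each allowed value is a quoted string literal, exactly as in the
# Spark SQL statement above.
rows = conn.execute(
    "SELECT emotion FROM df "
    "WHERE emotion IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')"
).fetchall()
kept = [r[0] for r in rows]
print(kept)  # "confused" is filtered out
conn.close()
```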
https://spark.apache.org/docs/latest/sql-programming-guide.html#sql

On Fri, Jun 16, 2017 at 11:28 PM, Pralabh Kumar <pralabhku...@gmail.com> wrote:

> Hi Saatvik
>
> You can write your own Transformer to make sure that the column contains
> only the values you provided, and filter out rows which don't follow the
> same.
>
> Something like this:
>
> import org.apache.spark.annotation.DeveloperApi
> import org.apache.spark.ml.Transformer
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types.StructType
>
> case class CategoryTransformer(override val uid: String) extends Transformer {
>   override def transform(inputData: DataFrame): DataFrame = {
>     inputData.select("col1").filter("col1 in ('happy')")
>   }
>   override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
>   @DeveloperApi
>   override def transformSchema(schema: StructType): StructType = {
>     schema
>   }
> }
>
> Usage:
>
> val data = sc.parallelize(List("abce", "happy")).toDF("col1")
> val trans = new CategoryTransformer("1")
> data.show()
> trans.transform(data).show()
>
> This transformer will make sure you always have the values you provided
> in col1.
>
> Regards
> Pralabh Kumar
>
> On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah <saatvikshah1...@gmail.com> wrote:
>
>> Hi Pralabh,
>>
>> I want the ability to create a column such that its values are restricted
>> to a specific set of predefined values.
>> For example, suppose I have a column called EMOTION: I want to ensure
>> each row value is one of HAPPY, SAD, ANGRY, NEUTRAL, NA.
>>
>> Thanks and Regards,
>> Saatvik Shah
>>
>> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar <pralabhku...@gmail.com> wrote:
>>
>>> Hi Saatvik
>>>
>>> Can you please provide an example of what exactly you want.
>>>
>>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" <saatvikshah1...@gmail.com> wrote:
>>>
>>>> Hi Yan,
>>>>
>>>> Basically, the reason I was looking for the categorical datatype is as
>>>> given here
>>>> <https://pandas.pydata.org/pandas-docs/stable/categorical.html>:
>>>> the ability to fix column values to specific categories. Is it possible
>>>> to create a user-defined data type which could do so?
>>>>
>>>> Thanks and Regards,
>>>> Saatvik Shah
>>>>
>>>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) <facai....@gmail.com> wrote:
>>>>
>>>>> You can use some Transformers to handle categorical data.
>>>>> For example, StringIndexer encodes a string column of labels to a
>>>>> column of label indices:
>>>>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>>>>>
>>>>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <saatvikshah1...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I'm trying to convert a Pandas -> Spark dataframe. One of the columns
>>>>>> I have is of the Category type in Pandas. But there does not seem to
>>>>>> be support for this same type in Spark. What is the best alternative?
>>>>>>
>>>>>> --
>>>>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-Spark-Dataframe-tp28764.html
>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>
>>>> --
>>>> Saatvik Shah,
>>>> 1st Year,
>>>> Masters in the School of Computer Science,
>>>> Carnegie Mellon University
>>>> https://saatvikshah1994.github.io/
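[Editor's aside: every suggestion in this thread (a SQL IN filter, a custom Transformer, StringIndexer plus validation) reduces to checking membership in a fixed category set. A minimal sketch of that logic in plain Python, with no Spark dependency; the function and column names are illustrative, not any Spark API.]

```python
# Sketch of the category-restriction logic discussed in the thread:
# keep only rows whose column value is in a fixed, predefined set.
ALLOWED_EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA"}

def filter_to_categories(rows, column, allowed):
    """Drop rows whose `column` value is outside `allowed`."""
    return [row for row in rows if row.get(column) in allowed]

data = [
    {"id": 1, "EMOTION": "HAPPY"},
    {"id": 2, "EMOTION": "confused"},  # not an allowed category
    {"id": 3, "EMOTION": "NA"},
]
clean = filter_to_categories(data, "EMOTION", ALLOWED_EMOTIONS)
print([row["id"] for row in clean])  # row 2 is dropped
```

In Spark terms, this is what `df.filter(col("EMOTION").isin(...))` does; unlike a pandas Categorical, it filters bad rows rather than turning them into nulls.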