Hi,

You can use the Column functions provided by the Spark SQL API:

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
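
For example, here is a minimal sketch (assuming your DataFrame is
named input, as in your snippet) that rewrites both columns in a
single select, using regexp_replace and when:

    import org.apache.spark.sql.functions._

    val result = input.select(
      // strip the trailing "s" ("90s" -> "90") and cast; empty or
      // malformed values become null
      regexp_replace(col("age"), "s$", "").cast("int").as("age"),
      // map the codes; anything unmatched becomes null (adjust the
      // literals if gender is a numeric column)
      when(col("gender") === "1", "male")
        .when(col("gender") === "2", "female")
        .as("gender"),
      col("city_id"))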

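If you would rather have this as a pipeline stage, one option is a
custom Transformer wrapping the same logic. A rough sketch against the
1.6 API linked above (the class and uid names are made up):

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.StructType

    class CleanupTransformer(override val uid: String) extends Transformer {
      def this() = this(Identifiable.randomUID("cleanup"))

      override def transform(df: DataFrame): DataFrame =
        df.withColumn("age",
            regexp_replace(col("age"), "s$", "").cast("int"))
          .withColumn("gender",
            when(col("gender") === "1", "male")
              .when(col("gender") === "2", "female"))

      // simplified: a full implementation should return the schema
      // with the updated column types
      override def transformSchema(schema: StructType): StructType = schema

      override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
    }
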
Hope this helps.

Thanks,
Divya


On 17 November 2016 at 12:08, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:

> Hi,
> I have a sample like this:
> +---+------+--------------------+
> |age|gender|             city_id|
> +---+------+--------------------+
> |   |     1|1042015:city_2044...|
> |90s|     2|1042015:city_2035...|
> |80s|     2|1042015:city_2061...|
> +---+------+--------------------+
>
> and expectation is:
> "age":  90s -> 90, 80s -> 80
> "gender": 1 -> "male", 2 -> "female"
>
> I have two solutions:
> 1. Handle each column separately, and then join them all by index.
>     val age = input.select("age").map(...)
>     val gender = input.select("gender").map(...)
>     val result = ...
>
> 2. Write a UDF for each column, and then use them together:
>      val result = input.select(ageUDF($"age"), genderUDF($"gender"))
>
> However, both are awkward.
>
> Does anyone have a better workflow?
> Write some custom Transformers and use a Pipeline?
>
> Thanks.