Re: MLlib, Java, and DataFrame

Marco Mistroni Fri, 22 Jul 2016 14:01:18 -0700

Hi Inam,
  I sorted it.
I am replying to all in case anyone else follows the blog and runs into the
same issue.


- First off, the environment. I tested the sample using purely
spark-1.6.1, no Hive, no Hadoop. I launched pyspark as follows:
  pyspark --packages com.databricks:spark-csv_2.10:1.4.0

- Secondly, please note that when I call printSchema (at step 1) the column
'Churn' is listed as 'boolean', not as 'string' as in the blog. This might
be due to the spark-csv version I am using (1.4.0).

>>> CV_data.printSchema()
root
 |-- State: string (nullable = true)
 |-- Account length: integer (nullable = true)
 |-- Area code: integer (nullable = true)
 |-- International plan: string (nullable = true)
 |-- Voice mail plan: string (nullable = true)
 |-- Number vmail messages: integer (nullable = true)
 |-- Total day minutes: double (nullable = true)
 |-- Total day calls: integer (nullable = true)
 |-- Total day charge: double (nullable = true)
 |-- Total eve minutes: double (nullable = true)
 |-- Total eve calls: integer (nullable = true)
 |-- Total eve charge: double (nullable = true)
 |-- Total night minutes: double (nullable = true)
 |-- Total night calls: integer (nullable = true)
 |-- Total night charge: double (nullable = true)
 |-- Total intl minutes: double (nullable = true)
 |-- Total intl calls: integer (nullable = true)
 |-- Total intl charge: double (nullable = true)
 |-- Customer service calls: integer (nullable = true)
 |-- Churn: boolean (nullable = true)



- Thirdly, at step 6, please replace the binary_map definition with the
following.

As I said, Churn is not a string column but a boolean, and therefore the
toNum function will fail when it looks up a boolean value.

binary_map = {'Yes':1.0, 'No':0.0, True:1.0, False:0.0}
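
To illustrate why the extra keys matter, here is a minimal pure-Python sketch of the lookup, with no Spark needed. The names string_only_map, to_num and try_to_num are hypothetical, and string_only_map is only my assumption of what a string-keyed version of the map might look like; it is not copied from the blog.

```python
# Hypothetical string-only map, assumed to resemble a version keyed by strings.
string_only_map = {'Yes': 1.0, 'No': 0.0, 'True': 1.0, 'False': 0.0}

# Map from this message: accepts the string flags and Python booleans.
binary_map = {'Yes': 1.0, 'No': 0.0, True: 1.0, False: 0.0}


def to_num(value, mapping):
    """Look up one cell value; raises KeyError if the value has no entry."""
    return mapping[value]


def try_to_num(value, mapping):
    """Same lookup, but report a missing key instead of raising."""
    try:
        return to_num(value, mapping)
    except KeyError:
        return 'KeyError'


if __name__ == '__main__':
    # Compare both maps on string flags and on boolean cells.
    for cell in ['Yes', 'No', True, False]:
        print(repr(cell),
              try_to_num(cell, string_only_map),
              try_to_num(cell, binary_map))
```

With a boolean Churn column the cells arrive as True/False rather than 'True'/'False', so only the second map has an entry for them.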

I managed to reach step 7 without any issues (I don't have
matplotlib, so I skipped step 5, which I guess is irrelevant as it just
displays the data rather than doing any logic).

Please let me know if this fixes your problem.

hth

 marco

On Fri, Jul 22, 2016 at 6:34 PM, Inam Ur Rehman <inam.rehma...@gmail.com>
wrote:

> Hello guys, I know it's irrelevant to this topic but I've been looking
> desperately for a solution. I am facing an exception:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html
>
> Please help me. I couldn't find any solution.
>
> On Fri, Jul 22, 2016 at 5:50 PM, Jean Georges Perrin <j...@jgp.net> wrote:
>
>> Thanks Marco - I like the idea of sticking with DataFrames ;)
>>
>>
>> On Jul 22, 2016, at 7:07 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>
>> Hello Jean
>>  you can take your current DataFrame and send it to mllib (I was doing
>> that because I didn't know the ml package), but the process is a little
>> bit cumbersome
>>
>>
>> 1. go from DataFrame to an RDD of LabeledPoint
>> 2. run your ML model
>>
>> I'd suggest you stick to DataFrame + ml package :)
>>
>> hth
>>
>>
>>
>> On Fri, Jul 22, 2016 at 4:41 AM, Jean Georges Perrin <j...@jgp.net> wrote:
>>
>>> Hi,
>>>
>>> I am looking for some really super basic examples of MLlib (like a
>>> linear regression over a list of values) in Java. I have found a few, but I
>>> only saw them using JavaRDD... and not DataFrame.
>>>
>>> I was kind of hoping to take my current DataFrame and send them in
>>> MLlib. Am I too optimistic? Do you know/have any example like that?
>>>
>>> Thanks!
>>>
>>> jg
>>>
>>>
>>> Jean Georges Perrin
>>> j...@jgp.net / @jgperrin
>>>
>>>
>>>
>>>
>>>
>>
>>
>
