Re: createDataFrame allows column names as second param in Python not in Scala

Reynold Xin Sun, 03 May 2015 19:24:03 -0700

We can't drop the existing createDataFrame, since that would break API
compatibility, and the existing one also automatically infers the column
names for case classes (in that case users most likely won't be declaring
names directly). If this is really a problem, we should just create a new
function (maybe more than one, since you could argue the one for Seq should
also have that ...).
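
For illustration only, here is a minimal sketch of what such a new function
could look like if written in user code. The object name DataFrameNames and
the function name createDataFrameWithNames are hypothetical and are not part
of Spark's API; the sketch only combines the existing createDataFrame(Seq)
with DataFrame.toDF(colNames: String*), and assumes the number of names
matches the number of tuple fields.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import scala.reflect.runtime.universe.TypeTag

// Hypothetical helper object, not part of Spark.
object DataFrameNames {
  // Builds a DataFrame from a local Seq of tuples or case classes,
  // then renames the columns with the supplied names.
  // Assumes columnNames.size equals the number of fields in A.
  def createDataFrameWithNames[A <: Product : TypeTag](
      sqlContext: SQLContext,
      data: Seq[A],
      columnNames: Seq[String]): DataFrame =
    sqlContext.createDataFrame(data).toDF(columnNames: _*)
}
```

In the shell this would be called as
DataFrameNames.createDataFrameWithNames(sqlContext, data, Seq("name", "score")),
which keeps the existing createDataFrame signatures untouched.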




On Sun, May 3, 2015 at 2:13 AM, Olivier Girardot <
[email protected]> wrote:

> I have the perfect counterexample, where some of the data scientists
> prototype in Python and the production material is done in Scala.
> But I get your point; as a matter of fact, I realised that the toDF method
> takes parameters a little while after posting this.
> However, toDF still requires you to go from a List to an RDD, or to create
> a useless DataFrame and call toDF on it, re-creating a complete data
> structure. I just feel that createDataFrame(_: Seq) is not really
> useful as it is, because I think there are practically no circumstances
> where you'd want to create a DataFrame without column names.
>
> I'm not suggesting that an n-th overloaded method be created, but rather
> that the signature of the existing method be changed to take an optional
> Seq of column names.
>
> Regards,
>
> Olivier.
>
> On Sun, May 3, 2015 at 07:44, Reynold Xin <[email protected]> wrote:
>
>> Part of the reason is that it is really easy to just call toDF in Scala,
>> and we already have a lot of createDataFrame functions.
>>
>> (You might find some of the cross-language differences confusing, but I'd
>> argue most real users just stick to one language, and developers or
>> trainers are the only ones that need to constantly switch between
>> languages).
>>
>> On Sat, May 2, 2015 at 11:05 AM, Olivier Girardot <
>> [email protected]> wrote:
>>
>>> Hi everyone,
>>> SQLContext.createDataFrame behaves differently in Scala and Python:
>>>
>>> >>> l = [('Alice', 1)]
>>> >>> sqlContext.createDataFrame(l).collect()
>>> [Row(_1=u'Alice', _2=1)]
>>> >>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
>>> [Row(name=u'Alice', age=1)]
>>>
>>> and in Scala:
>>>
>>> scala> val data = List(("Alice", 1), ("Wonderland", 0))
>>> scala> sqlContext.createDataFrame(data, List("name", "score"))
>>> <console>:28: error: overloaded method value createDataFrame with
>>> alternatives: ... cannot be applied to ...
>>>
>>> What do you think about also allowing a Seq of column names in Scala,
>>> for the sake of consistency?
>>>
>>> Regards,
>>>
>>> Olivier.
>>>
>>
>>
