Dear Michael, dear all,

Distinguishing the records that have a match in Mapping from those that
don't is the crucial point.

Record(x: Int, a: String)
Mapping(x: Int, y: Int)

Thus

Record(1, "hello")
Record(2, "bob")
Mapping(2, 5)

yield (2, "bob", 5) on an inner join.
BUT I'm also interested in (1, "hello", null) as there is no counterpart in
mapping (this is the left outer join part)

I need to distinguish records 1 and 2 because of later inserts (case 1,
"hello") or updates (case 2, "bon").
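
For illustration, a minimal sketch of the split I'm after (hypothetical, once
the joined DataFrame from my earlier mail has a nullable column y):

    // rows without a counterpart in Mapping -> later inserts
    val inserts = joinedDF.filter(joinedDF("y").isNull)
    // rows with a counterpart in Mapping -> later updates
    val updates = joinedDF.filter(joinedDF("y").isNotNull)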

Cheers and thanks,

Martin
On 30.07.2015 at 22:58, "Michael Armbrust" <mich...@databricks.com> wrote:
>
> Perhaps I'm missing what you are trying to accomplish, but if you'd like
to avoid the null values, do an inner join instead of an outer join.
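>
> For example (a sketch using your column names; the two-argument join
defaults to an inner join):
>
>     recordDF.join(mappingDF, recordDF("x") === mappingDF("x"))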
>
> Additionally, I'm confused about how the result of
joinedDF.filter(joinedDF("y").isNotNull).show still contains null values
in the column y. This doesn't really have anything to do with nullable,
which is only a hint to the system so that we can avoid null checking when
we know that there are no null values. If you provide the full code, I can
try and see if this is a bug.
>
> On Thu, Jul 30, 2015 at 11:53 AM, Martin Senne <martin.se...@googlemail.com> wrote:
>>
>> Dear Michael, dear all,
>>
>> Motivation:
>>
>> object OtherEntities {
>>
>>   case class Record(x: Int, a: String)
>>   case class Mapping(x: Int, y: Int)
>>
>>   val records = Seq(Record(1, "hello"), Record(2, "bob"))
>>   val mappings = Seq(Mapping(2, 5))
>> }
>>
>> Now I want to perform a left outer join on records and mappings, with the
>> join criterion recordDF("x") === mappingDF("x"); the shorthand for this is
>> leftOuterJoinWithRemovalOfEqualColumn.
>>
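>> For reference, a minimal hypothetical sketch of what such a helper could
>> look like (my actual implementation may differ; it joins on the named
>> column and keeps only one copy of it):
>>
>>     import org.apache.spark.sql.DataFrame
>>
>>     implicit class JoinOps(left: DataFrame) {
>>       def leftOuterJoinWithRemovalOfEqualColumn(right: DataFrame, cn: String): DataFrame = {
>>         val joined = left.join(right, left(cn) === right(cn), "left_outer")
>>         // keep all left columns, plus the right columns except the join column
>>         val cols = left.columns.map(left(_)) ++ right.columns.filter(_ != cn).map(right(_))
>>         joined.select(cols: _*)
>>       }
>>     }
>>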
>> val sqlContext = new SQLContext(sc)
>> // used to implicitly convert an RDD to a DataFrame.
>> import sqlContext.implicits._
>>
>> val recordDF = sc.parallelize(OtherEntities.records, 4).toDF()
>> val mappingDF = sc.parallelize(OtherEntities.mappings, 4).toDF()
>>
>> val joinedDF = recordDF.leftOuterJoinWithRemovalOfEqualColumn(mappingDF, "x")
>>
>> joinedDF.filter(joinedDF("y").isNotNull).show
>>
>>
>> Currently, the output is
>>
>> +-+-----+----+
>> |x|    a|   y|
>> +-+-----+----+
>> |1|hello|null|
>> |2|  bob|   5|
>> +-+-----+----+
>>
>> instead of
>>
>> +-+---+-+
>> |x|  a|y|
>> +-+---+-+
>> |2|bob|5|
>> +-+---+-+
>>
>> The latter output can be achieved by changing nullable=false to
>> nullable=true with the method described in my first post.
>>
>> Thus, I need this schema modification to make outer joins work.
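>>
>> For example (a sketch, using the setNullableStateOfColumn function from my
>> first post further down):
>>
>>     val fixedDF = setNullableStateOfColumn(joinedDF, "y", nullable = true)
>>     fixedDF.filter(fixedDF("y").isNotNull).show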
>>
>> Cheers and thanks,
>>
>> Martin
>>
>>
>>
>> 2015-07-30 20:23 GMT+02:00 Michael Armbrust <mich...@databricks.com>:
>>>
>>> We don't yet update nullability information based on predicates, as we
>>> don't actually leverage this information in many places yet. Why do you
>>> want to update the schema?
>>>
>>> On Thu, Jul 30, 2015 at 11:19 AM, martinibus77 <martin.se...@googlemail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> 1. *Columns in DataFrames can be nullable or not nullable. Given a
>>>> nullable column of Doubles, I can use the following Scala code to filter
>>>> all "non-null" rows:*
>>>>
>>>>   val df = ..... // some code that creates a DataFrame
>>>>   df.filter(df("columnname").isNotNull)
>>>>
>>>> +-+-----+----+
>>>> |x|    a|   y|
>>>> +-+-----+----+
>>>> |1|hello|null|
>>>> |2|  bob|   5|
>>>> +-+-----+----+
>>>>
>>>> root
>>>>  |-- x: integer (nullable = false)
>>>>  |-- a: string (nullable = true)
>>>>  |-- y: integer (nullable = true)
>>>>
>>>> And with the filter expression applied:
>>>> +-+---+-+
>>>> |x|  a|y|
>>>> +-+---+-+
>>>> |2|bob|5|
>>>> +-+---+-+
>>>>
>>>>
>>>> Unfortunately, while this is true for a nullable column (according to
>>>> df.printSchema), it is not true for a column that is not nullable:
>>>>
>>>>
>>>> +-+-----+----+
>>>> |x|    a|   y|
>>>> +-+-----+----+
>>>> |1|hello|null|
>>>> |2|  bob|   5|
>>>> +-+-----+----+
>>>>
>>>> root
>>>>  |-- x: integer (nullable = false)
>>>>  |-- a: string (nullable = true)
>>>>  |-- y: integer (nullable = false)
>>>>
>>>> +-+-----+----+
>>>> |x|    a|   y|
>>>> +-+-----+----+
>>>> |1|hello|null|
>>>> |2|  bob|   5|
>>>> +-+-----+----+
>>>>
>>>> such that the output is not affected by the filter. Is this intended?
>>>>
>>>>
>>>> 2. *What is the cheapest way (in terms of performance) to turn a
>>>> non-nullable column into a nullable column?
>>>> I came up with this:*
>>>>
>>>>   import org.apache.spark.sql.DataFrame
>>>>   import org.apache.spark.sql.types.{StructField, StructType}
>>>>
>>>>   /**
>>>>    * Set whether a column is nullable.
>>>>    * @param df source DataFrame
>>>>    * @param cn the column name to change
>>>>    * @param nullable the flag to set, such that the column is either
>>>>    *                 nullable or not
>>>>    */
>>>>   def setNullableStateOfColumn(df: DataFrame, cn: String,
>>>>                                nullable: Boolean): DataFrame = {
>>>>     val schema = df.schema
>>>>     val newSchema = StructType(schema.map {
>>>>       case StructField(c, t, _, m) if c.equals(cn) =>
>>>>         StructField(c, t, nullable, m)
>>>>       case y: StructField => y
>>>>     })
>>>>     df.sqlContext.createDataFrame(df.rdd, newSchema)
>>>>   }
>>>>
>>>> Is there a cheaper solution?
>>>>
>>>> 3. *Any comments?*
>>>>
>>>> Cheers and thx in advance,
>>>>
>>>> Martin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
