[
https://issues.apache.org/jira/browse/KUDU-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887740#comment-15887740
]
Igor Zderev commented on KUDU-1603:
-----------------------------------
There is a new problem, guys! =)
I'm creating a DataFrame on Kudu in pyspark:
{code}
from pyspark.sql.functions import from_unixtime, concat_ws, countDistinct, desc

df = kudu_df.select(
    from_unixtime(kudu_df['sa_add_date'] / 1000, 'yyyy-MM-dd').alias('date'),
    concat_ws('—', kudu_df['sa_ec'], kudu_df['sa_ea'],
              kudu_df['sa_el']).alias('dimension'),
    'sa_cid',
    'sa_uid',
).filter("""
    sa_add_date >= unix_timestamp('2017-02-26', 'yyyy-MM-dd') * 1000
    and (
        sa_ec = 'Auth.Signin'
        or sa_ec = 'Auth.Email'
        or sa_ec = 'Auth.Signup'
        or sa_ec = 'Auth.Sms'
        or sa_ec = 'Auth.Password'
    )
""")

result = (df
          .groupBy('date', 'dimension')
          .agg(countDistinct('sa_cid').alias('cookies'),
               countDistinct('sa_uid').alias('wallets'))
          .orderBy(desc('cookies')))
{code}
But when I execute result.show(), I get strange, non-deterministic results. In
one case the DataFrame is empty. After rerunning the script (with a kernel
restart), I may get a count lower than it should be, or an empty result again.
I'm comparing against the result of the same query in Impala. Is this a problem
with my code, or is it a connector / Spark problem? I'm desperate. Is there
anything that can be done about this?
Thank you for your help!
> Pyspark Integration
> -------------------
>
> Key: KUDU-1603
> URL: https://issues.apache.org/jira/browse/KUDU-1603
> Project: Kudu
> Issue Type: New Feature
> Components: integration, python, spark
> Reporter: Jordan Birdsell
> Labels: features
>
> Now that integration with the Spark Scala/Java API has occurred, work can
> begin on exposing this to Python and integrating with pyspark. For use cases
> like data science, this would likely be a more desirable Python interface to
> Kudu than the current Python client.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)