[
https://issues.apache.org/jira/browse/KUDU-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887740#comment-15887740
]
Igor Zderev commented on KUDU-1603:
-----------------------------------
There is a new problem, guys! =)
I'm creating a DataFrame on Kudu in pyspark:
{code}
from pyspark.sql.functions import from_unixtime, concat_ws, countDistinct, desc

df = kudu_df.select(
    from_unixtime(kudu_df['sa_add_date'] / 1000, 'yyyy-MM-dd').alias('date'),
    concat_ws('—', kudu_df['sa_ec'], kudu_df['sa_ea'],
              kudu_df['sa_el']).alias('dimension'),
    'sa_cid',
    'sa_uid',
).filter("""
    sa_add_date >= unix_timestamp('2017-02-26', 'yyyy-MM-dd') * 1000
    and (
        sa_ec = 'Auth.Signin'
        or sa_ec = 'Auth.Email'
        or sa_ec = 'Auth.Signup'
        or sa_ec = 'Auth.Sms'
        or sa_ec = 'Auth.Password'
    )
""")

result = (df
          .groupBy('date', 'dimension')
          .agg(countDistinct('sa_cid').alias('cookies'),
               countDistinct('sa_uid').alias('wallets'))
          .orderBy(desc('cookies')))
{code}
But when I execute result.show(), I get strange, non-deterministic results. In
one case the DataFrame is empty. After rerunning the script (with a kernel
restart), I may get a count lower than it should be, or an empty result again.
I'm comparing against the result of the same query in Impala. Is this a problem
with my code, or is it a connector / Spark problem? I'm desperate. Is there
anything that can be done about this?
Thank you for your help!
> Pyspark Integration
> -------------------
>
> Key: KUDU-1603
> URL: https://issues.apache.org/jira/browse/KUDU-1603
> Project: Kudu
> Issue Type: New Feature
> Components: integration, python, spark
> Reporter: Jordan Birdsell
> Labels: features
>
> Now that integration with the Spark Scala/Java API has occurred, work can
> begin on exposing this to Python and integrating with pyspark. For use cases
> like data science, this would likely be a more desirable Python interface to
> Kudu than the current Python client.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)