Fwd: CRAN submission SparkR 3.2.0

2021-10-20 Thread Felix Cheung
---------- Forwarded message ---------
From: Gregor Seyer 
Date: Wed, Oct 20, 2021 at 4:42 AM
Subject: Re: CRAN submission SparkR 3.2.0
To: Felix Cheung, CRAN <cran-submissi...@r-project.org>


Thanks,

Please add \value to the .Rd files of exported methods and explain
the functions' results in the documentation. Please describe the
structure of the output (class) and also what the output means. (If a
function does not return a value, please document that too, e.g.
\value{No return value, called for side effects} or similar; a sketch
follows the list below.)
Missing Rd-tags in up to 102 .Rd files, e.g.:
  attach.Rd: \value
  avg.Rd: \value
  between.Rd: \value
  cache.Rd: \value
  cancelJobGroup.Rd: \value
  cast.Rd: \value
  ...
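Since the package is generated with roxygen2 (RoxygenNote 7.1.1 in the
DESCRIPTION below), the usual fix is a @return tag in the R source, which
roxygen2 renders as the Rd \value section. A minimal sketch, using
hypothetical helpers rather than real SparkR functions:

# Hypothetical functions for illustration only; not part of SparkR.

#' Count rows per group in a SparkDataFrame
#'
#' @param sdf a SparkDataFrame.
#' @param col the name of the grouping column.
#' @return A SparkDataFrame with one row per distinct value of \code{col}
#'   and a \code{count} column giving the number of rows in each group.
#' @export
countPerGroup <- function(sdf, col) {
  SparkR::count(SparkR::groupBy(sdf, col))
}

#' Print the schema of a SparkDataFrame to the console
#'
#' @param sdf a SparkDataFrame.
#' @return No return value, called for side effects.
#' @export
printSchemaOnly <- function(sdf) {
  SparkR::printSchema(sdf)
  invisible(NULL)
}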

You have examples for unexported functions, e.g. array_transform() in:
   hashCode.Rd
Please either omit these examples or export the functions.

Warning: Unexecutable code in man/sparkR.session.Rd:
   sparkR.session(spark.master = "yarn", spark.submit.deployMode = "client",:
Warning: Unexecutable code in man/write.stream.Rd:
   partitionBy:
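These warnings usually mean the code inside an \examples section does not
parse as complete R. One common fix is to make the snippet a complete call
and wrap anything that needs a live cluster in \dontrun{}. A sketch with
illustrative argument values only (the truncated originals above come from
the real Rd files):

\examples{
\dontrun{
sparkR.session(
  spark.master = "yarn",
  spark.submit.deployMode = "client",
  spark.executor.memory = "2g"
)
}
}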

Please do not modify the .GlobalEnv. This is not allowed by the CRAN
policies. e.g.: inst/profile/shell.R

Please do not modify the global environment (e.g. by using <<-) in your
functions. This is not allowed by the CRAN policies.  e.g.: R/utils.R
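A common CRAN-safe pattern is to keep mutable state in a package-local
environment instead of .GlobalEnv or <<-. A minimal sketch (names are
illustrative, not SparkR's actual internals):

# State lives in an environment created inside the package namespace
# at load time -- never in .GlobalEnv.
.pkgState <- new.env(parent = emptyenv())

setBackendPort <- function(port) {
  # assign() into the local environment replaces `port <<- ...`
  assign("port", port, envir = .pkgState)
}

getBackendPort <- function() {
  if (exists("port", envir = .pkgState)) {
    get("port", envir = .pkgState)
  } else {
    NULL  # not yet initialized
  }
}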


Additionally:
Have the issues that led to your package being archived been fixed?
Please explain this in the submission comments.


Please fix and resubmit.

Best,
Gregor Seyer

On 19.10.21 at 19:48, CRAN submission wrote:
> [This was generated from CRAN.R-project.org/submit.html]
>
> The following package was uploaded to CRAN:
> ===
>
> Package Information:
> Package: SparkR
> Version: 3.2.0
> Title: R Front End for 'Apache Spark'
> Author(s): Shivaram Venkataraman [aut], Xiangrui Meng [aut], Felix Cheung
>   [aut, cre], The Apache Software Foundation [aut, cph]
> Maintainer: Felix Cheung 
> Depends: R (>= 3.5), methods
> Suggests: knitr, rmarkdown, markdown, testthat, e1071, survival, arrow
>   (>= 1.0.0)
> Description: Provides an R Front end for 'Apache Spark'
>   <https://spark.apache.org>.
> License: Apache License (== 2.0)
>
>
> The maintainer confirms that he or she
> has read and agrees to the CRAN policies.
>
> =
>
> Original content of DESCRIPTION file:
>
> Package: SparkR
> Type: Package
> Version: 3.2.0
> Title: R Front End for 'Apache Spark'
> Description: Provides an R Front end for 'Apache Spark'
>   <https://spark.apache.org>.
> Authors@R: c(person("Shivaram", "Venkataraman", role = "aut",
>  email = "shiva...@cs.berkeley.edu"),
>   person("Xiangrui", "Meng", role = "aut",
>  email = "m...@databricks.com"),
>   person("Felix", "Cheung", role = c("aut", "cre"),
>  email = "felixche...@apache.org"),
>   person(family = "The Apache Software Foundation",
>  role = c("aut", "cph")))
> License: Apache License (== 2.0)
> URL: https://www.apache.org https://spark.apache.org
> BugReports: https://spark.apache.org/contributing.html
> SystemRequirements: Java (>= 8, < 12)
> Depends: R (>= 3.5), methods
> Suggests: knitr, rmarkdown, markdown, testthat, e1071, survival, arrow
>  (>= 1.0.0)
> Collate: 'schema.R' 'generics.R' 'jobj.R' 'column.R' 'group.R' 'RDD.R'
>  'pairRDD.R' 'DataFrame.R' 'SQLContext.R' 'WindowSpec.R'
>  'backend.R' 'broadcast.R' 'catalog.R' 'client.R' 'context.R'
>  'deserialize.R' 'functions.R' 'install.R' 'jvm.R'
>  'mllib_classification.R' 'mllib_clustering.R' 'mllib_fpm.R'
>  'mllib_recommendation.R' 'mllib_regression.R' 'mllib_stat.R'
>  'mllib_tree.R' 'mllib_utils.R' 'serialize.R' 'sparkR.R'
>  'stats.R' 'streaming.R' 'types.R' 'utils.R' 'window.R'
> RoxygenNote: 7.1.1
> VignetteBuilder: knitr
> NeedsCompilation: no
> Encoding: UTF-8
> Packaged: 2021-10-06 13:15:21 UTC; spark-rm
> Author: Shivaram Venkataraman [aut],
>   Xiangrui Meng [aut],
>   Felix Cheung [aut, cre],
>   The Apache Software Foundation [aut, cph]
> Maintainer: Felix Cheung 
>


Re: Random expr in join key not support

2021-10-20 Thread Yingyi Bu
> Do you mean something like this:
> select * from t1 join (select floor(random()*9) + id as x from t2) m on
> t1.id = m.x ?
> Yes, that works.

Yes.

> But that raises another question: these two queries seem semantically
> equivalent, yet we treat them differently: one raises an analysis
> exception, while the other works fine.
> Should we treat them the same way?

They're not semantically equivalent, according to the SQL spec. See page
241 in the SQL-99 spec (http://web.cecs.pdx.edu/~len/sql1999.pdf) - the
general rules for <joined table>. Per those rules, the join's search
condition is applied to each row of the cartesian product of t1 and t2,
so random() in the ON clause is notionally re-evaluated for every
candidate row pair; in the subquery form it is evaluated once per row of
t2, so x is fixed before the join happens.

> Here the purpose of adding a random value to the join key is to resolve
> the data skew problem.

Would you mind briefly elaborating on what you're trying to do to reduce
the skew?
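If you mean the usual salting rewrite, here is a rough sketch in Spark
SQL -- table and column names are illustrative, and it assumes t1 is the
large, skewed side and a 10-way salt:

-- Salt the skewed side with a random bucket, replicate the other side
-- once per bucket, and join on (key, salt) so each hot key spreads
-- across 10 partitions instead of one.
SELECT t1s.*, t2s.*
FROM (
  SELECT *, CAST(floor(rand() * 10) AS INT) AS salt
  FROM t1
) t1s
JOIN (
  SELECT t2.*, s.salt
  FROM t2
  LATERAL VIEW explode(sequence(0, 9)) s AS salt
) t2s
ON t1s.id = t2s.id AND t1s.salt = t2s.salt;

Here rand() is evaluated in a project (the subquery's select list), so
the analyzer accepts it, and the join condition itself stays
deterministic.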

Best,
Yingyi


On Tue, Oct 19, 2021 at 9:07 PM Ye Xianjin  wrote:

> > For that, you can add a table subquery and do it in the select list.
>
> Do you mean something like this:
> select * from t1 join (select floor(random()*9) + id as x from t2) m on
> t1.id = m.x ?
>
> Yes, that works. But that raises another question: these two queries
> seem semantically equivalent, yet we treat them differently: one raises
> an analysis exception, while the other works fine.
> Should we treat them the same way?
>
>
>
>
> Sent from my iPhone
>
> On Oct 20, 2021, at 9:55 AM, Yingyi Bu  wrote:
>
> 
> Per SQL spec, I think your join query can only be run as a NestedLoopJoin
> or CartesianProduct.  See page 241 in SQL-99 (
> http://web.cecs.pdx.edu/~len/sql1999.pdf).
> In other words, it might be a correctness bug in other systems if they run
> your query as a hash join.
>
> > Here the purpose of adding a random value to the join key is to
> > resolve the data skew problem.
>
> For that, you can add a table subquery and do it in the select list.
>
> Best,
> Yingyi
>
>
> On Tue, Oct 19, 2021 at 12:46 AM Lantao Jin  wrote:
>
>> In PostgreSQL and Presto, the query below works well:
>> sql> create table t1 (id int);
>> sql> create table t2 (id int);
>> sql> select * from t1 join t2 on t1.id = floor(random() * 9) + t2.id;
>>
>> But Spark throws "Error in query: nondeterministic expressions are only
>> allowed in Project, Filter, Aggregate or Window". Why doesn't Spark
>> support random expressions in a join condition?
>> Here the purpose of adding a random value to the join key is to resolve
>> the data skew problem.
>>
>> Thanks,
>> Lantao
>>
>