[
https://issues.apache.org/jira/browse/DATAFU-148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16583146#comment-16583146
]
Matthew Hayes commented on DATAFU-148:
--------------------------------------
Thanks [~eyal] and [~uzadude] for submitting this and setting up the initial
Spark subproject. This looks like a great start. I look forward to seeing
more of the Spark code you have. I reviewed the code and have the following
comments:
In SparkDFUtils.scala:
- {{dedup2}} could use some additional description to differentiate it from {{dedup}}.
- {{flatten}} is missing documentation
- for {{broadcastJoinSkewed}}, the description of the {{numberCustsToBroadcast}}
parameter isn't clear to me
- {{joinWithRange}} could use some more documentation. For example, not all of
the fields are documented. It's not immediately obvious to me what
{{DECREASE_FACTOR}} does or why it should have a default value of 2^8.
- Also, {{joinWithRange}} seems characteristically different from the others in
this file, as it's a bit more use-case specific. Maybe later it would make sense
to move it to a separate file.
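To illustrate the level of parameter documentation being asked for, here is a sketch in scaladoc style. The method name, parameter names, and descriptions below are placeholders for illustration only, not datafu-spark's actual API:

```scala
/**
 * Placeholder illustrating the requested documentation style; not the
 * real joinWithRange signature or semantics.
 *
 * @param dfSingle       dataframe of single points to join
 * @param colSingle      the point column in dfSingle
 * @param dfRange        dataframe of ranges to join the points against
 * @param decreaseFactor granularity used when bucketing ranges; the doc
 *                       should explain why 2^8 is a sensible default
 * @return rows of dfSingle joined to the ranges containing each point
 */
def joinWithRangeExample(dfSingle: DataFrame, colSingle: String,
                         dfRange: DataFrame,
                         decreaseFactor: Int = 256): DataFrame = ???
```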
In build.gradle:
- The download plugin isn't needed.
- Is autojarring necessary? Looking at the contents of the datafu-spark jar, we
only have datafu.spark and org.apache.spark classes. It seems like
org.apache.spark classes shouldn't need to be included. Also, the build.gradle
autojars commonsmath and guava, which aren't used. It seems all of this jarjar
and autojar configuration could be stripped out of the file.
{{flatten}} and {{changeSchema}} should have tests, I think.
A question regarding documentation: people would generally be using these via
{{DataFrameOps}}, so it would probably be helpful to have doc links in those
methods to the underlying implementation. Is the reason {{SparkDFUtils}} is
split out into a separate file so that it can be used in the future by other
methods? By the way, I found out you can generate the docs with the command
below. Before including this in a release it would be good to review the
generated docs and see where they can be improved. For example, the packages
and objects don't have docs.
{code:java}
./gradlew :datafu-spark:scaladoc
{code}
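The forwarding-with-doc-links idea above can be sketched independently of Spark. In the sketch below, {{DataFrame}} is a stub standing in for Spark's DataFrame, and the object and method names are simplified stand-ins for {{SparkDFUtils}} and {{DataFrameOps}}; the real datafu-spark signatures may differ:

```scala
// Stub standing in for Spark's DataFrame, for illustration only.
case class DataFrame(rows: Seq[Map[String, Any]])

// Analogous to SparkDFUtils: the object holding the implementations.
object SparkDFUtilsSketch {
  // Simplified "dedup": keep one row per distinct value of groupCol.
  def dedup(df: DataFrame, groupCol: String): DataFrame =
    DataFrame(df.rows.groupBy(r => r(groupCol)).values.map(_.head).toSeq)
}

// Analogous to DataFrameOps: thin forwarders whose scaladoc can link
// back to the underlying implementation.
object DataFrameOpsSketch {
  implicit class Ops(df: DataFrame) {
    /** Forwards to [[SparkDFUtilsSketch.dedup]]. */
    def dedup(groupCol: String): DataFrame =
      SparkDFUtilsSketch.dedup(df, groupCol)
  }
}

object Demo extends App {
  import DataFrameOpsSketch._
  val df = DataFrame(Seq(
    Map("k" -> 1, "v" -> "a"),
    Map("k" -> 1, "v" -> "b"),
    Map("k" -> 2, "v" -> "c")))
  // The dedup call resolves through the implicit class.
  println(df.dedup("k").rows.size) // 2
}
```

Splitting the implementations into their own object this way keeps {{DataFrameOps}} as a thin layer, so the scaladoc links would point readers at a single source of truth.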
Also, if we were to merge this in somewhere, it should probably go into a new
pending release branch like 2.0.0 so we can continue working on getting it
ready independently of short-term releases. I think this should trigger a major
version bump, since it is a new sub-project, and it gives us the chance to
clean up anything we've deprecated. Thoughts?
> Setup Spark sub-project
> -----------------------
>
> Key: DATAFU-148
> URL: https://issues.apache.org/jira/browse/DATAFU-148
> Project: DataFu
> Issue Type: New Feature
> Reporter: Eyal Allweil
> Assignee: Eyal Allweil
> Priority: Major
>
> Create a skeleton Spark sub project for Spark code to be contributed to DataFu
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)