Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/4205#issuecomment-71526760
I left an initial pass of comments. I haven't dug into the details yet,
but I have a couple of high-level comments:
- The Python code that creates the Java RDDs contains a lot of duplication,
so it would be nice to see if there's a way to refactor it away. My concern
is mainly future maintainability: I'm worried that the copies will diverge
when people change one without being aware of the others.
- I'd like to avoid repeating the `Java*Like` pattern, since it doesn't
look necessary here and it has caused problems in the past: see
https://issues.scala-lang.org/browse/SI-8905 and
https://issues.apache.org/jira/browse/SPARK-3266.
Now that we're increasingly seeing Spark libraries being written in one JVM
language and used from another (e.g. a Spark library written against the Java
API and called from Scala), it might be nice to try to extend GraphX's Scala
API to expose Java-friendly methods instead of adding a new Java API. This is
a major departure from how we've handled Java APIs up until now, but it might
be a better long-term decision for new code. I think @rxin may be able to
chime in here with more details. GraphX might be a nice context to explore
this idea since it's a much smaller API than Spark as a whole.
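
To illustrate the duplication point, here is a minimal sketch of the kind of refactoring I have in mind. Everything in it is a hypothetical stand-in: `_build_java_rdd`, the two wrapper classes, and the dict used in place of a real JVM object are not taken from this PR, and the real code would go through the Py4J gateway instead. The only idea it shows is that each Python wrapper calls one shared helper rather than carrying its own copy of the construction logic.

```python
# Hypothetical sketch only: these names are illustrative, not the PR's code.

def _build_java_rdd(py_data, java_class_name, *extra_args):
    """Single shared place that describes how the JVM-side RDD is created.

    A plain dict stands in for the JVM object so the sketch runs without Spark.
    """
    return {
        "java_class": java_class_name,
        "source": list(py_data),
        "args": extra_args,
    }


class VertexRDD:
    def __init__(self, data):
        # No construction logic is copied here; the helper owns it.
        self.jrdd = _build_java_rdd(data, "JavaVertexRDD")


class EdgeRDD:
    def __init__(self, data, partition_strategy):
        # Per-class differences are passed as arguments instead of
        # being expressed through a second copy of the code.
        self.jrdd = _build_java_rdd(data, "JavaEdgeRDD", partition_strategy)


vertices = VertexRDD([(1, "a"), (2, "b")])
edges = EdgeRDD([(1, 2, "follows")], "some-strategy")
```

With this shape, a change to how the Java RDD is built only has to be made in the helper, so there are no copies left to drift apart.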