Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/146#issuecomment-38093365
Hey Michael, the new approach looks quite good to me. I noticed a few more
packaging changes that we should make, but maybe it's okay to push some of
these after merging the initial PR:
# There seem to be some examples in the core package (e.g.
http://people.apache.org/%7Epwendell/catalyst-docs-03-18/api/sql/core/#org.apache.spark.sql.examples.SchemaRddExample$)
-- these should go in `examples`
# The docs still say loadFile and writeToFile instead of parquetFile and
saveAsParquetFile, and don't show the new way of creating schema RDDs
# Some filenames don't match the class inside, e.g. SparkSQLContext. Some
are also lowercase, e.g. generators.scala -- if a file holds one class plus many
small subclasses, you can call it Generator.scala and keep the subclasses there.
Or move them to different files; it's not a big deal.
# The POMs say `<url>http://spark-project.org/</url>` instead of
`spark.apache.org` -- maybe this was copied from an old POM that is also wrong
# One important code style comment: we don't use relative package names in
Spark (e.g. `import org.apache.spark` followed by `import catalyst`). This
pattern appears in many files.
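To illustrate the style rule in that last point, here is a small sketch using a made-up `demo` package tree (none of these names come from the PR):

```scala
// Sketch of relative vs. fully qualified imports in a made-up package tree.
package demo {
  package util {
    object Helper { val answer = 42 }
  }
  package bad {
    // Relative import: `util` resolves against the enclosing `demo` package.
    // It compiles, but readers can't tell at a glance where Helper lives.
    import util.Helper
    object RelativeStyle { def value: Int = Helper.answer }
  }
  package good {
    // Fully qualified import, as Spark style requires. `_root_` pins the
    // path to the top-level package, ruling out any relative resolution.
    import _root_.demo.util.Helper
    object AbsoluteStyle { def value: Int = Helper.answer }
  }
}
```

Both forms behave the same; the absolute form just makes the origin of every imported name obvious when reading a file in isolation.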
Regarding the case-sensitive keywords, apparently you can use a regex
instead of a string literal to match them case-insensitively:
http://stackoverflow.com/questions/6080437/case-insensitive-scala-parser-combinator.
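A minimal sketch of that trick, assuming the `scala.util.parsing.combinator` library is available (the parser and names here are illustrative, not the actual Catalyst parser):

```scala
import scala.util.parsing.combinator.RegexParsers

// A "(?i)" prefix makes the pattern case-insensitive, so SELECT, select,
// and Select all match a single keyword production.
object KeywordParser extends RegexParsers {
  def SELECT: Parser[String] = "(?i)select".r ^^ (_.toUpperCase)

  // Returns true if the input parses as the SELECT keyword.
  def accepts(input: String): Boolean = parse(SELECT, input).successful
}
```

With a plain string literal (`"select"`) only that exact spelling would match; the regex handles every casing in one production without enumerating variants.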
Regarding the transient SQLContext in spark-shell, do you know what's
bringing it in? If it doesn't get used in the actual computation, maybe we can
just make it Serializable. I'm surprised this happens because SparkContext, for
example, is not Serializable and does not get pulled in.
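For reference, the other common fix is marking the field `@transient` so Java serialization skips it. A sketch with stand-in classes (`Context` plays the role of SQLContext; this is not the actual spark-shell code):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  ObjectInputStream, ObjectOutputStream}

// Context stands in for SQLContext: it is not Serializable.
class Context

// Driver stands in for an object captured by a shipped closure.
class Driver extends Serializable {
  @transient val ctx = new Context // skipped by Java serialization
  val data = Seq(1, 2, 3)          // the part actually used in computation
}

// Round-trip through Java serialization, as happens when closures are shipped.
def roundTrip(d: Driver): Driver = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(d)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[Driver]
}
```

Without `@transient`, `writeObject` would throw NotSerializableException because of `ctx`; with it, serialization succeeds and `ctx` simply comes back null on the other side, which is fine as long as it isn't used in the computation.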