wangyum opened a new pull request #24422: [SPARK-27524][BUILD] Remove the parquet-provided support
URL: https://github.com/apache/spark/pull/24422

## What changes were proposed in this pull request?

The Parquet file format is the default data source to use in input/output. The `parquet-provided` profile will be confusing for end users:

1. Build Spark with `parquet-provided`:

```
./dev/make-distribution.sh --name 2.7 --tgz -Phadoop-2.7 -Phive -Pparquet-provided
```

2. Save the ML model:

```java
scala> model.save("/tmp/spark/w2v")
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat could not be instantiated
  at java.util.ServiceLoader.fail(ServiceLoader.java:232)
  at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
  at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250)
  at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248)
  at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
  at scala.collection.TraversableLike.filter(TraversableLike.scala:262)
  at scala.collection.TraversableLike.filter$(TraversableLike.scala:262)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:632)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:607)
  at org.apache.spark.ml.feature.Word2VecModel$Word2VecModelWriter.saveImpl(Word2Vec.scala:352)
  at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)
  at org.apache.spark.ml.util.MLWritable.save(ReadWrite.scala:287)
  at org.apache.spark.ml.util.MLWritable.save$(ReadWrite.scala:287)
  at org.apache.spark.ml.feature.Word2VecModel.save(Word2Vec.scala:210)
  ... 47 elided
Caused by: java.lang.NoClassDefFoundError: org/apache/parquet/hadoop/ParquetOutputFormat$JobSummaryLevel
  at java.lang.Class.getDeclaredConstructors0(Native Method)
  at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
  at java.lang.Class.getConstructor0(Class.java:3075)
  at java.lang.Class.newInstance(Class.java:412)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
  ... 71 more
Caused by: java.lang.ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputFormat$JobSummaryLevel
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 76 more
```

The end users will be confused about the relationship between Parquet and ML models.

## How was this patch tested?

manual tests
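
For anyone trying to reproduce this by hand: the `NoClassDefFoundError` in the trace above means the Parquet classes were not on the classpath when `ServiceLoader` tried to instantiate `ParquetFileFormat`. The sketch below is one possible way to probe for that condition up front. It is not part of this patch; `ParquetClasspathCheck` and `missingClasses` are hypothetical names chosen for illustration, and the only class names used are the ones that appear in the trace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark: reports which classes cannot be loaded.
class ParquetClasspathCheck {

    // Returns the subset of classNames that the given loader cannot resolve.
    static List<String> missingClasses(ClassLoader loader, String... classNames) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: resolve the class without running static initializers.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                // NoClassDefFoundError is a LinkageError, so it is covered here too.
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace; '$' is the binary name of a nested class.
        List<String> missing = missingClasses(
            ParquetClasspathCheck.class.getClassLoader(),
            "org.apache.parquet.hadoop.ParquetOutputFormat",
            "org.apache.parquet.hadoop.ParquetOutputFormat$JobSummaryLevel");
        if (missing.isEmpty()) {
            System.out.println("Parquet classes found on the classpath");
        } else {
            System.out.println("Missing Parquet classes: " + missing);
        }
    }
}
```

Run with the same classpath as the Spark distribution under test. A non-empty list would point to the same root cause as the trace, independent of which ML model is being saved.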
