Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2715#discussion_r228054678
--- Diff: integration/spark-common-test/src/test/scala/org/apache/carbondata/integration/spark/testsuite/dataload/TestLoadDataWithCompression.scala ---
@@ -42,6 +44,112 @@ case class Rcd(booleanField: Boolean, shortField:
Short, intField: Int, bigintFi
dateField: String, charField: String, floatField: Float,
stringDictField: String,
stringSortField: String, stringLocalDictField: String,
longStringField: String)
+/**
+ * This compressor actually will not compress or decompress anything.
+ * It is used for test case of specifying customized compressor.
+ */
+class CustomizeCompressor extends Compressor {
+  override def getName: String =
+    "org.apache.carbondata.integration.spark.testsuite.dataload.CustomizeCompressor"
--- End diff --
@ravipesala
Hi, I've communicated with @KanakaKumar and finally figured out the
background of your proposal.
For the carbon spark datasource (fileformat), we can use "using carbon" when
creating a table instead of specifying the whole class name of
SparkCarbonDataSource. So here you want the compression codec to work the
same way -- using the short name instead of the whole class name. But there
are some underlying issues that prevent us from doing that -- my main concern
is that we cannot resolve the real class name from the short name alone.
1. In the carbon spark-datasource module, we have a file named
'org.apache.spark.sql.sources.DataSourceRegister' with the content
'org.apache.spark.sql.carbondata.execution.datasources.SparkCarbonFileFormat'
under 'resources/META-INF/services'. Spark discovers this file and loads the
class, and from it gets both the short name and the whole class name of the
carbon datasource.
    If our compression codec were to be implemented like this, we would have
to provide extra property files, and carbon would register the codecs after
loading and reading those files.
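For illustration only, such an extra property file might look like the
fragment below. The file path and the interface/implementation class names
here are hypothetical stand-ins, not names confirmed by this PR:

```
# resources/META-INF/services/org.apache.carbondata.core.datastore.compression.Compressor
# one fully qualified implementation class name per line
org.apache.carbondata.integration.spark.testsuite.dataload.CustomizeCompressor
```

Carbon would read each line at startup and register the codec's short name against its full class name.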
2. There is another way to do this: java provides a mechanism
(java.util.ServiceLoader) to find all implementations of a specific
interface. We could use it to find all compression codecs and get their
short names as well as their whole class names. But the problem is that this
procedure may take too long, especially since spark now has so many jars on
the classpath.
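A minimal sketch of option 2. This uses `java.nio.file.spi.FileSystemProvider` only because the JDK ships discoverable implementations of it; a real codec registry would load implementations of carbon's `Compressor` interface instead, and this is a hedged illustration rather than carbon's actual code:

```scala
import java.util.ServiceLoader
import scala.jdk.CollectionConverters._

object InterfaceScan {
  // Discover every registered implementation of an interface on the
  // classpath/modulepath and return their fully qualified class names --
  // exactly the short-name -> class-name information a registry would need.
  def discover(): List[String] = {
    val loader = ServiceLoader.load(classOf[java.nio.file.spi.FileSystemProvider])
    loader.asScala.map(_.getClass.getName).toList
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    // The cost of this scan grows with the number of jars on the classpath,
    // which is the performance concern raised above.
    InterfaceScan.discover().foreach(println)
  }
}
```

The scan itself is what gets expensive: ServiceLoader has to consult every jar's service metadata before the full list is known.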
In conclusion, I think it's OK to accept the current implementation, which
uses the whole class name of the compression codec. Actually, spark
implements it this way too.
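The accepted approach can be sketched as follows. `Compressor` and `CustomizeCompressor` here are simplified stand-ins for the carbon classes, and `CompressorFactory.byClassName` is a hypothetical helper, not carbon's real factory:

```scala
// Stand-in for carbon's Compressor interface (simplified, hypothetical).
trait Compressor { def getName: String }

// Stand-in for a user-supplied codec that reports its own class name.
class CustomizeCompressor extends Compressor {
  override def getName: String = getClass.getName
}

object CompressorFactory {
  // Resolving by full class name needs no registry and no classpath scan:
  // Class.forName locates the configured class directly.
  def byClassName(name: String): Compressor =
    Class.forName(name)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[Compressor]
}
```

A user configures the fully qualified class name once, and lookup is a single reflective instantiation instead of a scan over every jar.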
---