Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2715#discussion_r228054678
--- Diff: integration/spark-common-test/src/test/scala/org/apache/carbondata/integration/spark/testsuite/dataload/TestLoadDataWithCompression.scala ---
@@ -42,6 +44,112 @@ case class Rcd(booleanField: Boolean, shortField:
Short, intField: Int, bigintFi
dateField: String, charField: String, floatField: Float,
stringDictField: String,
stringSortField: String, stringLocalDictField: String,
longStringField: String)
+/**
+ * This compressor actually will not compress or decompress anything.
+ * It is used for test case of specifying customized compressor.
+ */
+class CustomizeCompressor extends Compressor {
+  override def getName: String =
+    "org.apache.carbondata.integration.spark.testsuite.dataload.CustomizeCompressor"
--- End diff --
@ravipesala
Hi, I've communicated with @KanakaKumar and finally figured out the
background of your proposal.
For the carbon spark datasource (fileformat), we can use "using carbon" when
creating a table instead of specifying the whole class name of
SparkCarbonDataSource. So here you want the compression codec to work the
same way -- using the short name instead of the whole class name. But there
are some underlying issues that prevent us from doing that -- my main concern
is that we cannot resolve the real class name from the short name alone.
1. In the carbon spark-datasource module, we have a file named
'org.apache.spark.sql.sources.DataSourceRegister' with the content
'org.apache.spark.sql.carbondata.execution.datasources.SparkCarbonFileFormat'
under 'resources/META-INF/services'. Spark discovers this file and loads the
class, and from it gets both the short name and the whole class name of the
carbon datasource.
    If our compression codec were to be implemented like this, we would have
to provide extra property files, and carbon would register the codecs after
loading and reading those files.
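For illustration only, such an extra property file might look like the
fragment below. The file path and the interface/implementation class names
here are hypothetical stand-ins, not names confirmed by this PR:

```
# resources/META-INF/services/org.apache.carbondata.core.datastore.compression.Compressor
# one fully qualified implementation class name per line
org.apache.carbondata.integration.spark.testsuite.dataload.CustomizeCompressor
```

Carbon would read each line at startup and register the codec's short name against its full class name.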
2. There is another way to do this: java provides a mechanism
(java.util.ServiceLoader) to find all implementations of a specific
interface. We could use it to find all compression codecs and get their
short names as well as their whole class names. But the problem is that this
procedure may take too long, especially since spark now has so many jars on
the classpath.
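A minimal sketch of option 2. This uses `java.nio.file.spi.FileSystemProvider` only because the JDK ships discoverable implementations of it; a real codec registry would load implementations of carbon's `Compressor` interface instead, and this is a hedged illustration rather than carbon's actual code:

```scala
import java.util.ServiceLoader
import scala.jdk.CollectionConverters._

object InterfaceScan {
  // Discover every registered implementation of an interface on the
  // classpath/modulepath and return their fully qualified class names --
  // exactly the short-name -> class-name information a registry would need.
  def discover(): List[String] = {
    val loader = ServiceLoader.load(classOf[java.nio.file.spi.FileSystemProvider])
    loader.asScala.map(_.getClass.getName).toList
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    // The cost of this scan grows with the number of jars on the classpath,
    // which is the performance concern raised above.
    InterfaceScan.discover().foreach(println)
  }
}
```

The scan itself is what gets expensive: ServiceLoader has to consult every jar's service metadata before the full list is known.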
In conclusion, I think it's OK to accept the current implementation, which
uses the whole class name of the compression codec. Actually, spark
implements it this way too.
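The accepted approach can be sketched as follows. `Compressor` and `CustomizeCompressor` here are simplified stand-ins for the carbon classes, and `CompressorFactory.byClassName` is a hypothetical helper, not carbon's real factory:

```scala
// Stand-in for carbon's Compressor interface (simplified, hypothetical).
trait Compressor { def getName: String }

// Stand-in for a user-supplied codec that reports its own class name.
class CustomizeCompressor extends Compressor {
  override def getName: String = getClass.getName
}

object CompressorFactory {
  // Resolving by full class name needs no registry and no classpath scan:
  // Class.forName locates the configured class directly.
  def byClassName(name: String): Compressor =
    Class.forName(name)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[Compressor]
}
```

A user configures the fully qualified class name once, and lookup is a single reflective instantiation instead of a scan over every jar.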
---