Hello,


I have been working on a custom compression algorithm for market trading data. 
The data is quite large (petabytes), so reducing storage translates directly 
into visible cost savings. I used ORC as the baseline and extended it with 
custom encoders for different kinds of data. The encoders are not meant as 
replacements for the standard ORC encoders; they are use-case specific and 
exploit known redundancies in the data (e.g. predicting the value of one field 
from the others). For my type of data I was able to achieve a pretty good 
improvement, about 48% over standard ORC.
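To give a rough, simplified illustration of the kind of redundancy I mean 
(the column names here are made up and the real encoders are more involved): 
one field is predicted from a related one, and only the residuals are stored, 
which cluster near zero and compress well with the existing RLE stages.

    // Toy sketch only -- made-up column names, not my actual implementation.
    // Predict one field from a related one and keep only the residuals.
    static long[] encodeResiduals(long[] tradePrice, long[] quoteMid) {
      long[] residual = new long[tradePrice.length];
      for (int i = 0; i < tradePrice.length; i++) {
        // Trade price is usually close to the quote midpoint, so the
        // residual is small and highly compressible.
        residual[i] = tradePrice[i] - quoteMid[i];
      }
      return residual;
    }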
Currently I have to maintain my own fork of the ORC library (Java), which is 
not ideal: every upstream improvement has to be merged into the fork, it is 
hard to integrate with higher-level frameworks such as Spark, and other people 
cannot use my work. My actual goal is to be able to use this codec in 
Databricks.
While looking at the implementation, I thought it would be nice if ORC (Java) 
had a standard extensibility mechanism: based on the column type, and perhaps 
some configuration, one could override the standard "Writer" for certain 
columns. For example, I have an improved "Timestamp" writer that exploits 
patterns in the data (a see-saw pattern), which could be applicable to other 
data sets as well. It would be nice if I could replace the standard writer for 
certain fields without modifying the ORC library, and if other people could 
choose to use my encoder for their own data. Ideally, I could simply load my 
library alongside the default ORC implementation into Spark and have my 
"plugins" or "extensions" automatically discovered by ORC and integrated.
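To make that concrete, here is a very rough sketch of the kind of SPI I am 
imagining. None of these names exist in ORC today; TreeWriter and WriterContext 
are ORC internals and the exact signatures would obviously need discussion. 
Discovery could use the standard java.util.ServiceLoader, so dropping a jar on 
the classpath would be enough.

    import java.util.ServiceLoader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.impl.writer.TreeWriter;
    import org.apache.orc.impl.writer.WriterContext;

    // Hypothetical SPI -- only meant to show the shape of the idea.
    public interface ColumnEncoderProvider {

      // Does this provider want to handle the given column?
      boolean supports(TypeDescription schema, Configuration conf);

      // Build the custom writer for that column.
      TreeWriter createWriter(TypeDescription schema,
                              WriterContext context) throws java.io.IOException;

      // Sketch of what ORC's writer construction could do before falling
      // back to its built-in writers: discover providers on the classpath
      // and let the first one that claims the column win.
      static ColumnEncoderProvider find(TypeDescription schema,
                                        Configuration conf) {
        for (ColumnEncoderProvider p :
                 ServiceLoader.load(ColumnEncoderProvider.class)) {
          if (p.supports(schema, conf)) {
            return p;
          }
        }
        return null;  // nobody claimed the column -> use the standard writer
      }
    }

With something like this, my jar would only need a META-INF/services entry 
naming my provider implementations, and loading it alongside orc-core in 
Spark/Databricks would be enough for my timestamp and predictive encoders to 
kick in for the columns they recognize.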
Has anybody thought about anything similar? Would it work, and would it be 
beneficial? What would be the best way to implement something like that, and 
where would you start?
Thanks,
Denis