> On Jan 17, 2018, at 1:38 AM, xm_zzc <441586...@qq.com> wrote:
> 
> Hi dev:
>  Currently CarbonData 1.3 (to be released soon) only supports integration
> with Spark Structured Streaming, which requires Kafka version >= 0.10. I
> think there are still many users integrating Spark Streaming with Kafka 0.8
> (our cluster is one of them), and the cost of upgrading Kafka is too high.
> So should CarbonData integrate with Spark Streaming too?
> 
>  I think there are two ways to integrate with Spark Streaming, as follows:
>  1). CarbonData batch data loading + auto compaction
>  Use CarbonSession.createDataFrame to convert the RDD to a DataFrame inside
> InputDStream.foreachRDD, then save the data into a CarbonData table that has
> auto compaction enabled. This way it is also possible to create pre-aggregate
> tables on the main table (a streaming table does not support pre-aggregate
> tables).
> 
>  I can test this approach in our QA environment and add an example to CarbonData.

This approach is doable, but the loading interval should be relatively long,
since this approach still writes columnar files directly. How frequently do you
plan to run one batch load?
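
A minimal sketch of option 1 (one batch load per micro-batch via foreachRDD), assuming
Spark 2.x with a CarbonSession and a pre-created table carbon_sink(name STRING, age INT)
that has auto compaction enabled; the table name, schema, and writer options below are
illustrative only:

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.streaming.dstream.DStream

def loadEachBatch(spark: SparkSession, stream: DStream[(String, Int)]): Unit = {
  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      import spark.implicits._
      // Convert the micro-batch RDD into a DataFrame (CarbonSession.createDataFrame
      // serves the same purpose when a CarbonSession is in use).
      val df = rdd.toDF("name", "age")
      // Each micro-batch becomes one batch load; auto compaction then merges the
      // resulting small segments in the background.
      df.write
        .format("carbondata")
        .option("tableName", "carbon_sink")
        .mode(SaveMode.Append)
        .save()
    }
  }
}

With this approach the micro-batch interval directly controls segment size, which is
why a longer interval (plus auto compaction) matters.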

> 
>  2). The same approach as the Structured Streaming integration
>  In this approach, Structured Streaming appends every mini-batch into a
> stream segment in row format; when the size of the stream segment exceeds
> 'carbon.streaming.segment.max.size', it automatically converts the stream
> segment to a batch segment (column format) at the beginning of the next batch
> and creates a new stream segment to append data to.
>  However, I have no idea how to integrate this with Spark Streaming yet, *any
> suggestions*? 
> 
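
For reference, the existing Structured Streaming write path that option 2 refers to
looks roughly like the following (the source, table name, and trigger interval are
illustrative; see the CarbonData streaming guide for the exact options):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

def startStreamIngest(spark: SparkSession, checkpointDir: String): Unit = {
  // Any streaming source works here; Kafka requires version >= 0.10 with this API.
  val source = spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9099)
    .load()

  // Each mini-batch is appended to the row-format stream segment; handoff to a
  // columnar (batch) segment happens once carbon.streaming.segment.max.size is exceeded.
  val query = source.writeStream
    .format("carbondata")
    .trigger(Trigger.ProcessingTime("5 seconds"))
    .option("checkpointLocation", checkpointDir)
    .option("dbName", "default")
    .option("tableName", "stream_table")
    .start()

  query.awaitTermination()
}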

You can refer to the logic in CarbonAppendableStreamSink.addBatch: basically it
launches a job that appends row-format files to the streaming segment by
invoking CarbonAppendableStreamSink.writeDataFileJob. At the beginning, you can
invoke checkOrHandOffSegment to create the streaming segment.
I think integrating with Spark Streaming is a good feature to have; it enables
more users to use the carbon streaming ingest feature on existing clusters with
older Spark and Kafka versions.
Please feel free to create a JIRA ticket and discuss it in the community.
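
A hypothetical sketch of how a DStream-based ingest loop could mirror that per-batch
logic; checkOrHandOffStreamSegment and appendBatchToStreamSegment below are placeholder
names (not real CarbonData APIs) for delegating to CarbonAppendableStreamSink's
checkOrHandOffSegment and writeDataFileJob respectively:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.dstream.DStream

object DStreamCarbonSinkSketch {

  // Placeholder: would delegate to checkOrHandOffSegment, creating the streaming
  // segment if needed and handing it off to a columnar segment once it exceeds
  // carbon.streaming.segment.max.size.
  def checkOrHandOffStreamSegment(dbName: String, tableName: String): Unit = ???

  // Placeholder: would delegate to writeDataFileJob, appending this batch as
  // row-format files to the current streaming segment.
  def appendBatchToStreamSegment(df: DataFrame, dbName: String, tableName: String): Unit = ???

  def ingest(spark: SparkSession, stream: DStream[(String, Int)]): Unit = {
    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        import spark.implicits._
        val df = rdd.toDF("name", "age")
        // Same order as the Structured Streaming sink: check/hand off first, then append.
        checkOrHandOffStreamSegment("default", "stream_table")
        appendBatchToStreamSegment(df, "default", "stream_table")
      }
    }
  }
}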

