Re: Hive Pulsar Integration

2019-04-15 Thread gmail
I already have a simple implementation that can write and query data.
I have read the design document and the implementation of the Kafka handler.
Its approach to table partitioning differs from what I have in mind.

I want Hive table partition locations to map to Pulsar topics, so that
different table partitions correspond to different topics.
But I can't find out which partition the data will be written to.
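
To make the idea concrete, here is a minimal sketch of the mapping I have in
mind. It assumes the partition spec of a row (partition column to value) is
already known, which is exactly the part I cannot get from Hive yet. The class
name PartitionTopicMapper, the base topic, and the naming scheme are all
hypothetical and are not taken from the Hive or Pulsar code.

```java
import java.util.Map;

/** Hypothetical helper: derives one Pulsar topic name per Hive partition. */
final class PartitionTopicMapper {
    // Assumed base topic, e.g. "persistent://public/default/events".
    private final String baseTopic;

    PartitionTopicMapper(String baseTopic) {
        this.baseTopic = baseTopic;
    }

    /** Appends "-<column>_<value>" for each partition column, in map order. */
    String topicFor(Map<String, String> partitionSpec) {
        StringBuilder topic = new StringBuilder(baseTopic);
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            topic.append('-').append(sanitize(e.getKey()))
                 .append('_').append(sanitize(e.getValue()));
        }
        return topic.toString();
    }

    // Replace characters outside a conservative set so the name stays simple.
    private static String sanitize(String s) {
        return s.replaceAll("[^A-Za-z0-9._-]", "_");
    }
}
```

With an insertion-ordered map holding dt=2019-04-15, the mapper appends
"-dt_2019-04-15" to the base topic, so every partition gets its own topic.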

I know the drawback of doing this is that it loses the ordering of the
stream data itself, but it reduces unnecessary data reads when querying.
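
The read-side benefit could look roughly like the sketch below: given the
partition spec behind each topic and the equality filters of a query on
partition columns, only the matching topics need to be read. TopicPruner and
its inputs are hypothetical names for illustration; how the filters would be
obtained from Hive is not covered here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: picks the topics a query actually has to read. */
final class TopicPruner {
    /**
     * Keeps only topics whose partition spec contains every
     * column=value pair in equalityFilters.
     */
    static List<String> prune(Map<String, Map<String, String>> topicToSpec,
                              Map<String, String> equalityFilters) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : topicToSpec.entrySet()) {
            // Map.Entry equality compares both key and value.
            if (e.getValue().entrySet().containsAll(equalityFilters.entrySet())) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }
}
```

An empty filter map selects every topic, which corresponds to a full scan of
the table.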

Best Regards

Penghui
Beijing, China



> On Apr 13, 2019, at 21:43, Jörn Franke wrote:
> 
> I think you need to develop a custom Hive SerDe, a custom Hadoop
> InputFormat, and a custom HiveOutputFormat.



Re: Hive Pulsar Integration

2019-04-13 Thread gmail
Thank you so much.
This is a great help to me.

:)



> On Apr 12, 2019, at 23:46, Slim Bouguerra wrote:
> 
> Hi, great to hear that you want to work on that!
> We have done similar work for Kafka. You can look at the code and design doc;
> they should help guide the Pulsar integration.
> https://github.com/apache/hive/tree/master/kafka-handler
> https://docs.google.com/document/d/1UcXq-rrrc6cBR4MEDLOwazUhGphniJErhrwgrLDa0_I/edit
> 
> Let me know if you have any questions!
> Happy coding!



Hive Pulsar Integration

2019-04-12 Thread gmail
Hi guys,

I have recently been working on integrating Hive and Pulsar, but I have run
into some problems and hope to get help here.

First of all, let me briefly describe the motivation.

Pulsar can serve as an infinite stream that keeps both historical and
streaming data, so we want to use Pulsar as a storage extension for Hive.
This way, Hive can read data in Pulsar naturally and can also write data
into Pulsar.
We benefit from the same data providing both interactive query and
streaming capabilities.

As an improvement, supporting data partitioning can make queries more
efficient (e.g. partitioning by date or any other field).

But:

- How do I get the Hive table's partition definition?
- When a user inserts data into a Hive table, how do I determine which
partition the data should be stored in?
- When a user selects data from a Hive table, how do I determine which
partition the data is in?

If Hive already exposes a mechanism to support this, please show me how to
use it.

Best regards

Penghui
Beijing, China