BTW, my friend has sent in a PR[1] to add support for intersection in
Postgres. Could you help review the PR? Where are the unit tests for it
supposed to go?

[1] https://github.com/apache/incubator-datasketches-postgresql/pull/26


Evans Ye <[email protected]> wrote on Thu, Jul 9, 2020 at 1:36 AM:

> The sizing consideration sounds reasonable. Thanks for the explanation.
>
> The scenario is simply to use Spark to build the sketch summaries and
> write the summaries into Postgres. Afterwards the estimation can be done
> solely in Postgres. Since the write (Spark) and read (Postgres) solutions
> are not symmetric, the data format needs to be aligned.
>
> I think a note somewhere in the docs telling users that the input/output
> should be base64-encoded is enough.
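To make the alignment concrete, here is a minimal sketch of the encoding step on the writer side. The helper name and the stand-in payload are illustrative, not from the thread; the only point is that the bytes serialized in Spark must be base64-encoded as text before being written to Postgres, so the extension can decode them back losslessly.

```python
import base64

def encode_for_postgres(sketch_bytes: bytes) -> str:
    # Base64-encode a serialized sketch so the PostgreSQL extension,
    # which expects base64 text on input/output, can decode it.
    return base64.b64encode(sketch_bytes).decode("ascii")

# Stand-in for real serialized sketch bytes:
payload = b"\x02\x01\x07\x0c\x03\x08"
encoded = encode_for_postgres(payload)

# The round trip must be lossless for estimation to work in Postgres:
assert base64.b64decode(encoded) == payload
```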
>
> Alexander Saydakov <[email protected]> wrote on Tue, Jul 7, 2020 at 3:04 AM:
>
>> Could you clarify how exactly your friend does data transfer between
>> Spark and PostgreSQL?
>> My understanding is that the data can be exported from Spark as a file,
>> which usually is in a printable text format. Therefore some sort of
>> encoding of binary data is needed. I am not familiar with the HLL extension
>> for PostgreSQL you are referring to. It seems to me from a quick glance
>> that they encode binary data as hexadecimals \xHHHH.. (looking at their
>> test csv files). Therefore each byte is encoded by two characters
>> effectively doubling the size. Base64 is a much more efficient way of
>> encoding binary data as printable text, with an expansion ratio of 3-to-4
>> (as opposed to 1-to-2).
>>
>> All these approaches are debatable, of course. In our experience, base64
>> is used widely in such cases. For example, in our production systems at
>> Verizon Media sketches are often prepared on Hadoop clusters using Pig or
>> Hive, then exported in base64, and imported into Druid.
>>
>> Our documentation certainly needs improvement. Thanks for bringing this
>> to our attention.
>>
>> Let us know what your expectations and practices are with respect to
>> importing and exporting data. We will see if any changes are needed on our
>> side.
>>
>>
>> On Mon, Jul 6, 2020 at 2:48 AM Evans Ye <[email protected]> wrote:
>>
>>> Hi team,
>>>
>>> This is a newbie question.
>>> One of my friends in Taiwan is using Spark to write DataSketches to
>>> Postgres. When it came to estimation he got a data corruption error, and
>>> then realized that the summary written to Postgres should be
>>> base64-encoded to comply with the format.
>>>
>>>
>>> https://github.com/apache/incubator-datasketches-postgresql/blob/3b553ef4dc7d2c988c41ab56695c5b082d3ce308/src/common.c#L37-L60
>>>
>>> He found the other Postgres implementation of HLL does not do base64
>>> though[1].
>>>
>>> I just want to learn what the considerations were for doing base64.
>>> Is it a convention that should be easy to infer, or should we document
>>> it?
>>>
>>> Evans
>>>
>>> [1]
>>> https://github.com/citusdata/postgresql-hll?fbclid=IwAR3GP2xgdCOsESuKRsqU4mJ7oeE7p-CPGrgeVUODRwVVShiOGBETfz5A4T8
>>>
>>>
>>>
