I merged the pull request into master. Thank you!
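For anyone following along, the hex-vs-base64 size difference Alexander describes below is easy to verify with a few lines of Python (a standalone sketch, not tied to the extension's actual code):

```python
import base64
import os

# Stand-in for a serialized sketch: 300 arbitrary bytes.
raw = os.urandom(300)

# PostgreSQL bytea-style hex encoding: 2 characters per byte.
hex_encoded = raw.hex()

# Base64 encoding: 4 characters per 3 bytes.
b64_encoded = base64.b64encode(raw).decode("ascii")

print(len(raw), len(hex_encoded), len(b64_encoded))  # 300 600 400
```

So for the same 300 bytes, hex needs 600 characters while base64 needs only 400, matching the 1-to-2 versus 3-to-4 ratios discussed in the thread.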

On Wed, Jul 8, 2020 at 10:41 AM Evans Ye <[email protected]> wrote:

> BTW, my friend has sent in a PR[1] to add support for intersection in
> Postgres. Could you help review it? Where should the unit tests for it
> go?
>
> [1] https://github.com/apache/incubator-datasketches-postgresql/pull/26
>
>
> On Thu, Jul 9, 2020 at 1:36 AM Evans Ye <[email protected]> wrote:
>
>> The sizing consideration sounds reasonable. Thanks for the explanation.
>>
>> The scenario is simply to use Spark to build the sketch summaries and
>> write them into Postgres; afterwards the estimation can be done solely
>> in Postgres. Since the write (Spark) and read (Postgres) sides are not
>> symmetric, the data formats need to be aligned.
>>
>> I think a note somewhere in the docs telling the user that the in/out
>> data should be base64-encoded is enough.
>>
>> On Tue, Jul 7, 2020 at 3:04 AM Alexander Saydakov <[email protected]>
>> wrote:
>>
>>> Could you clarify how exactly your friend does data transfer between
>>> Spark and PostgreSQL?
>>> My understanding is that the data can be exported from Spark as a file,
>>> which usually is in a printable text format. Therefore some sort of
>>> encoding of binary data is needed. I am not familiar with the HLL extension
>>> for PostgreSQL you are referring to. It seems to me from a quick glance
>>> that they encode binary data as hexadecimals \xHHHH.. (looking at their
>>> test csv files). Therefore each byte is encoded by two characters,
>>> effectively doubling the size. Base64 is a much more efficient way of
>>> encoding binary data as printable text, with an expansion ratio of
>>> 3-to-4 (as opposed to 1-to-2).
>>>
>>> All these approaches are debatable, of course. In our experience, base64
>>> is used widely in such cases. For example, in our production systems at
>>> Verizon Media sketches are often prepared on Hadoop clusters using Pig or
>>> Hive, then exported in base64, and imported into Druid.
>>>
>>> Our documentation certainly needs improvement. Thanks for bringing this
>>> to our attention.
>>>
>>> Let us know what your expectations and practices are with respect to
>>> importing and exporting data. We will see if any changes are needed on our
>>> side.
>>>
>>>
>>> On Mon, Jul 6, 2020 at 2:48 AM Evans Ye <[email protected]> wrote:
>>>
>>>> Hi team,
>>>>
>>>> This is a newbie question.
>>>> One of my friends in Taiwan is using Spark to write DataSketches to
>>>> Postgres. When it came to estimation he got a data corruption error,
>>>> and then realized that the summaries written to Postgres should be
>>>> base64-encoded to comply with the expected format.
>>>>
>>>>
>>>> https://github.com/apache/incubator-datasketches-postgresql/blob/3b553ef4dc7d2c988c41ab56695c5b082d3ce308/src/common.c#L37-L60
>>>>
>>>> He found that another Postgres implementation of HLL does not use
>>>> base64, though[1].
>>>>
>>>> I just want to learn what the considerations are for using base64.
>>>> Is it a convention that should be easy to infer, or should we
>>>> document it?
>>>>
>>>> Evans
>>>>
>>>> [1]
>>>> https://github.com/citusdata/postgresql-hll
>>>>
>>>>
>>>>
