I merged the pull request into master. Thank you!

On Wed, Jul 8, 2020 at 10:41 AM Evans Ye <[email protected]> wrote:
> BTW, my friend has sent in a PR[1] to add support for intersection in
> Postgres. Could you help review the PR? Where am I supposed to write the
> UT for it?
>
> [1] https://github.com/apache/incubator-datasketches-postgresql/pull/26
>
> Evans Ye <[email protected]> wrote on Thu, Jul 9, 2020 at 1:36 AM:
>
>> The sizing consideration sounds reasonable. Thanks for the explanation.
>>
>> The scenario is simply to use Spark to build the sketch summaries and
>> write the summaries into Postgres. Afterwards the estimation can be done
>> solely in Postgres. Since the write (Spark) and read (Postgres) sides are
>> not symmetric, the data format needs to be aligned.
>>
>> I think a note somewhere in the doc telling the user that the in/out
>> should be base64-encoded is enough.
>>
>> Alexander Saydakov <[email protected]> wrote on Tue, Jul 7, 2020 at 3:04 AM:
>>
>>> Could you clarify how exactly your friend does data transfer between
>>> Spark and PostgreSQL?
>>> My understanding is that the data can be exported from Spark as a file,
>>> which usually is in a printable text format. Therefore some sort of
>>> encoding of binary data is needed. I am not familiar with the HLL
>>> extension for PostgreSQL you are referring to. It seems from a quick
>>> glance that they encode binary data as hexadecimals \xHHHH.. (looking
>>> at their test csv files). Therefore each byte is encoded by two
>>> characters, effectively doubling the size. Base64 is a much more
>>> efficient way of encoding binary data as printable text, with an
>>> expansion ratio of 3-to-4 (as opposed to 1-to-2).
>>>
>>> All these approaches are debatable, of course. In our experience,
>>> base64 is used widely in such cases. For example, in our production
>>> systems at Verizon Media sketches are often prepared on Hadoop clusters
>>> using Pig or Hive, then exported in base64, and imported into Druid.
>>>
>>> Our documentation certainly needs improvement. Thanks for bringing this
>>> to our attention.
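[Editor's note: the 1-to-2 hex versus 3-to-4 base64 expansion described
above can be verified with a short stdlib-only snippet. The 16-byte
payload is a placeholder, not a real sketch.]

```python
import base64
import binascii

# A mock 16-byte binary payload standing in for a serialized sketch.
raw = bytes(range(16))

# Hex encoding (the \xHHHH.. style): 2 characters per byte.
hex_text = binascii.hexlify(raw).decode("ascii")

# Base64 encoding: every 3 bytes become 4 characters (padded to a
# multiple of 4).
b64_text = base64.b64encode(raw).decode("ascii")

print(len(raw), len(hex_text), len(b64_text))  # 16 32 24
```

For 16 bytes, hex needs 32 characters while base64 needs only 24, which
is why base64 is the more compact choice for moving binary sketches
through text-format exports.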
>>> Let us know what your expectations and practices are with respect to
>>> importing and exporting data. We will see if any changes are needed on
>>> our side.
>>>
>>> On Mon, Jul 6, 2020 at 2:48 AM Evans Ye <[email protected]> wrote:
>>>
>>>> Hi team,
>>>>
>>>> This is a newbie question.
>>>> One of my friends in Taiwan is using Spark to write DataSketches to
>>>> Postgres. When it came to estimation he got a data corruption error,
>>>> and then realized that the summary written to Postgres should be
>>>> base64 encoded to comply with the format.
>>>>
>>>> https://github.com/apache/incubator-datasketches-postgresql/blob/3b553ef4dc7d2c988c41ab56695c5b082d3ce308/src/common.c#L37-L60
>>>>
>>>> He found that the other Postgres implementation of HLL does not do
>>>> base64, though[1].
>>>>
>>>> I just want to learn what the considerations are for doing base64.
>>>> Is it a convention that should be easy to infer, or should we
>>>> document it?
>>>>
>>>> Evans
>>>>
>>>> [1]
>>>> https://github.com/citusdata/postgresql-hll?fbclid=IwAR3GP2xgdCOsESuKRsqU4mJ7oeE7p-CPGrgeVUODRwVVShiOGBETfz5A4T8
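[Editor's note: the workflow discussed in the thread — serialize on the
Spark side, base64-encode, store as text in Postgres, decode on read —
can be sketched as below. The sketch bytes and the table/column names
are illustrative placeholders; real bytes would come from a
DataSketches serializer such as toByteArray().]

```python
import base64

# Hypothetical serialized sketch bytes produced on the Spark side.
sketch_bytes = b"\x02\x01\x11\x08\x00\x00\x00\x00"

# Encode as base64 text so the value survives a text-format transfer
# (e.g. a CSV export) and matches what the PostgreSQL extension's
# in/out functions expect.
encoded = base64.b64encode(sketch_bytes).decode("ascii")

# A parameterized INSERT like the following (names are illustrative)
# would then pass `encoded` as the text value:
#   INSERT INTO sketches (id, sketch) VALUES (%s, %s);

# On the read side, decoding recovers the original bytes exactly.
assert base64.b64decode(encoded) == sketch_bytes
```

Because base64 output is plain printable ASCII, the same string can pass
unmodified through CSV files, COPY statements, or any text-typed column,
which is the symmetry the thread says the two sides need to agree on.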
