BTW, my friend has sent a PR [1] to add support for intersection in Postgres. Could you help review the PR? Where are the unit tests for it supposed to go?
[1] https://github.com/apache/incubator-datasketches-postgresql/pull/26

Evans Ye <[email protected]> wrote on Thu, Jul 9, 2020 at 1:36 AM:

> The sizing consideration sounds reasonable. Thanks for the explanation.
>
> The scenario is simply to use Spark to build the sketch summaries and
> write the summaries into Postgres. Afterwards the estimation can be done
> solely in Postgres. Since the write (Spark) and read (Postgres) sides are
> not symmetric, the data format needs to be aligned.
>
> I think a note somewhere in the docs telling the user that the in/out
> should be base64-encoded is enough.
>
> Alexander Saydakov <[email protected]> wrote on Tue, Jul 7, 2020 at 3:04 AM:
>
>> Could you clarify how exactly your friend transfers data between
>> Spark and PostgreSQL?
>> My understanding is that the data can be exported from Spark as a file,
>> which is usually in a printable text format, so some sort of encoding of
>> binary data is needed. I am not familiar with the HLL extension for
>> PostgreSQL you are referring to. From a quick glance it seems that they
>> encode binary data as hexadecimals \xHHHH.. (looking at their test csv
>> files), so each byte is encoded by two characters, effectively doubling
>> the size. Base64 is a much more efficient way of encoding binary data as
>> printable text, with an expansion ratio of 3-to-4 (as opposed to 1-to-2).
>>
>> All these approaches are debatable, of course. In our experience, base64
>> is widely used in such cases. For example, in our production systems at
>> Verizon Media, sketches are often prepared on Hadoop clusters using Pig
>> or Hive, then exported in base64 and imported into Druid.
>>
>> Our documentation certainly needs improvement. Thanks for bringing this
>> to our attention.
>>
>> Let us know what your expectations and practices are with respect to
>> importing and exporting data. We will see if any changes are needed on
>> our side.
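The expansion ratios mentioned above (hex doubles the size, base64 grows it by 4/3) are easy to check directly; here is a quick illustration using arbitrary bytes standing in for a serialized sketch:

```python
import base64

# 30 bytes of arbitrary binary data standing in for a serialized sketch
raw = bytes(range(30))

hex_encoded = raw.hex()              # hex: 2 characters per byte
b64_encoded = base64.b64encode(raw)  # base64: 4 characters per 3 bytes

print(len(raw), len(hex_encoded), len(b64_encoded))  # 30 60 40
```

So for the same 30-byte payload, hex needs 60 printable characters while base64 needs only 40.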
>>
>> On Mon, Jul 6, 2020 at 2:48 AM Evans Ye <[email protected]> wrote:
>>
>>> Hi team,
>>>
>>> This is a newbie question.
>>> One of my friends in Taiwan is using Spark to write DataSketches to
>>> Postgres. When it came to estimation he got a data corruption error,
>>> and then realized that the summary written to Postgres should be
>>> base64-encoded to comply with the format.
>>>
>>> https://github.com/apache/incubator-datasketches-postgresql/blob/3b553ef4dc7d2c988c41ab56695c5b082d3ce308/src/common.c#L37-L60
>>>
>>> He found that the other Postgres implementation of HLL does not use
>>> base64, though [1].
>>>
>>> I just want to understand the considerations for using base64.
>>> Is it a convention that should be easy to infer, or should we
>>> document it?
>>>
>>> Evans
>>>
>>> [1] https://github.com/citusdata/postgresql-hll?fbclid=IwAR3GP2xgdCOsESuKRsqU4mJ7oeE7p-CPGrgeVUODRwVVShiOGBETfz5A4T8
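For the write side described above, a minimal sketch of the fix is to base64-encode the serialized sketch bytes before inserting them as text, so the extension's text-input function (linked common.c) can decode them. The helper name and the sample payload below are hypothetical; only the base64 step reflects what the extension expects:

```python
import base64

def to_pg_text(sketch_bytes: bytes) -> str:
    # Encode the serialized sketch as base64 so the PostgreSQL
    # extension's text-input path can decode it back to binary.
    return base64.b64encode(sketch_bytes).decode("ascii")

# Round-trip check: decoding the text form recovers the original bytes.
payload = b"\x02\x01\x07\x0c\x03\x00"  # hypothetical serialized-sketch bytes
encoded = to_pg_text(payload)
assert base64.b64decode(encoded) == payload
```

The resulting string is what the Spark job would write into the Postgres column, keeping the write (Spark) and read (Postgres) sides aligned.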
