The sizing consideration sounds reasonable. Thanks for the explanation. The scenario is simply to use Spark to build the sketch summaries and write them into Postgres; afterwards, estimation can be done solely in Postgres. Since the write (Spark) and read (Postgres) sides are not symmetric, the data format needs to be aligned.
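To make the alignment concrete, here is a minimal sketch of the round trip (the byte string is a hypothetical stand-in for a sketch's serialized binary image, not real DataSketches output):

```python
import base64

# Hypothetical stand-in for the binary image of a sketch serialized on
# the Spark side (real bytes would come from the sketch's serializer).
sketch_bytes = bytes([0x02, 0x01, 0x07, 0x0C, 0x03, 0x00, 0x00, 0x0A])

# Encode to base64 text before writing the summary column to Postgres...
encoded = base64.b64encode(sketch_bytes).decode("ascii")

# ...and the Postgres side decodes the same text back to the original
# binary image before running the estimation.
decoded = base64.b64decode(encoded)

assert decoded == sketch_bytes  # round trip is lossless
print(encoded)
```

As long as both sides agree on base64 as the text representation, the asymmetry between the writer and the reader disappears.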
I think a note somewhere in the doc telling users that the input/output should be base64-encoded is enough.

Alexander Saydakov <[email protected]> wrote on Tue, Jul 7, 2020 at 3:04 AM:

> Could you clarify how exactly your friend does the data transfer between
> Spark and PostgreSQL?
> My understanding is that the data can be exported from Spark as a file,
> which is usually in a printable text format. Therefore some sort of
> encoding of binary data is needed. I am not familiar with the HLL extension
> for PostgreSQL you are referring to. It seems to me from a quick glance
> that they encode binary data as hexadecimals \xHHHH.. (looking at their
> test CSV files). Therefore each byte is encoded by two characters,
> effectively doubling the size. Base64 is a much more efficient way of
> encoding binary data as printable text, with an expansion ratio of 3-to-4
> (as opposed to 1-to-2).
>
> All these approaches are debatable, of course. In our experience, base64
> is widely used in such cases. For example, in our production systems at
> Verizon Media, sketches are often prepared on Hadoop clusters using Pig or
> Hive, then exported in base64 and imported into Druid.
>
> Our documentation certainly needs improvement. Thanks for bringing this to
> our attention.
>
> Let us know what your expectations and practices are with respect to
> importing and exporting data. We will see if any changes are needed on our
> side.
>
> On Mon, Jul 6, 2020 at 2:48 AM Evans Ye <[email protected]> wrote:
>
>> Hi team,
>>
>> This is a newbie question.
>> A friend of mine in Taiwan is using Spark to write DataSketches to
>> Postgres. When it came to estimation, he got a data corruption error and
>> then realized that the summary written to Postgres should be
>> base64-encoded to comply with the expected format.
>>
>> https://github.com/apache/incubator-datasketches-postgresql/blob/3b553ef4dc7d2c988c41ab56695c5b082d3ce308/src/common.c#L37-L60
>>
>> He found that the other Postgres implementation of HLL does not use
>> base64, though [1].
>>
>> I just want to learn what the considerations are for using base64.
>> Is it a convention that should be easy to infer, or should we document
>> it?
>>
>> Evans
>>
>> [1]
>> https://github.com/citusdata/postgresql-hll
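For reference, the expansion ratios Alexander mentions (2 characters per byte for hex, 4 characters per 3 bytes for base64) are easy to verify; `payload` below is just arbitrary bytes standing in for a sketch image:

```python
import base64
import binascii

payload = bytes(range(48))  # 48 arbitrary bytes; 48 is divisible by 3

hex_text = binascii.hexlify(payload)  # \xHHHH.. style: 2 chars per byte
b64_text = base64.b64encode(payload)  # base64: 4 chars per 3 bytes

assert len(hex_text) == 2 * len(payload)         # 1-to-2 expansion
assert len(b64_text) == 4 * (len(payload) // 3)  # 3-to-4 expansion
print(len(payload), len(hex_text), len(b64_text))  # 48 96 64
```

So for the same binary sketch, the base64 text column is roughly a third smaller than the hex encoding used by the other HLL extension.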
