All,
Here is an example .parquet data set saved using PySpark. The directory
“foo.parquet” contains the following files:
-rw-r--r-- 1 sasbpb r&d       8 Mar 26 12:10 ._SUCCESS.crc
-rw-r--r-- 1 sasbpb r&d   25632 Mar 26 12:10 .part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r&d   25356 Mar 26 12:10 .part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r&d   26300 Mar 26 12:10 .part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r&d   23728 Mar 26 12:10 .part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r&d       0 Mar 26 12:10 _SUCCESS
-rw-r--r-- 1 sasbpb r&d 3279617 Mar 26 12:10 part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r-- 1 sasbpb r&d 3244105 Mar 26 12:10 part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r-- 1 sasbpb r&d 3365039 Mar 26 12:10 part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r-- 1 sasbpb r&d 3035960 Mar 26 12:10 part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
Questions:
1. Is this directory layout the “standard” for creating/saving a .parquet data set?
2. It appears that “b84abe50-a92b-4b2b-b011-30990891fb83” is a UUID. Is the
format:
part-fileSeq#-UUID-c000.parquet or .part-fileSeq#-UUID-c000.parquet.crc
an established convention? Is this documented somewhere?
3. Is there a C++ class to create the .crc files?
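To make question 2 concrete, here is a minimal Python sketch of the naming pattern as I read it from the listing above. The regular expression, the function name parse_part_name, and the reading of "-c000" as "c" followed by digits are my own assumptions inferred from these ten file names, not a documented convention.

```python
import re
import uuid

# Assumed pattern, inferred only from the directory listing above:
# optional leading dot, "part-", 5-digit sequence number, UUID,
# a "-c000"-style suffix, ".parquet", and an optional ".crc".
PART_RE = re.compile(
    r"^\.?part-(?P<seq>\d{5})-"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
    r"-(?P<suffix>c\d+)\.parquet(?P<crc>\.crc)?$"
)


def parse_part_name(name):
    """Split a part file name into its pieces, or return None if it does not match."""
    m = PART_RE.match(name)
    if m is None:
        return None
    return {
        "seq": int(m.group("seq")),            # file sequence number
        "uuid": uuid.UUID(m.group("uuid")),    # shared by all files in the directory
        "suffix": m.group("suffix"),           # e.g. "c000"; meaning unknown to me
        "is_crc": m.group("crc") is not None,  # True for the checksum side files
    }


if __name__ == "__main__":
    print(parse_part_name(
        "part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet"))
```

Names such as _SUCCESS and ._SUCCESS.crc do not match this pattern and return None, so they would need separate handling.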
Thanks,
Brian