I'm not a big fan of this convention, which is a Spark convention.

A. The files should have at least "foo" in the name. Using PyArrow, I would create these files as foo.1.parquet, foo.2.parquet, etc.

B. These files are around 3 MB each. For HDFS storage, files should be sized to match the HDFS block size, which is usually 128 MB (the default) but may be set to 256 MB, 512 MB, 1 GB, etc.
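
To make the sizing point concrete, here is a rough sketch of how one might plan which small files to combine so that each combined file stays within one HDFS block. The helper name plan_groups and the 128 MiB block size are assumptions for illustration; the file sizes are the four part files from the listing in the original message. The merged Parquet file will not be exactly the sum of its inputs, so treat the result as an estimate.

```python
# Assumed HDFS block size (128 MiB); check dfs.blocksize on your cluster.
BLOCK_SIZE = 128 * 1024 * 1024


def plan_groups(file_sizes, block_size=BLOCK_SIZE):
    """Greedily pack (name, size_in_bytes) pairs into groups of at most block_size bytes.

    Hypothetical helper, not part of PyArrow or Spark.
    """
    groups, current, current_bytes = [], [], 0
    for name, size in file_sizes:
        # Start a new group when adding this file would overflow the block.
        if current and current_bytes + size > block_size:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


# Sizes taken from the directory listing in the original message.
parts = [
    ("part-00000", 3279617),
    ("part-00001", 3244105),
    ("part-00002", 3365039),
    ("part-00003", 3035960),
]
groups = plan_groups(parts)
total = sum(size for _, size in parts)
print(len(groups), total)  # prints: 1 12924721
```

All four part files together come to about 12.9 MB, so they fit in a single group with plenty of room left in a 128 MB block.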
See https://blog.cloudera.com/blog/2009/02/the-small-files-problem/

I usually take small Parquet files and save them as Parquet row groups in a larger Parquet file to match the HDFS block size.

-----Original Message-----
From: Brian Bowman <[email protected]>
Sent: Wednesday, May 22, 2019 8:40 AM
To: [email protected]
Subject: Parquet File Naming Convention Standards

All,

Here is an example .parquet data set saved using PySpark, where the following files are members of the directory “foo.parquet”:

-rw-r--r-- 1 sasbpb r&d       8 Mar 26 12:10 ._SUCCESS.crc
-rw-r--r-- 1 sasbpb r&d   25632 Mar 26 12:10 .part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r&d   25356 Mar 26 12:10 .part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r&d   26300 Mar 26 12:10 .part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r&d   23728 Mar 26 12:10 .part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r&d       0 Mar 26 12:10 _SUCCESS
-rw-r--r-- 1 sasbpb r&d 3279617 Mar 26 12:10 part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r-- 1 sasbpb r&d 3244105 Mar 26 12:10 part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r-- 1 sasbpb r&d 3365039 Mar 26 12:10 part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r-- 1 sasbpb r&d 3035960 Mar 26 12:10 part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet

Questions:

1. Is this the “standard” for creating/saving a .parquet data set?
2. It appears that “b84abe50-a92b-4b2b-b011-30990891fb83” is a UUID. Is the format part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc an established convention? Is this documented somewhere?
3. Is there a C++ class to create the CRC?

Thanks,

Brian
