RE: Parquet File Naming Convention Standards

Lee, David Wed, 22 May 2019 10:22:02 -0700

I'm not a big fan of this convention, which is a Spark convention.

A. The files should have at least "foo" in the name. Using PyArrow, I would 
create these files as foo.1.parquet, foo.2.parquet, etc.
B. These files are around 3 MB each. For HDFS storage, files should be sized 
to match the HDFS block size, which is usually set to 128 MB (the default), 
256 MB, 512 MB, 1 GB, etc.

https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
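
To make the sizing point concrete, here is a rough sketch of how small files 
could be grouped into batches that each stay under one HDFS block. The 
function name plan_batches and the 128 MB default are assumptions for 
illustration. The byte counts are taken from the directory listing in the 
original message below. The size of a merged Parquet file will not be exactly 
the sum of its inputs, so treat the result as an estimate.

```python
def plan_batches(file_sizes, block_size=128 * 1024 * 1024):
    """Greedily group (name, size_in_bytes) pairs into batches.

    Each batch holds as many consecutive files as fit in block_size bytes.
    A single file larger than block_size ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Start a new batch when adding this file would exceed the block size.
        if current and current_size + size > block_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Sizes of the four part files from the listing in the original message.
parts = [
    ("part-00000", 3279617),
    ("part-00001", 3244105),
    ("part-00002", 3365039),
    ("part-00003", 3035960),
]
batches = plan_batches(parts)
```

With these four files everything lands in a single batch, since together they 
are only about 12 MB.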

I usually take small parquet files and save them as parquet row groups in a 
larger parquet file to match the HDFS blocksize.

-----Original Message-----
From: Brian Bowman <[email protected]> 
Sent: Wednesday, May 22, 2019 8:40 AM
To: [email protected]
Subject: Parquet File Naming Convention Standards 

All,

Here is an example .parquet data set saved using PySpark, where the following 
files are members of the directory “foo.parquet”:

-rw-r--r--    1 sasbpb  r&d        8 Mar 26 12:10 ._SUCCESS.crc
-rw-r--r--    1 sasbpb  r&d    25632 Mar 26 12:10 .part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--    1 sasbpb  r&d    25356 Mar 26 12:10 .part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--    1 sasbpb  r&d    26300 Mar 26 12:10 .part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--    1 sasbpb  r&d    23728 Mar 26 12:10 .part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--    1 sasbpb  r&d        0 Mar 26 12:10 _SUCCESS
-rw-r--r--    1 sasbpb  r&d  3279617 Mar 26 12:10 part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--    1 sasbpb  r&d  3244105 Mar 26 12:10 part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--    1 sasbpb  r&d  3365039 Mar 26 12:10 part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--    1 sasbpb  r&d  3035960 Mar 26 12:10 part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet

Questions:

  1.  Is this the “standard” for creating/saving a .parquet data set?
  2.  It appears that “b84abe50-a92b-4b2b-b011-30990891fb83” is a UUID.  Is the 
      format part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc an 
      established convention?  Is this documented somewhere?
  3.  Is there a C++ class to create the CRC?

Thanks,

Brian

