Re: Parquet File Naming Convention Standards

Brian Bowman Wed, 22 May 2019 11:55:43 -0700

 Thanks for the info!

HDFS is only one of many storage platforms (distributed or otherwise) that SAS 
supports.  In general, larger physical files (e.g. 100MB to 1GB) with multiple 
RowGroups are also a good thing for our use cases.  I'm working to get our 
Parquet (C to C++ via libparquet.so) writer to do this.
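
A rough sketch of the sizing idea, in Python rather than our C/C++ writer: 
group small part files, in order, so that each combined output file stays at 
or under a target size.  The helper name plan_batches and the 128MB default 
are assumptions for illustration only; the sizes are the four part files from 
the listing quoted below.  Actually merging each group into one file with 
multiple RowGroups would still be done by a Parquet library.

```python
def plan_batches(file_sizes, target_bytes=128 * 1024 * 1024):
    """Greedily group (name, size) pairs, in order, into batches.

    A new batch is started whenever adding the next file would push the
    current batch over target_bytes. A single file larger than the target
    ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Sizes in bytes, copied from the directory listing in the quoted message.
parts = [
    ("part-00000", 3279617),
    ("part-00001", 3244105),
    ("part-00002", 3365039),
    ("part-00003", 3035960),
]

# With a 128MB target, all four ~3MB files fit in a single output file.
batches = plan_batches(parts)
```

On-disk sizes are only a proxy here, since re-encoding and compression can 
change the size of the combined file.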


-Brian

On 5/22/19, 1:21 PM, "Lee, David" <[email protected]> wrote:

    I'm not a big fan of this convention, which is a Spark convention.
    
    A. The files should have at least "foo" in the name. Using PyArrow I would 
    create these files as foo.1.parquet, foo.2.parquet, etc.
    B. These files are around 3 megs each. For HDFS storage, files should be 
    sized to match the HDFS blocksize, which is usually set at 128 megs (default) 
    or 256 megs, 512 megs, 1 gig, etc.
    
    https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
    
    I usually take small parquet files and save them as parquet row groups in a 
    larger parquet file to match the HDFS blocksize.
    
    -----Original Message-----
    From: Brian Bowman <[email protected]>
    Sent: Wednesday, May 22, 2019 8:40 AM
    To: [email protected]
    Subject: Parquet File Naming Convention Standards
    
    
    All,
    
    Here is an example .parquet data set saved using pySpark, where the 
    following files are members of the directory “foo.parquet”:
    
    -rw-r--r--    1 sasbpb  r&d        8 Mar 26 12:10 ._SUCCESS.crc
    -rw-r--r--    1 sasbpb  r&d    25632 Mar 26 12:10 .part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
    -rw-r--r--    1 sasbpb  r&d    25356 Mar 26 12:10 .part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
    -rw-r--r--    1 sasbpb  r&d    26300 Mar 26 12:10 .part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
    -rw-r--r--    1 sasbpb  r&d    23728 Mar 26 12:10 .part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
    -rw-r--r--    1 sasbpb  r&d        0 Mar 26 12:10 _SUCCESS
    -rw-r--r--    1 sasbpb  r&d  3279617 Mar 26 12:10 part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
    -rw-r--r--    1 sasbpb  r&d  3244105 Mar 26 12:10 part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
    -rw-r--r--    1 sasbpb  r&d  3365039 Mar 26 12:10 part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
    -rw-r--r--    1 sasbpb  r&d  3035960 Mar 26 12:10 part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
    
    
    Questions:
    
      1.  Is this the “standard” for creating/saving a .parquet data set?
      2.  It appears that “b84abe50-a92b-4b2b-b011-30990891fb83” is a UUID.  Is 
          the format part-fileSeq#-UUID.parquet or 
          part-fileSeq#-UUID.parquet.crc an established convention?  Is this 
          documented somewhere?
      3.  Is there a C++ class to create the CRC?
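
For questions 2 and 3, a minimal Python sketch may help make the observed 
pattern concrete.  Everything in it is inferred from the listing above, not 
from a documented Spark or Hadoop specification: the regular expression only 
describes the four part names shown (including the trailing "-c000" piece), 
and the .crc size formula (an 8-byte header plus a 4-byte checksum per 
512-byte chunk of data) is an assumption that happens to reproduce every 
.crc size in the listing.  The names parse_part_name and expected_crc_size 
are hypothetical helpers.

```python
import re

# Pattern inferred from the listing: part-<5 digits>-<UUID>-c<digits>.parquet
PART_RE = re.compile(
    r"^part-(?P<seq>\d{5})-"
    r"(?P<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"-(?P<suffix>c\d+)\.parquet$"
)


def parse_part_name(name):
    """Return the seq/uuid/suffix pieces of a part file name, or None."""
    match = PART_RE.match(name)
    return match.groupdict() if match else None


def expected_crc_size(data_size, chunk=512, header=8, checksum=4):
    """Size of a sidecar .crc file under the assumed layout:
    a fixed header followed by one checksum per (possibly partial) chunk."""
    chunks = (data_size + chunk - 1) // chunk  # ceiling division
    return header + checksum * chunks


info = parse_part_name(
    "part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet"
)

# (data file size, .crc file size) pairs copied from the listing.
observed = [
    (0, 8),            # _SUCCESS
    (3279617, 25632),  # part-00000
    (3244105, 25356),  # part-00001
    (3365039, 26300),  # part-00002
    (3035960, 23728),  # part-00003
]
all_match = all(expected_crc_size(size) == crc for size, crc in observed)
```

If the assumed layout is right, the .crc files are per-chunk checksums written 
by the file system layer rather than part of the Parquet format itself.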
    
    
    Thanks,
    
    
    Brian
    
    
