Here's my comment on how I'm generating 128 MB Parquet files. It takes into account file sizes after compression and dictionary encoding:
https://issues.apache.org/jira/browse/ARROW-3728?focusedCommentId=16703544&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16703544

It would be nice to have a merge() function for Parquet files that does something similar, producing Parquet files that match HDFS block sizes.

-----Original Message-----
From: Jiayuan Chen <[email protected]>
Sent: Monday, December 10, 2018 2:30 PM
To: [email protected]
Subject: parquet-arrow estimate file size

Hello,

I am a Parquet developer in the Bay Area, and I am writing to ask for help with writing Parquet files from Arrow.

My goal is to control the size (in bytes) of the output Parquet file when writing from an existing Arrow table. I saw a 2017 reply on this StackOverflow post ( https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_questions_45572962_how-2Dcan-2Di-2Dwrite-2Dstreaming-2Drow-2Doriented-2Ddata-2Dusing-2Dparquet-2Dcpp-2Dwithout-2Dbuffering&d=DwIBaQ&c=zUO0BtkCe66yJvAZ4cAvZg&r=SpeiLeBTifecUrj1SErsTRw4nAqzMxT043sp_gndNeI&m=Xc94mwZKuRfKH1rBeBcZvo7wtImfqsvAjDalN4JxsOA&s=209MSzgWa7GsPhLJgGsYhcHCoTC59R4ksjIOYqklNPs&e=) and am wondering whether the following implementation is currently possible: feed data into the Arrow table until the buffered data can be converted to a Parquet file of a target size (e.g. 256 MB, instead of a fixed number of rows), and then use WriteTable() to create that Parquet file.

I saw that parquet-cpp recently introduced an API to control the column writer's size in bytes in the low-level API, but it seems this is not yet available in the arrow-parquet API. Is this on the roadmap?

Thanks,
Jiayuan
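For readers following the thread, here is a minimal sketch of one way to aim for a target file size. It is an assumption for illustration, not necessarily the approach in the JIRA comment linked above. It assumes you have already written a small sample of the data to a Parquet file with your real compression and encoding settings and measured the result on disk; the helper names rows_per_file and plan_files are hypothetical and are not part of parquet-cpp or Arrow.

```python
def rows_per_file(sample_rows: int, sample_file_bytes: int, target_bytes: int) -> int:
    """Estimate how many rows fit in one file of about target_bytes.

    sample_file_bytes is the on-disk size of a sample Parquet file, so it
    already reflects compression and dictionary encoding (assumption: the
    sample is representative of the rest of the data).
    """
    if sample_rows <= 0 or sample_file_bytes <= 0 or target_bytes <= 0:
        raise ValueError("all arguments must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(1, target_bytes * sample_rows // sample_file_bytes)


def plan_files(total_rows: int, rows_each: int) -> list[tuple[int, int]]:
    """Split total_rows into half-open (start, stop) row ranges, one per output file."""
    return [
        (start, min(start + rows_each, total_rows))
        for start in range(0, total_rows, rows_each)
    ]


# Example: a 100,000-row sample compressed to 8 MiB; target 128 MiB files.
n = rows_per_file(100_000, 8 * 1024 * 1024, 128 * 1024 * 1024)
ranges = plan_files(3_500_000, n)
```

Each (start, stop) range would then be sliced from the table and written as its own file. Because compression ratios vary across the data, the resulting sizes are only approximate; re-measuring after each file and adjusting the row count is one way to tighten the estimate.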
