Hi Jason and Tim,

Thanks for the detailed response.
The reason to split a large parquet file into small chunks is to be able
to store them in Ceph as RADOS (Ceph's distributed object store backend)
objects, where each object can't be larger than a few tens of MBs. The
reason to split the file in a way that makes each small chunk
self-contained in terms of full row groups is that we want to push filter
and projection operations down to the storage nodes, and inside a storage
node we can't read across objects (two row-group objects can live on
different storage nodes), which is otherwise possible when applying
projections and filters on the client side through a filesystem
abstraction. With row-group objects, I can use the statistics in the
footer metadata to map each column chunk offset to the RADOS object
holding the row group that contains that column chunk, and then read the
column chunk in storage-node memory by converting the file-wide column
chunk offset into an object-local (row-group-relative) offset, thus
preserving the parquet optimizations. I hope that gives you a brief
background on my issue.
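
To make that offset translation concrete, here is a minimal sketch of the
arithmetic I have in mind; the struct and function names below are
hypothetical and not part of any existing Arrow or Parquet API:

    // Hypothetical sketch: a single-row-group RADOS object begins at the
    // byte where its row group started in the original file, so the
    // object-local offset of a column chunk is simply the file-wide offset
    // minus the row group's starting byte.
    #include <cstdint>
    #include <string>

    struct ColumnChunkLocation {
      std::string object_id;  // RADOS object holding the row group
      int64_t object_offset;  // offset of the column chunk inside the object
    };

    ColumnChunkLocation MapToObject(int64_t column_chunk_file_offset,
                                    int64_t row_group_file_offset,
                                    const std::string& row_group_object_id) {
      return ColumnChunkLocation{
          row_group_object_id,
          column_chunk_file_offset - row_group_file_offset};
    }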

Also, since I am working in a Ceph environment (outside of the Hadoop
environment), the `parquet.block.size` parameter doesn't apply to me. So
I was wondering whether the `parquet::ParquetFileWriter` API in the Arrow
codebase already allows specifying a block size, so that it writes
padded, fixed-size row groups matching that block size while writing a
parquet file; such a file could then be chunked easily with a Linux
utility like `split`, for example. Or do I have to implement a custom
`ParquetWriter`, similar to what is present in `parquet-hadoop`, to do
the chunking and padding? With such an API, I could split a large parquet
file into well-aligned, fixed-size objects each containing a single row
group (analogous to a block in HDFS), store them in the Ceph object
store, and basically replicate the Hadoop + HDFS scenario on the
CephFS + RADOS stack, but with the added capability to push filters and
projections down to the storage layer.
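
In case no such option exists today, here is a rough sketch of the kind
of workaround I have in mind, using the Arrow C++
`parquet::arrow::WriteTable` helper to write each slice of a table as its
own single-row-group file; `rows_per_object` and the file-naming scheme
are my own assumptions, and padding each file up to a fixed object size
would still have to be handled separately:

    // Minimal sketch (not an existing ParquetFileWriter option): write one
    // single-row-group parquet file per slice of a table, so that every
    // resulting file/object is self-contained.
    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/writer.h>
    #include <memory>
    #include <string>

    arrow::Status WriteRowGroupObjects(
        const std::shared_ptr<arrow::Table>& table, int64_t rows_per_object,
        const std::string& prefix) {
      int64_t idx = 0;
      for (int64_t off = 0; off < table->num_rows(); off += rows_per_object) {
        // Each slice becomes its own file holding exactly one row group.
        std::shared_ptr<arrow::Table> slice =
            table->Slice(off, rows_per_object);
        std::string path = prefix + "." + std::to_string(idx++) + ".parquet";
        ARROW_ASSIGN_OR_RAISE(auto sink,
                              arrow::io::FileOutputStream::Open(path));
        // chunk_size == rows_per_object, so each slice is one row group.
        ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
            *slice, arrow::default_memory_pool(), sink, rows_per_object));
        ARROW_RETURN_NOT_OK(sink->Close());
      }
      return arrow::Status::OK();
    }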

On Thu, Dec 31, 2020 at 8:28 AM Tim Armstrong
<[email protected]> wrote:

> It seems like you would be best off writing out N separate parquet files of
> the desired size. That seems better than having N files with one row group
> each and a shared footer that you have to stitch together to read. I guess
> there would be a small amount of redundancy between footer contents, but
> that wouldn't count for much in the scheme of things. If you have partial
> parquet files without a footer, you lose the self-describing/self-contained
> nature of Parquet files, like Jason said.
>
> I guess I'm not sure if parquet-mr or whatever you're using to write
> parquet has an option to start a new file at each row group boundary, but
> that seems like it would probably solve your problem.
>
>
>
>
>
> On Wed, Dec 30, 2020 at 1:09 PM Jason Altekruse <[email protected]>
> wrote:
>
> > Hi Jayjeet,
> >
> > Is there a particular reason that you need to spread out data into
> > multiple small files? On HDFS at least there are longstanding
> > scalability issues with having lots of smaller files around, so there
> > generally is a move to concatenate together smaller files. Even with
> > larger
> > files, the various common querying mechanisms, Spark, MapReduce, Hive,
> > Impala, etc. will all allow parallelizing reads by blocks, which when
> > configured properly should correspond to parquet row groups.
> >
> > The size of a row group is fixed by the setting of parquet.block.size.
> > You
> > mentioned alignment, and pretty early on a padding feature was added to
> > parquet to ensure that row groups would try to end on the true HDFS block
> > boundaries, to avoid the need to read across blocks when accessing a row
> > group (because row groups have to contain full rows, it is unlikely you
> > will end with exactly the right number of bytes in the row group to match
> > the end of the HDFS block).
> >
> > https://github.com/apache/parquet-mr/pull/211
> >
> > So to your specific proposal, it currently isn't possible to detach the
> > footer from the file that contains the actual data in the row groups,
> > but I
> > think that is a good property, it means everything for that data to be
> > read
> > is fully contained in one file that can be moved/renamed safely.
> >
> > There are some systems that elect to write only a single row group per
> > file, because HDFS doesn't allow rewriting data in place. Doing this
> > enables use cases where individual rows need to be deleted or updated to
> > be
> > accomplished by re-writing smaller files, instead of needing to read in
> > and
> > write back out a large file containing many row groups when only a single
> > row group's data has changed.
> >
> > - Jason
> >
> >
> > On Wed, Dec 30, 2020 at 2:33 PM Jayjeet Chakraborty <
> > [email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I am trying to figure out if a large Parquet file can be striped across
> > > multiple small files based on a Row group chunk size where each stripe
> > > would naturally end up containing data pages from a single row group.
> > > So, if I say my writer "write a parquet file in chunks of 128 MB"
> > > (assuming my row groups are of around 128 MB), each of my chunks ends
> > > up being a self-contained row group, maybe except the last chunk which
> > > has the footer contents. Is this possible? Can we fix the row group
> > > size (the amount of disk space a row group uses) while writing parquet
> > > files? Thanks a lot.
> > >
> >
>


-- 
*Jayjeet Chakraborty*
4th Year, B.Tech, Undergraduate
Department Of Computer Sc. And Engineering
National Institute Of Technology, Durgapur
PIN: 713205, West Bengal, India
M: (+91) 8436500886
